WO2021103973A1 - Sensitive gene discovery method, device and storage medium - Google Patents


Info

Publication number
WO2021103973A1
WO2021103973A1 PCT/CN2020/126658 CN2020126658W WO2021103973A1 WO 2021103973 A1 WO2021103973 A1 WO 2021103973A1 CN 2020126658 W CN2020126658 W CN 2020126658W WO 2021103973 A1 WO2021103973 A1 WO 2021103973A1
Authority
WO
WIPO (PCT)
Prior art keywords
gene
patient
data
model
training
Prior art date
Application number
PCT/CN2020/126658
Other languages
French (fr)
Chinese (zh)
Inventor
汤在祥
顾金成
曹建平
杨巍
聂继华
焦旸
Original Assignee
苏州大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州大学 filed Critical 苏州大学
Publication of WO2021103973A1 publication Critical patent/WO2021103973A1/en

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Definitions

  • This application relates to a method, device and storage medium for discovering sensitive genes, and belongs to the field of biotechnology.
  • High-throughput sequencing technology, also known as "next-generation" sequencing technology, is characterized by the ability to sequence hundreds of thousands to millions of DNA molecules at a time, with relatively short read lengths. The rapid development of high-throughput sequencing technology has made it possible to predict the risk of disease.
  • Radiotherapy is a local treatment method that uses radiation of different energy to treat tumors. Its role and position in tumor treatment have become increasingly prominent, and it has become the main method for treating malignant tumors.
  • This application provides a sensitive gene discovery method, device and storage medium, which can solve the problem that genes that are sensitive to the target treatment cannot be determined, which leads to the inability to determine whether the patient is sensitive to the target treatment, and the treatment effect may be poor.
  • This application provides the following technical solutions:
  • a method for discovering sensitive genes includes:
  • the patient data set including at least two sets of patient data, each set of patient data includes multiple gene expression data, clinical data, and survival data of the corresponding patient, and the clinical data includes whether a target treatment method is adopted;
  • the model parameters of the gene discovery model including the target treatment means effect parameter, the gene effect parameter of each gene expression data, and the interaction effect parameter between the target treatment means and each gene expression data, the gene discovery model being used to predict genes for which the patient is sensitive to the target treatment;
  • the patient data set is used to fit the model parameters of the gene discovery model to determine genes that are sensitive to the target treatment method in the multiple gene expression data.
  • using the patient data set to fit the model parameters of the gene discovery model based on the K-Fold cross-validation algorithm, to determine genes that are sensitive to the target treatment method in the multiple gene expression data, includes:
  • the patient data set is divided into N first data sets, where the u-th first data set is the verification set for the u-th training and the other first data sets are the training set for the u-th training; u takes each integer from 1 to N in turn, and N is an integer greater than 1;
  • for each training, the corresponding training set is used to train the single-gene discovery model, again based on the K-Fold cross-validation algorithm, to obtain the trained single-gene discovery model and the interaction significance probability of each gene;
  • the training decision threshold is determined based on the trained single-gene discovery model and the number of training genes input to the gene discovery model.
  • the training decision threshold is used to determine whether the patient in the corresponding verification set is the first patient sensitive to the target treatment method.
  • the k genes of each first patient are determined as genes sensitive to the target treatment; the k genes are determined from the genes that were input to the gene discovery model for the first patient.
  • the method further includes:
  • the corresponding training set is used to train the single-gene discovery model again based on the K-Fold cross-validation algorithm to obtain the trained single-gene discovery model,
  • the significance probability of the interaction of each gene includes:
  • the single-gene discovery model is a gene discovery model corresponding to each gene
  • the parameters after fitting include the target treatment means effect parameter corresponding to each gene, the gene effect parameter, the interaction effect parameter between the target treatment means and the gene expression data of the gene, and the significance probability of the interaction;
  • the p-th second data set is the verification set when the second data set is used for the p-th time fitting, and the other second data sets are the training set when the second data set is used for the p-th time fitting.
  • v is a positive integer less than or equal to the N
  • the p is an integer from 1 to M in sequence
  • the M is an integer greater than 1.
  • the determining the training determination threshold based on the trained single-gene discovery model and the number of training genes input to the gene discovery model include:
  • the gene discovery model includes the trained single-gene discovery model corresponding to each gene
  • the output result of the model is compared with the judgment threshold in the parameter combination to determine the first patient who is sensitive to the target treatment method and the second patient who is not sensitive to the target treatment method in the corresponding training set ;
  • the value of k is the same as the number of training genes obtained when the first data set is used for training for the first time.
  • the method further includes:
  • the genes whose interaction effect parameter is greater than or equal to 0 and the interaction significance probability is greater than a preset significance value are screened out.
  • the gene discovery model is a proportional hazard regression model or a logistic model.
  • a sensitive gene discovery device comprising:
  • the data acquisition module is used to acquire a patient data set, the patient data set includes at least two sets of patient data, each set of patient data includes multiple gene expression data, clinical data, and survival data of the corresponding patient, and the clinical data includes whether the target treatment means is used;
  • the model acquisition module is used to acquire a gene discovery model.
  • the model parameters of the gene discovery model include the target treatment method effect parameter, the gene effect parameter of each gene expression data, and the interaction effect parameter between the target treatment method and each gene expression data ,
  • the gene discovery model is used to predict genes that the patient is sensitive to the target treatment method;
  • the gene discovery module is configured to use the patient data set to fit the model parameters of the gene discovery model based on the K-Fold cross-validation algorithm to determine genes that are sensitive to the target treatment method in the multiple gene expression data.
  • in a third aspect, a sensitive gene discovery device is provided; the device includes a processor and a memory; the memory stores a program, and the program is loaded and executed by the processor to realize the sensitive gene discovery method described in the first aspect.
  • a computer-readable storage medium is provided, and a program is stored in the storage medium, and the program is loaded and executed by the processor to realize the sensitive gene discovery method described in the first aspect.
  • the beneficial effect of this application is that by acquiring a patient data set, the patient data set includes at least two sets of patient data, each set of patient data includes multiple gene expression data, clinical data, and survival data of the corresponding patient, and the clinical data includes whether the target treatment is adopted
  • the model parameters of the gene discovery model include the target treatment method effect parameter, the gene effect parameter of each gene expression data, and the interaction effect parameter between the target treatment method and each gene expression data.
  • the gene discovery model is used to predict the genes for which patients are sensitive to the target treatment; based on the K-Fold cross-validation algorithm, the patient data set is used to fit the model parameters of the gene discovery model to determine the genes that are sensitive to the target treatment in the multiple gene expression data. This solves the problem that genes sensitive to the target treatment cannot be determined, which makes it impossible to determine whether a patient is sensitive to the target treatment and may lead to a poor treatment effect. Sensitive genes with statistically significant interaction effects with the target treatment can be screened out to construct a gene signature, and the gene signature can be used to accurately predict patients who are sensitive to the target treatment.
  • Fig. 1 is a flowchart of a method for discovering sensitive genes provided by an embodiment of the present application
  • FIG. 2 is a schematic diagram of determining the number of training genes and the training judgment threshold provided by an embodiment of the present application
  • Fig. 3 is a flowchart of a method for discovering sensitive genes provided by another embodiment of the present application.
  • Figure 4 is a block diagram of a sensitive gene discovery device provided by an embodiment of the present application.
  • Fig. 5 is a block diagram of a sensitive gene discovery device provided by an embodiment of the present application.
  • Sensitivity to target treatment means: the characteristic that patients obtain a better survival benefit when treated with the target treatment means.
  • Radiosensitivity Radiosensitivity
  • Radiotherapy: one of the treatment methods for malignant tumors, in which the tumor is irradiated with rays of different energies to inhibit and kill cancer cells.
  • Gene signature: a collection of genes, usually related to disease diagnosis and prognosis, that can be used to gain a deeper understanding of diseases from the perspective of genetic factors.
  • Cross-validation: mainly used in modeling applications. Most of the given modeling samples are selected as the training set and the remainder is used as the validation set. The training set is used to train the classifier, and the validation set is then used to test the trained model in order to evaluate the classifier.
  • Hazard ratio: the ratio of the hazard rate at the exposed level of a variable to the hazard rate at the unexposed level. It reflects not only whether an event occurred but also the time taken for the event to occur.
  • Precision medicine: tailoring health care and clinical decision-making to the patient on the basis of the patient's internal biological information and clinical symptoms and signs.
  • Probability of interaction significance: the P value of the test in which the null hypothesis (the interaction has no significant impact on survival) is rejected and the alternative hypothesis (the interaction has a significant impact on survival) is accepted; that is, the probability that concluding that the interaction has a significant impact on survival is wrong.
  • the higher the significance of the interaction the smaller the P value of the significance probability of the interaction.
  • the smaller the P value of the significance probability of the interaction the lower the possibility that this approach is wrong.
  • the interaction terms between gene expression data and the target treatment are added to the model, and the significance probability of each effect including the interaction terms can be obtained.
  • Proportional hazards model: also known as the Cox regression model, a semiparametric regression model proposed by the British statistician D. R. Cox in 1972. It can be used to describe the impact of multiple time-invariant characteristics on the mortality rate at a given moment, and is an important model in survival analysis.
  • Logistic regression model: also known as logistic regression analysis, mainly used in epidemiology. For example, to explore the risk factors of gastric cancer, two groups of people can be chosen, a gastric cancer group and a non-gastric cancer group; the two groups will have different physical signs and lifestyles.
  • Survival curve: a curve formed by plotting follow-up time on the horizontal axis and survival rate on the vertical axis and connecting the points.
  • Log-rank test: used to compare survival data between groups. The test statistic is a chi-square statistic.
  • this application takes an electronic device as an example for the execution of each embodiment.
  • the electronic device may be a terminal device such as a computer, a tablet, or a mobile phone; alternatively, it may also be a server, etc.
  • This embodiment does not limit the type of the electronic device.
  • Fig. 1 is a flowchart of a method for discovering sensitive genes provided by an embodiment of the present application. The method includes at least the following steps:
  • Step 101 Obtain a patient data set.
  • the patient data set includes at least two sets of patient data.
  • Each set of patient data includes multiple gene expression data, clinical data, and survival data of the corresponding patient, and the clinical data includes whether a target treatment method is adopted.
  • the target treatment method may be radiotherapy; of course, it may also be other gene-related treatment methods, which is not limited in this embodiment.
  • Gene expression data reflects the abundance of gene transcript mRNA in cells measured directly or indirectly. These data can be used to analyze which genes have changed their expression, what are the correlations between genes, and how the activities of genes are affected under different conditions. They have important applications in medical clinical diagnosis, drug efficacy judgment, and revealing the mechanism of disease.
  • the main methods for high-throughput detection of genomic mRNA abundance are cDNA microarrays and oligonucleotide chips. With the development of these high-throughput detection technologies, the mRNA transcription products of genes can be detected quantitatively or qualitatively at the whole-genome level.
  • the gene expression data in each group of patient data includes a two-dimensional data set of the patients' standardized multi-gene mRNA expression values; the row variables of the two-dimensional data set represent different patients (Table 1 takes 371 patients as an example), and the column variables represent the gene expression values of different genes (Table 1 takes the gene expression values of 14343 genes as an example).
  • Clinical data is used to indicate the patient's clinical performance.
  • the clinical data includes whether the target treatment is used or not.
  • clinical data also includes, but is not limited to: the patient's age at diagnosis for a certain disease, histological classification, pathological classification, and/or medication use.
  • the clinical data in each group of patient data shown in Table 2 includes a two-dimensional clinical data set; the row variables of the two-dimensional clinical data set represent different patients (Table 2 takes 371 patients as an example), and the column variables represent different types of clinical data (Table 2 takes 5 types of clinical data as an example).
  • For different types of diseases, the corresponding histological classifications are different. For example, for thyroid adenoma, the histological classification includes follicular adenoma (including simple adenoma and eosinophilic adenoma) and papillary adenoma, as well as medullary tumors and undifferentiated tumors; for lung cancer, it includes squamous cell carcinoma, adenocarcinoma, small cell carcinoma, and large cell carcinoma. This embodiment does not limit the histological classification. For different types of diseases, the corresponding pathological classifications are also different.
  • For example, the pathological classification of cirrhosis includes four types: small nodular cirrhosis, large nodular cirrhosis, mixed large and small nodular cirrhosis, and incomplete septal cirrhosis; the pathological classification of fatty liver includes simple fatty liver, steatohepatitis, fatty liver fibrosis, and fatty liver cirrhosis. This embodiment does not limit the pathological classification.
  • Survival data is used to indicate the survival of patients. Survival data includes the patient's survival outcome and the length of time that has elapsed. Among them, the survival outcome can be a survival risk value or a binary value (that is, survival or death). Based on the number of patients in Table 1 and Table 2, there are a total of 371 corresponding survival outcomes.
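  • As an illustration of the three kinds of data described for Step 101, the following minimal sketch groups them into one record per patient. The class name, field names, and the `validate_data_set` helper are assumptions introduced here for illustration; they do not appear in the application.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PatientRecord:
    """One set of patient data (all field names are illustrative assumptions)."""
    gene_expression: Dict[str, float]  # standardized mRNA expression value per gene
    treated: int                       # 1 if the target treatment means was used, else 0
    clinical: Dict[str, str]           # other clinical data, e.g. age at diagnosis
    survival_time: float               # length of time that has elapsed
    event: int                         # survival outcome: 1 = death, 0 = survival


def validate_data_set(records: List[PatientRecord]) -> bool:
    """Check the minimum requirements stated for the patient data set."""
    if len(records) < 2:
        raise ValueError("the patient data set needs at least two sets of patient data")
    for rec in records:
        if rec.treated not in (0, 1):
            raise ValueError("treated must be 0 or 1")
        if not rec.gene_expression:
            raise ValueError("each set of patient data needs gene expression data")
    return True
```

  • A binary `event` field is used here; the application also allows the survival outcome to be a survival risk value.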
  • Step 102 Obtain a gene discovery model.
  • the model parameters of the gene discovery model include the target treatment method effect parameter, the gene effect parameter of each gene expression data, and the interaction effect parameter between the target treatment method and each gene expression data, and the gene discovery model is used to predict genes for which the patient is sensitive to the target treatment.
  • the gene discovery model is a proportional hazard regression model or a logistic model.
  • when the gene discovery model is a proportional hazard regression model, the model output is the survival risk value; when the gene discovery model is a logistic model, the model output is a binary value.
  • in the proportional hazards form, the gene discovery model can be written as $h(t) = h_0(t)\exp\left(r\beta + \sum_{j=1}^{s} x_j b_j + \sum_{j=1}^{s} r x_j i_j\right)$, where:
  • $h_0(t)$ is the baseline hazard function;
  • $\beta$ is the effect parameter of the target treatment method;
  • $r$ indicates whether the target treatment method is received: $r = 1$ if it is received and $r = 0$ if it is not;
  • $x_1, x_2, \ldots, x_s$ are the gene expression data corresponding to each gene;
  • $b_1, b_2, \ldots, b_s$ are the gene effect parameters of each gene;
  • $i_1, i_2, \ldots, i_s$ are the interaction effect parameters between the target treatment method and each gene expression data, which reflect how the effect of the target treatment on survival is influenced by the expression level of the corresponding gene.
  • For gene $j$, the hazard ratio (HR) is $\exp(r\beta + x_j b_j + r x_j i_j)$. If the interaction effect parameter is negative, the HR may be less than 1. In that case, the survival rate of patients who are sensitive to the target treatment is higher than the survival rate of patients who do not receive the target treatment. If some patients have genes that are sensitive to the target treatment, their total hazard ratio (nHR) will tend to be very small, below the preset threshold. These patients, who have a relatively high expected survival rate, are the first patients who are sensitive to the target treatment, and the subset of gene expression data input into the gene discovery model is the gene expression data of the sensitive genes that are sensitive to the target treatment.
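  • The hazard ratio expression above can be evaluated directly. The sketch below sums the treatment term, the gene terms, and the interaction terms over the genes supplied; the function name and the parameter name `beta` for the treatment effect parameter are assumptions made for illustration.

```python
import math
from typing import Sequence


def hazard_ratio(r: int, beta: float, x: Sequence[float],
                 b: Sequence[float], i: Sequence[float]) -> float:
    """Return exp(r*beta + sum_j x_j*b_j + sum_j r*x_j*i_j) for one patient.

    r    : 1 if the target treatment was received, 0 otherwise
    beta : target treatment effect parameter (name assumed here)
    x    : gene expression values
    b    : gene effect parameters
    i    : interaction effect parameters
    """
    if not (len(x) == len(b) == len(i)):
        raise ValueError("x, b and i must have the same length")
    linear = r * beta
    for x_j, b_j, i_j in zip(x, b, i):
        linear += x_j * b_j + r * x_j * i_j
    return math.exp(linear)
```

  • With a negative interaction effect parameter and positive expression, the treated value is lower than the untreated one, which matches the reasoning that a negative interaction effect may push the HR below 1.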
  • the gene discovery model can be trained to obtain sensitive genes that are sensitive to the target treatment.
  • Step 103 Use the patient data set to fit the model parameters of the gene discovery model based on the K-Fold cross-validation algorithm, so as to determine genes that are sensitive to the target treatment method in the multiple gene expression data.
  • the electronic device can use all the gene expression data of the patient to fit the model parameters of the gene discovery model.
  • the electronic device can perform a preliminary screening of all gene expression data of the patient to screen out genes that are obviously insensitive to the target treatment.
  • the gene expression data in the patient data are input into the single-gene discovery model one gene at a time; for each gene expression data, the survival data of the patient data is used as the output result of the single-gene discovery model to determine the interaction effect parameter and the interaction significance probability corresponding to that gene expression data. Genes whose interaction effect parameter is greater than or equal to 0 and whose interaction significance probability is greater than the preset significance value are screened out.
  • the single-gene discovery model refers to the model corresponding to each gene, and the type of its model parameter is the same as that of the gene discovery model.
  • the effect parameter of the target treatment in the gene discovery model is the average of the target treatment effect parameters of the single-gene discovery models;
  • the gene effect parameter of each gene in the gene discovery model is the gene effect parameter of the single-gene discovery model corresponding to that gene;
  • the interaction effect parameter of each gene in the gene discovery model is the interaction effect parameter of the single-gene discovery model corresponding to that gene.
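  • The way the gene discovery model parameters are assembled from the single-gene discovery models can be sketched as follows. The dictionary layout and the key names `treatment`, `gene`, and `interaction` are assumptions chosen for this illustration.

```python
from statistics import mean
from typing import Dict


def combine_single_gene_models(single_fits: Dict[str, Dict[str, float]]) -> dict:
    """Assemble gene discovery model parameters from single-gene fits.

    single_fits maps a gene name to a dict with the (assumed) keys
    'treatment', 'gene' and 'interaction' holding that model's estimates.
    """
    if not single_fits:
        raise ValueError("at least one single-gene model is required")
    return {
        # treatment effect: average over all single-gene models
        "treatment": mean(fit["treatment"] for fit in single_fits.values()),
        # gene and interaction effects: taken per gene, unchanged
        "gene": {g: fit["gene"] for g, fit in single_fits.items()},
        "interaction": {g: fit["interaction"] for g, fit in single_fits.items()},
    }
```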
  • the electronic device multiplies each patient's gene expression data by the indicator of whether the target treatment method is used, to obtain the interaction term between the target treatment method and each gene expression data; for each gene, it constructs a single-gene discovery model relating survival data to the target treatment means, the gene expression data, and the interaction term between the target treatment means and that gene expression data.
  • Take the gene expression data in Table 1 as an example: since there are 14343 gene expression data, 14343 single-gene discovery models need to be constructed. Each single-gene discovery model yields the effect parameters, the standard errors of the effect parameters, and the interaction significance probability.
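  • The construction of the interaction terms can be sketched as a simple product of each expression value with the treatment indicator. The helper names below are hypothetical, and the model fitting itself (for example a Cox regression per gene) is left out.

```python
from typing import Dict, Tuple


def interaction_terms(expression_row: Dict[str, float], treated: int) -> Dict[str, float]:
    """Multiply each gene expression value by the treatment indicator (0 or 1)."""
    return {gene: value * treated for gene, value in expression_row.items()}


def single_gene_covariates(expression_row: Dict[str, float], treated: int,
                           gene: str) -> Tuple[int, float, float]:
    """Covariates of one patient for the single-gene model of `gene`:
    (treatment indicator, gene expression, treatment x expression)."""
    x = expression_row[gene]
    return (treated, x, treated * x)
```

  • Building one covariate triple per patient and per gene gives the input for each of the single-gene discovery models, one model per gene.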
  • the preliminary screening of sensitive genes that are sensitive to the target treatment includes: determining whether the interaction effect parameter is less than 0 and whether the interaction significance probability of the interaction term is less than the preset significance value (such as 0.05); genes meeting both conditions are retained as candidate sensitive genes. That is, the genes whose interaction effect parameter is greater than or equal to 0 and whose interaction significance probability is greater than the preset significance value are screened out.
  • the preset significant value is described by taking 0.05 as an example. In actual implementation, the preset significant value may also be other values, and this embodiment does not limit the value of the preset significant value.
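  • A minimal sketch of the preliminary screening follows. It implements the retention rule as stated (interaction effect parameter below 0 and interaction significance probability below the preset significance value); the function name and input layout are assumptions, and a probability exactly equal to the preset value is dropped here as a further assumption.

```python
from typing import Dict, List, Tuple


def screen_genes(fits: Dict[str, Tuple[float, float]], alpha: float = 0.05) -> List[str]:
    """Keep genes with a negative interaction effect and a significant interaction.

    fits maps gene name -> (interaction effect parameter, interaction P value).
    """
    return [gene for gene, (effect, p_value) in fits.items()
            if effect < 0 and p_value < alpha]
```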
  • the patient data set is used to fit the model parameters of the gene discovery model to determine the genes that are sensitive to the target treatment in multiple gene expression data, including:
  • the combined gene discovery model is used to calculate the nHR value; in the second training, one of the 9 parts of patient data used for training the first time is used as the validation set, the 37 patient data used as the validation set the first time are added to the training set, and the gene discovery model is re-fitted to calculate the nHR value.
  • a total of 10 trainings and 10 exchanges of validation sets cover all patient data, resulting in a complete nHR value of 334 patients.
  • N may also take another value, and this embodiment does not limit the value of N.
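  • The outer K-Fold division can be sketched as below, assuming patients are identified by integer indices and shuffled with a fixed seed; both choices, and the function name, are assumptions made for illustration.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_splits(n_samples: int, n_folds: int,
                  seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (training indices, validation indices) for each of the N trainings."""
    if n_folds < 2 or n_folds > n_samples:
        raise ValueError("n_folds must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # fold u takes every n_folds-th shuffled index, starting at position u
    folds = [indices[u::n_folds] for u in range(n_folds)]
    for u in range(n_folds):
        validation = folds[u]
        training = [idx for v, fold in enumerate(folds) if v != u for idx in fold]
        yield training, validation
```

  • For 371 patients and N = 10 this sketch produces validation sets of 37 or 38 patients, since 371 is not divisible by 10, and every patient appears in exactly one validation set.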
  • the K-Fold cross-validation algorithm is nested again during each cross-validation training process.
  • the training set corresponding to the vth time is divided into M second data sets; based on the K-Fold cross-validation algorithm, the second data set is used to perform M parameter fitting on the preset single-gene discovery model to obtain Parameters after fitting.
  • the parameters after fitting include the target treatment effect parameter corresponding to each gene, the gene effect parameter, the interaction effect parameter between the target treatment method and the gene expression data of the gene, and the significance probability of the interaction;
  • the p-th second data set is the verification set when the second data sets are used for the p-th fitting, the other second data sets are the training set for the p-th fitting, and v is a positive integer less than or equal to N
  • p is an integer from 1 to M in sequence
  • M is an integer greater than 1.
  • the size of the M second data sets is the same or different.
  • the purpose of internal nesting training is to find the optimal parameter combination of the number of genes and the determination threshold.
  • each training uses 9 parts of patient data for training and one part of patient data for verification, but the amount of patient data differs: the nine parts contain 301 patient data, and the one part contains 33 patient data.
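  • The nesting of a second K-Fold division inside each outer training set can be sketched as follows. The helper names are hypothetical and no shuffling is applied, which is a simplification.

```python
from typing import Iterator, List, Tuple


def split_folds(items: List[int], n_folds: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (training part, held-out part) for each fold of `items`."""
    folds = [items[u::n_folds] for u in range(n_folds)]
    for u in range(n_folds):
        rest = [x for v, fold in enumerate(folds) if v != u for x in fold]
        yield rest, folds[u]


def nested_splits(patient_ids: List[int], n_outer: int, m_inner: int):
    """For each outer training set, also produce its M inner (train, validation) splits."""
    for outer_train, outer_val in split_folds(patient_ids, n_outer):
        inner = list(split_folds(outer_train, m_inner))
        yield outer_train, outer_val, inner
```

  • The outer validation set never enters the inner splits, so the number of genes and the determination threshold are chosen without seeing the patients on whom they are later applied.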
  • a single-gene discovery model is constructed relating survival data to the target treatment method, the gene expression data, and the interaction term between the target treatment method and each gene expression data. The genes with a negative interaction effect are selected and arranged in descending order of the significance probability of the interaction.
  • the training decision threshold is used to determine whether the patient in the corresponding verification set is the first patient sensitive to the target treatment.
  • the parameter combination corresponding to the minimum value of the log-rank test result is obtained
  • the range of the number of genes is [2, 150]
  • the range of the threshold value of the judgment threshold is [0.01, 0.5].
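  • The candidate parameter combinations over the stated ranges can be enumerated as in the sketch below. The step of 0.01 for the judgment threshold is an assumption; the application gives only the two ranges.

```python
from itertools import product
from typing import List, Tuple


def parameter_grid(gene_min: int = 2, gene_max: int = 150,
                   thr_min: float = 0.01, thr_max: float = 0.5,
                   thr_step: float = 0.01) -> List[Tuple[int, float]]:
    """All (number of genes, judgment threshold) combinations in the given ranges."""
    n_thresholds = int(round((thr_max - thr_min) / thr_step)) + 1
    # rounding keeps the grid values free of floating-point drift
    thresholds = [round(thr_min + s * thr_step, 10) for s in range(n_thresholds)]
    gene_counts = range(gene_min, gene_max + 1)
    return list(product(gene_counts, thresholds))
```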
  • the effect parameters of each single gene discovery model are obtained. Apply each effect parameter to the gene discovery model, and calculate the nHR value of the patient under various parameter combinations in this outermost training (that is, using the first data set for training). According to the size between the nHR and the corresponding judgment threshold, it is judged whether the patient is the first patient who is sensitive to the target treatment method or the second patient who is not sensitive to the target treatment method.
  • among the first patients, the survival curves of the patients who used the target treatment method and the patients who did not use the target treatment method are compared using the log-rank test.
  • the number of genes and the determination threshold in the parameter combination with the smallest log-rank test result are taken as optimal. Referring to the number of training genes and the training determination threshold automatically selected by the electronic device shown in FIG. 2, the number of training genes is 33, and the training determination threshold is 0.01.
  • the model output HR(nHR) of the first g genes of the patient is calculated in the validation set of nested training as follows:
  • where the effect parameter of the target treatment is the average of the values estimated by the g (33 in Figure 2) single-gene models. If the nHR is less than 0.01, the patient is classified as the first patient who is sensitive to the target treatment; if the nHR is greater than or equal to 0.01, the patient is classified as the second patient who is not sensitive to the target treatment. Since each patient will appear once in the verification set, all patients will be classified as the first patient or the second patient after the training process of steps 1 to 4.
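  • The classification rule based on the training judgment threshold can be sketched as follows; the function name and the dictionary input are assumptions, and the nHR values are taken as already computed.

```python
from typing import Dict, List, Tuple


def classify_patients(nhr_by_patient: Dict[str, float],
                      threshold: float = 0.01) -> Tuple[List[str], List[str]]:
    """Split patients into first (sensitive) and second (not sensitive) patients.

    A patient is a first patient when nHR < threshold, otherwise a second patient.
    """
    first, second = [], []
    for patient_id, nhr in nhr_by_patient.items():
        (first if nhr < threshold else second).append(patient_id)
    return first, second
```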
  • for each first patient obtained after training with the first data sets N times, the k genes of that first patient are determined as genes that are sensitive to the target treatment; the k genes are determined from the genes that were input to the gene discovery model for the first patient.
  • among the first patients, the survival curve of the patients treated with the target treatment method is compared with the survival curve of the patients not treated with the target treatment method.
  • the process ends when the log-rank test result is greater than or equal to the preset significance level.
  • the survival curves of patients who received the target treatment and those who did not are compared at a specified significance level (i.e., the preset significance level) through a log-rank test. If survival is significantly improved (that is, the log-rank test result is less than the significance level), the target treatment method is beneficial to the first patients, which means that the gene signature is effective and the prediction of the first patients is accurate.
  • This example predicts 371 patients, of which 174 belong to the first patient and 197 belong to the second patient.
  • the preset significance level can be 0.05, of course, it can also be other values, and this embodiment does not limit the value of the preset significance level.
  • the value of k is the same as the number of training genes obtained when the first data set is used for training for the first time.
  • the value of k can also be the average value of the number of training genes obtained when the first data set is used for training N times. This embodiment does not limit the value of k.
  • Step 31: Divide the patient data set into N parts, of which N-1 parts form the training set and 1 part forms the validation set;
  • Step 32: Divide the N-1 training parts into M parts, of which M-1 parts form the training set and 1 part forms the validation set;
  • Step 33: Use the M-1 parts as the training set to construct single-gene discovery models from the survival data, the target treatment, the gene expression data, and the interaction terms between the target treatment and each gene expression data, and obtain the fitted parameters;
  • the fitted parameters include, for each gene, the target treatment effect parameter, the gene effect parameter, the interaction effect parameter between the target treatment and the gene expression data of the gene, and the interaction significance probability;
  • Step 34: Use the validation set obtained in step 32 to verify the model output result of the gene discovery model;
  • Step 35: Determine whether the current number of training rounds has reached M; if yes, go to step 36; if not, return to step 32;
  • Step 36: Use the training set obtained in step 31 to calculate the model output result of the gene discovery model;
  • Step 37: Divide the patients into first patients and second patients under each parameter combination of gene number g and determination threshold R; perform a log-rank test on the first patients to compare the survival difference between first patients who received the target treatment and first patients who did not; determine the number of training genes g and the training determination threshold R corresponding to the minimum log-rank test result;
  • Step 38: Use the validation set obtained in step 31, together with the number of training genes g and the training determination threshold R determined in step 37, to recalculate the model output result of the gene discovery model;
  • Step 39: Determine whether the current number of training rounds has reached N; if yes, go to step 40; if not, return to step 31;
  • Step 40: Compare the model output results obtained after the N rounds of training with the corresponding training determination threshold, and output the gene signature, which includes g genes, and each first patient;
  • the training determination threshold may be the one determined in the first outer round of training.
  • Step 41: Perform a log-rank test on the first patients to compare the survival difference between patients who received the target treatment and those who did not; if the difference is significant, the target treatment is beneficial to the first patients, that is, the gene signature is effective and the prediction of the first patients is accurate.
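The nested splitting in steps 31 to 39 can be sketched as follows. This is only an illustration of the outer (N) and inner (M) partitioning; model fitting and scoring are left out, the helper names `k_fold_indices` and `nested_k_fold` are hypothetical, and the folds are assigned round-robin without shuffling, which is a simplification the source does not specify.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(indices: List[int], k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (training, validation) index lists for each of the k folds."""
    # Round-robin assignment: fold i receives indices[i], indices[i + k], ...
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


def nested_k_fold(n_patients: int, n_outer: int, m_inner: int):
    """Enumerate the outer (N) and inner (M) splits of steps 31 to 39."""
    plan = []
    for outer_train, outer_val in k_fold_indices(list(range(n_patients)), n_outer):
        # Steps 32 to 35: the outer training part is split again into M folds.
        inner = list(k_fold_indices(outer_train, m_inner))
        plan.append((outer_train, outer_val, inner))
    return plan
```

Because every patient lands in exactly one outer validation fold, each patient receives exactly one out-of-fold classification, which matches the statement that all patients end up as first or second patients.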
  • the sensitive gene discovery method acquires a patient data set that includes at least two groups of patient data, each group including multiple gene expression data, clinical data, and survival data of the corresponding patient, the clinical data indicating whether the target treatment is used; acquires a gene discovery model whose model parameters include the target treatment effect parameter, the gene effect parameter of each gene expression data, and the interaction effect parameter between the target treatment and each gene expression data, the gene discovery model being used to predict the genes that make a patient sensitive to the target treatment; and, based on the K-Fold cross-validation algorithm, fits the model parameters of the gene discovery model using the patient data set to determine which of the multiple gene expression data correspond to genes sensitive to the target treatment.
  • this solves the problem that, because the genes sensitive to the target treatment cannot be determined, it cannot be determined whether a patient is sensitive to the target treatment, so the treatment effect may be poor; sensitive genes whose interaction effect with the target treatment is statistically significant can be screened out to construct a gene signature; and the gene signature can be used to accurately predict patients who are sensitive to the target treatment.
  • a smaller sample size can be used to discover sensitive genes, which addresses the problem of a small actual sample size.
  • Fig. 4 is a block diagram of a sensitive gene discovery device provided by an embodiment of the present application.
  • the device includes at least the following modules: a data acquisition module 410, a model acquisition module 420, and a gene discovery module 430.
  • the data acquisition module 410 is used to acquire a patient data set; the patient data set includes at least two groups of patient data, each group including multiple gene expression data, clinical data, and survival data of the corresponding patient, and the clinical data includes whether the target treatment is used;
  • the model acquisition module 420 is used to acquire a gene discovery model; the model parameters of the gene discovery model include the target treatment effect parameter, the gene effect parameter of each gene expression data, and the interaction effect parameter between the target treatment and each gene expression data, and the gene discovery model is used to predict the genes that make a patient sensitive to the target treatment;
  • the gene discovery module 430 is configured to fit the model parameters of the gene discovery model using the patient data set, based on the K-Fold cross-validation algorithm, to determine which of the multiple gene expression data correspond to genes sensitive to the target treatment.
  • when the sensitive gene discovery device provided in the above embodiment performs sensitive gene discovery, the division into the above functional modules is only an example; the functions may be assigned to different functional modules as needed, that is, the internal structure of the sensitive gene discovery device may be divided into different functional modules to complete all or part of the functions described above.
  • the sensitive gene discovery device provided in the foregoing embodiment and the embodiment of the sensitive gene discovery method belong to the same concept, and the specific implementation process is detailed in the method embodiment, and will not be repeated here.
  • Fig. 5 is a block diagram of a sensitive gene discovery device provided by an embodiment of the present application.
  • the device at least includes a processor 501 and a memory 502.
  • the processor 501 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so on.
  • the processor 501 may be implemented in at least one hardware form among DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array).
  • the processor 501 may also include a main processor and a coprocessor.
  • the main processor is a processor used to process data in the awake state, also called a CPU (Central Processing Unit); the coprocessor is a low-power processor used to process data in the standby state.
  • the processor 501 may be integrated with a GPU (Graphics Processing Unit), which is used for rendering and drawing the content to be shown on the display screen.
  • the processor 501 may further include an AI (Artificial Intelligence) processor, which is used to process computing operations related to machine learning.
  • the memory 502 may include one or more computer-readable storage media, which may be non-transitory.
  • the memory 502 may also include high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices and flash memory storage devices.
  • the non-transitory computer-readable storage medium in the memory 502 is used to store at least one instruction, and the at least one instruction is executed by the processor 501 to implement the sensitive gene discovery method provided by the method embodiments in this application.
  • the sensitive gene discovery apparatus may optionally further include: a peripheral device interface and at least one peripheral device.
  • the processor 501, the memory 502, and the peripheral device interface may be connected through a bus or a signal line.
  • Each peripheral device can be connected to the peripheral device interface through a bus, a signal line or a circuit board.
  • peripheral devices include but are not limited to: radio frequency circuits, touch screens, audio circuits, and power supplies.
  • the sensitive gene discovery device may also include fewer or more components, which is not limited in this embodiment.
  • the present application also provides a computer-readable storage medium in which a program is stored, and the program is loaded and executed by a processor to implement the sensitive gene discovery method of the foregoing method embodiment.
  • this application also provides a computer product including a computer-readable storage medium in which a program is stored, and the program is loaded and executed by a processor to implement the sensitive gene discovery method of the above method embodiments.

Landscapes

  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Public Health (AREA)
  • Genetics & Genomics (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Epidemiology (AREA)
  • Pathology (AREA)
  • Primary Health Care (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Physiology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

A sensitive gene discovery method, a device and a storage medium, belonging to the field of biotechnology. The method comprises: acquiring a patient data set, the patient data set comprising at least two groups of patient data, each group of patient data comprising multiple gene expression data, clinical data and survival data of the corresponding patient, and the clinical data comprising whether a target treatment means is adopted (101); acquiring a gene discovery model, model parameters of the gene discovery model comprising an effect parameter of the target treatment means, a gene effect parameter of each gene expression data and an interaction effect parameter between the target treatment means and each gene expression data, and the gene discovery model being used for predicting the genes that make a patient sensitive to the target treatment means (102); and, according to a K-Fold cross-validation algorithm, fitting the model parameters of the gene discovery model by using the patient data set to determine, among the multiple gene expression data, the genes that are sensitive to the target treatment means (103). The method solves the problem that, because the genes sensitive to the target treatment means cannot be determined, it cannot be determined whether a patient is sensitive to the target treatment means; and it can screen out the genes sensitive to the target treatment means.

Description

Sensitive gene discovery method, device and storage medium
This application claims priority to Chinese patent application No. 201911170131.3, filed on November 26, 2019, the entire content of which is incorporated herein by reference.
Technical field
This application relates to a sensitive gene discovery method, device and storage medium, and belongs to the field of biotechnology.
Background
High-throughput sequencing, also known as "next-generation" sequencing, is characterized by the ability to sequence hundreds of thousands to millions of DNA molecules in parallel in a single run and by relatively short read lengths. With the rapid development of high-throughput sequencing, predicting the risk of disease has become possible.
Radiotherapy is a local treatment that uses radiation of different energies to treat tumors. Its role in tumor treatment has become increasingly prominent, and it is now one of the main means of treating malignant tumors.
However, some patients may be insensitive to radiotherapy, and applying radiotherapy to these patients cannot achieve the desired therapeutic effect. A method for discovering genes that are sensitive to a target treatment (such as radiotherapy) is therefore urgently needed.
Summary of the invention
This application provides a sensitive gene discovery method, device and storage medium, which can solve the problem that, because the genes sensitive to a target treatment cannot be determined, it cannot be determined whether a patient is sensitive to the target treatment, so the treatment effect may be poor. This application provides the following technical solutions:
In a first aspect, a sensitive gene discovery method is provided, the method including:
acquiring a patient data set, the patient data set including at least two groups of patient data, each group of patient data including multiple gene expression data, clinical data and survival data of the corresponding patient, and the clinical data including whether the target treatment is used;
acquiring a gene discovery model, the model parameters of the gene discovery model including the target treatment effect parameter, the gene effect parameter of each gene expression data, and the interaction effect parameter between the target treatment and each gene expression data, the gene discovery model being used to predict the genes that make a patient sensitive to the target treatment;
fitting the model parameters of the gene discovery model using the patient data set, based on the K-Fold cross-validation algorithm, to determine which of the multiple gene expression data correspond to genes sensitive to the target treatment.
Optionally, fitting the model parameters of the gene discovery model using the patient data set, based on the K-Fold cross-validation algorithm, to determine which of the multiple gene expression data correspond to genes sensitive to the target treatment includes:
dividing the patient data set into N first data sets, where the u-th first data set is the validation set for the u-th round of training with the first data sets and the other first data sets are the training set for that round; u takes the integer values 1 to N in turn, and N is an integer greater than 1;
for each round of training with the first data sets, training the single-gene discovery models with the corresponding training set, again based on the K-Fold cross-validation algorithm, to obtain the trained single-gene discovery models and the interaction significance probability of each gene;
determining, based on the trained single-gene discovery models, a training determination threshold and the number of training genes to be input to the gene discovery model, the training determination threshold being used to determine whether a patient in the corresponding validation set is a first patient, that is, a patient sensitive to the target treatment;
in ascending order of interaction significance probability, inputting the gene expression data of as many genes as the number of training genes, taken from the corresponding validation set, into the gene discovery model to obtain the model output result of the gene discovery model;
comparing the model output result with the training determination threshold to determine whether each patient in the validation set is a first patient;
for each first patient obtained after the N rounds of training with the first data sets, determining the k genes of that first patient as genes sensitive to the target treatment; the k genes are determined from the genes input to the gene discovery model for that first patient.
Optionally, the method further includes:
performing a log-rank test between the first survival curve of the first patients treated with the target treatment and the second survival curve of the first patients not treated with the target treatment, to obtain a log-rank test result;
when the log-rank test result is less than a preset significance level, triggering the step of determining the k genes of each first patient as genes sensitive to the target treatment.
Optionally, for each round of training with the first data sets, training the single-gene discovery models with the corresponding training set, again based on the K-Fold cross-validation algorithm, to obtain the trained single-gene discovery models and the interaction significance probability of each gene includes:
dividing the training set of the v-th round into M second data sets;
performing M rounds of parameter fitting on a preset single-gene discovery model using the second data sets, based on the K-Fold cross-validation algorithm, to obtain the fitted parameters;
wherein the single-gene discovery model is the gene discovery model corresponding to each gene;
the fitted parameters include, for each gene, the target treatment effect parameter, the gene effect parameter, the interaction effect parameter between the target treatment and the gene expression data of the gene, and the interaction significance probability;
the p-th second data set is the validation set for the p-th round of fitting with the second data sets, and the other second data sets are the training set for that round; v is a positive integer less than or equal to N, p takes the integer values 1 to M in turn, and M is an integer greater than 1.
Optionally, determining the training determination threshold and the number of training genes input to the gene discovery model based on the trained single-gene discovery models includes:
sorting the genes whose interaction effect parameter is less than 0 in ascending order of interaction significance probability;
obtaining multiple parameter combinations, each including a number of genes and a determination threshold;
for each parameter combination, inputting, in the sorted order, the gene expression data of as many genes as the number of genes in the combination into the gene discovery model to obtain a model output result; the gene discovery model includes the trained single-gene discovery model corresponding to each gene;
comparing the model output result with the determination threshold in the parameter combination to determine, in the corresponding training set, the first patients who are sensitive to the target treatment and the second patients who are not sensitive to the target treatment;
performing a log-rank test between the first survival curve of the first patients treated with the target treatment and the second survival curve of the first patients not treated with the target treatment, to obtain a log-rank test result;
determining, from the log-rank test results of the multiple parameter combinations, the parameter combination with the minimum log-rank test result, to obtain the training determination threshold and the number of training genes.
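The selection procedure above amounts to a grid search over (number of genes, determination threshold) pairs. The sketch below shows only that search logic under stated assumptions: `rank_genes` and `select_parameters` are hypothetical names, each gene is represented by an (interaction effect, interaction significance probability) pair, and `log_rank_p` is a caller-supplied callback standing in for the scoring, patient classification and log-rank test, which are not implemented here.

```python
from typing import Callable, Dict, Iterable, List, Sequence, Tuple


def rank_genes(interaction: Dict[str, Tuple[float, float]]) -> List[str]:
    """Keep genes with a negative interaction effect, ordered by ascending p-value."""
    kept = [(p, gene) for gene, (beta, p) in interaction.items() if beta < 0]
    return [gene for p, gene in sorted(kept)]


def select_parameters(
    ranked_genes: Sequence[str],
    gene_counts: Iterable[int],
    thresholds: Iterable[float],
    log_rank_p: Callable[[Sequence[str], float], float],
) -> Tuple[int, float, float]:
    """Return (g, R, p) for the combination with the smallest log-rank result."""
    thresholds = list(thresholds)  # reused for every gene count
    best = None
    for g in gene_counts:
        for r in thresholds:
            # Hypothetical callback: classify patients with the top-g genes and
            # threshold r, then return the log-rank result among first patients.
            p = log_rank_p(ranked_genes[:g], r)
            if best is None or p < best[2]:
                best = (g, r, p)
    if best is None:
        raise ValueError("no parameter combination supplied")
    return best
```

Passing the test as a callback keeps the search independent of the model family, which may be a proportional hazards or a logistic model according to the text.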
Optionally, the value of k is the same as the number of training genes obtained in the first round of training with the first data sets.
Optionally, before fitting the model parameters of the gene discovery model using the patient data set based on the K-Fold cross-validation algorithm, the method further includes:
for each group of patient data, inputting the gene expression data in the patient data into the single-gene discovery model one by one;
for each gene expression data, taking the survival data of the patient data as the output of the single-gene discovery model, and determining the interaction effect parameter and the interaction significance probability corresponding to the gene expression data;
screening out the genes whose interaction effect parameter is greater than or equal to 0 and whose interaction significance probability is greater than a preset significance value.
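The screening rule in the last step can be written as a small filter. This sketch follows the rule literally (a gene is dropped only when both conditions hold); the function name `prescreen_genes`, the input layout and the default value 0.05 for the preset significance value are assumptions, since the text does not fix that value.

```python
from typing import Dict, List, Tuple


def prescreen_genes(
    interaction: Dict[str, Tuple[float, float]],
    significance: float = 0.05,  # assumed default; the source leaves this value open
) -> List[str]:
    """Drop genes whose interaction effect is >= 0 and whose p-value exceeds the preset value."""
    kept = []
    for gene, (beta, p_value) in interaction.items():
        screened_out = beta >= 0 and p_value > significance
        if not screened_out:
            kept.append(gene)
    return kept
```

For example, a gene with interaction effect 0.3 and significance probability 0.2 is removed, while a gene with effect -0.5 and probability 0.01 is kept.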
Optionally, the gene discovery model is a proportional hazards regression model or a logistic model.
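For the proportional hazards case, one common way to write a single-gene model with a treatment effect, a gene effect and their interaction is shown below. The notation is an assumption for illustration and is not taken from the application: T is the treatment indicator, G_j the expression of gene j, and h_0(t) the baseline hazard.

```latex
h(t \mid T, G_j) = h_0(t)\,\exp\!\left(\beta_{T}\,T + \beta_{G_j}\,G_j + \beta_{TG_j}\,T\,G_j\right)
```

Under this form, the hazard ratio of treated versus untreated patients at expression level G_j is \exp(\beta_{T} + \beta_{TG_j} G_j), so a negative interaction coefficient means that the benefit of treatment grows with the expression of gene j.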
In a second aspect, a sensitive gene discovery device is provided, the device including:
a data acquisition module, used to acquire a patient data set, the patient data set including at least two groups of patient data, each group of patient data including multiple gene expression data, clinical data and survival data of the corresponding patient, and the clinical data including whether the target treatment is used;
a model acquisition module, used to acquire a gene discovery model, the model parameters of the gene discovery model including the target treatment effect parameter, the gene effect parameter of each gene expression data, and the interaction effect parameter between the target treatment and each gene expression data, the gene discovery model being used to predict the genes that make a patient sensitive to the target treatment;
a gene discovery module, used to fit the model parameters of the gene discovery model using the patient data set, based on the K-Fold cross-validation algorithm, to determine which of the multiple gene expression data correspond to genes sensitive to the target treatment.
In a third aspect, a sensitive gene discovery device is provided, the device including a processor and a memory; the memory stores a program, and the program is loaded and executed by the processor to implement the sensitive gene discovery method of the first aspect.
In a fourth aspect, a computer-readable storage medium is provided, in which a program is stored, and the program is loaded and executed by a processor to implement the sensitive gene discovery method of the first aspect.
The beneficial effects of this application are as follows. A patient data set is acquired, which includes at least two groups of patient data, each group including multiple gene expression data, clinical data and survival data of the corresponding patient, the clinical data including whether the target treatment is used. A gene discovery model is acquired, whose model parameters include the target treatment effect parameter, the gene effect parameter of each gene expression data, and the interaction effect parameter between the target treatment and each gene expression data, the gene discovery model being used to predict the genes that make a patient sensitive to the target treatment. The model parameters are fitted with the patient data set based on the K-Fold cross-validation algorithm to determine which of the multiple gene expression data correspond to genes sensitive to the target treatment. This solves the problem that, because the genes sensitive to the target treatment cannot be determined, it cannot be determined whether a patient is sensitive to the target treatment, so the treatment effect may be poor; sensitive genes whose interaction effect with the target treatment is statistically significant can be screened out to construct a gene signature; and the gene signature can accurately predict patients who are sensitive to the target treatment.
The above is only an overview of the technical solution of this application. So that the technical means of this application can be understood more clearly and implemented according to the content of the specification, preferred embodiments of this application are described in detail below with reference to the accompanying drawings.
Description of the drawings
Fig. 1 is a flowchart of a sensitive gene discovery method provided by an embodiment of this application;
Fig. 2 is a schematic diagram of determining the number of training genes and the training determination threshold provided by an embodiment of this application;
Fig. 3 is a flowchart of a sensitive gene discovery method provided by another embodiment of this application;
Fig. 4 is a block diagram of a sensitive gene discovery device provided by an embodiment of this application;
Fig. 5 is a block diagram of a sensitive gene discovery device provided by an embodiment of this application.
具体实施方式Detailed ways
下面结合附图和实施例,对本申请的具体实施方式作进一步详细描述。以下实施例用于说明本申请,但不用来限制本申请的范围。The specific implementation of the present application will be described in further detail below in conjunction with the accompanying drawings and embodiments. The following examples are used to illustrate the application, but are not used to limit the scope of the application.
首先,对本申请涉及的若干名词进行介绍:First, introduce several terms involved in this application:
对目标治疗手段敏感:是指在采用目标治疗手段治疗的情况下,患者能够获得更好的生存收益的特性。比如:放疗敏感性(Radiosensitivity):在接受放疗的情况下部分患者能够获得更好的生存收益的特性。其中,放射治疗(Radiotherapy):简称放疗,是恶性肿瘤的治疗手段之一,是用各种不同能量的射线照射肿瘤,以抑制和杀灭癌细胞。Sensitivity to target treatment means: It refers to the characteristic that patients can obtain better survival benefits under the condition of treatment with target treatment means. For example: Radiosensitivity (Radiosensitivity): in the case of receiving radiotherapy, some patients can obtain better survival benefits. Among them, radiotherapy (Radiotherapy): referred to as radiotherapy, is one of the treatment methods for malignant tumors, which is to irradiate the tumor with rays of different energy to inhibit and kill cancer cells.
基因标签(Gene signature):基因的集合,通常与疾病诊断和预后相关,在应用上能够从遗传因子的角度更深入地了解疾病。Gene signature: A collection of genes, usually related to disease diagnosis and prognosis, and can be used to gain a deeper understanding of diseases from the perspective of genetic factors.
交叉验证(Cross-validation):主要应用于建模应用中,在给定的建模样本中抽选大部分样本作训练集,剩余的部分用作验证集,用训练集对分类器进行训练,再利用验证集来测试训练得到的模型,来评价分类器的优劣。Cross-validation: Mainly used in modeling applications. Most of the samples are selected as the training set from the given modeling samples, and the remaining part is used as the validation set. The training set is used to train the classifier. Then use the validation set to test the trained model to evaluate the pros and cons of the classifier.
风险比(Hazard ratio):变量在暴露水平的风险率与非暴露水平时的风险率之比,既能记录发生事件的结果,同时也能体现事件发生所经历的时间。Hazard ratio: The ratio of the variable's risk rate at the exposure level to the risk rate at the non-exposure level. It can not only record the result of an event, but also reflect the time it takes for the event to occur.
精准医学:是依据患者内在的生物学信息以及临床症状和体征,对患者实施关于健康医疗和临床决策的量身定制方案。Precision medicine: It is based on the patient's internal biological information and clinical symptoms and signs to implement tailor-made plans for the patient's health care and clinical decision-making.
交互作用显著性概率:拒绝原假设交互作用不对生存产生显著影响,接受备择假设交互作用对生存产生显著影响。即交互作用对生存产生显著影响这一做法是错误做法的概率。交互作用的显著性越高,交互作用显著性概率P值越小。交互作用显著性概率P值越小,说明这一做法错误的可能性越低。通常通 过基因发现模型,将基因表达数据与目标治疗手段的交互项加入模型,可得到包括交互作用项的各项效应显著性概率。Probability of interaction significance: Rejecting the null hypothesis interaction does not have a significant impact on survival, and accepting the alternative hypothesis interaction has a significant impact on survival. That is, the probability that the interaction has a significant impact on survival is a wrong approach. The higher the significance of the interaction, the smaller the P value of the significance probability of the interaction. The smaller the P value of the significance probability of the interaction, the lower the possibility that this approach is wrong. Usually through gene discovery models, the interaction terms between gene expression data and the target treatment are added to the model, and the significance probability of each effect including the interaction terms can be obtained.
Proportional hazards regression model: also known as the Cox regression model, a semiparametric regression model proposed by the British statistician D. R. Cox in 1972. It describes the effect of multiple time-invariant features on the mortality rate at a given time, and it is an important model in survival analysis.
Logistic regression model: also known as logistic regression analysis, mainly used in epidemiology. For example, to explore the risk factors for gastric cancer, two groups can be selected, one with gastric cancer and one without; the two groups will differ in physical signs, lifestyle, and so on.
Survival curve: a curve connecting the data points, with follow-up time on the horizontal axis and survival rate on the vertical axis.
Log-rank test: used to compare survival data between groups. The test statistic is a chi-square statistic.
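As a rough illustration of the log-rank test defined above, the following sketch computes the two-group statistic in plain Python. The function name `logrank_test`, the 0/1 event coding (1 = death observed, 0 = censored), and the conversion of the statistic to a P value through the one-degree-of-freedom chi-square tail are assumptions made for this example, not details given in this application.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi_square, p_value) with 1 degree of freedom."""
    # Pool the observations as (time, event, group); event is 1 for death, 0 for censoring.
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e == 1})

    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        n = sum(1 for u, _, _ in data if u >= t)                # at risk, both groups
        n_a = sum(1 for u, _, g in data if u >= t and g == 0)   # at risk, group A
        d = sum(1 for u, e, _ in data if u == t and e == 1)     # deaths at t, both groups
        d_a = sum(1 for u, e, g in data if u == t and e == 1 and g == 0)
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            # Hypergeometric variance of the number of deaths in group A at time t.
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)

    if variance == 0.0:
        return 0.0, 1.0
    chi_square = observed_minus_expected ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return chi_square, p_value
```

Two groups with identical follow-up give a statistic of 0, while clearly separated groups give a small P value.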
Optionally, this application is described taking an electronic device as the execution subject of each embodiment. The electronic device may be a terminal device such as a computer, a tablet, or a mobile phone, or it may be a server; this embodiment does not limit the type of electronic device.
Fig. 1 is a flowchart of the sensitive gene discovery method provided by an embodiment of this application. The method includes at least the following steps:
Step 101: obtain a patient data set. The patient data set includes at least two sets of patient data; each set of patient data includes multiple gene expression data, clinical data, and survival data of the corresponding patient, and the clinical data includes whether the target treatment was used.
The target treatment may be radiotherapy; it may also be another gene-related treatment, which this embodiment does not limit.
Gene expression data reflect the abundance of gene transcripts (mRNA) in cells, measured directly or indirectly. These data can be used to analyze which genes have altered expression, how genes are correlated, and how gene activity is affected under different conditions. They have important applications in clinical diagnosis, evaluation of drug efficacy, and elucidation of disease mechanisms. At present, the main high-throughput methods for measuring genomic mRNA abundance are cDNA microarrays and oligonucleotide chips; with the development of such technologies, gene transcripts (mRNA) can be detected quantitatively or qualitatively at the whole-genome level. For example, referring to Table 1, the gene expression data in each set of patient data form a two-dimensional data set of the patients' standardized multi-gene mRNA expression values, in which the rows represent different patients (371 patients in Table 1) and the columns represent the expression values of different genes (14343 genes in Table 1).
Table 1:
| | Gene 1 | Gene 2 | Gene 3 | Gene 14343 |
| Patient 1 | -0.1419 | -0.5401 | -0.5733 | -0.5251 |
| Patient 2 | -0.4113 | -0.5112 | -0.5161 | -0.5347 |
| Patient 371 | -0.5173 | -0.5173 | -0.4237 | -0.5754 |
Clinical data indicate the patient's clinical presentation. Illustratively, the clinical data include whether the patient was treated with the target treatment. The clinical data may also include, but are not limited to, the patient's age at diagnosis of a disease, histological classification, pathological classification, and/or medication use. For example, referring to Table 2, the clinical data in each set of patient data form a two-dimensional clinical data set covering multiple types of data, in which the rows represent different patients (371 patients in Table 2) and the columns represent different types of clinical data (5 types in Table 2). Histological classification differs by disease. For thyroid adenoma, it includes follicular adenoma (including simple adenoma and eosinophilic adenoma), papillary adenoma, medullary tumors, and undifferentiated tumors; for lung cancer, it includes squamous cell carcinoma, adenocarcinoma, small cell carcinoma, and large cell carcinoma. This embodiment does not limit the histological classification scheme. Pathological classification likewise differs by disease. For cirrhosis, it includes four types: micronodular cirrhosis, macronodular cirrhosis, mixed micronodular and macronodular cirrhosis, and incomplete septal cirrhosis; for fatty liver, it includes simple fatty liver, steatohepatitis, fatty liver fibrosis, and fatty liver cirrhosis. This embodiment does not limit the pathological classification scheme.
Table 2:
Figure PCTCN2020126658-appb-000001
Survival data indicate the patient's survival status. Survival data include the patient's survival outcome and the time elapsed until that outcome. The survival outcome may be a survival risk value or a binary value (survival or death). Given the number of patients in Tables 1 and 2, there are 371 corresponding survival outcomes y.
Step 102: obtain a gene discovery model. The model parameters of the gene discovery model include a target treatment effect parameter, a gene effect parameter for each gene's expression data, and an interaction effect parameter between the target treatment and each gene's expression data. The gene discovery model is used to predict the genes in a patient that are sensitive to the target treatment.
Optionally, the gene discovery model is a proportional hazards regression model or a logistic model. When it is a proportional hazards regression model, the model output is a survival risk value; when it is a logistic model, the model output is a binary value.
Illustratively, the gene discovery model is expressed as:
$$h(t \mid X) = h_0(t)\exp\left(r\lambda + x_1 b_1 + x_2 b_2 + \dots + x_s b_s + r x_1 i_1 + r x_2 i_2 + \dots + r x_s i_s\right)$$
where $h_0(t)$ is the baseline hazard function; $\lambda$ is the effect parameter of the target treatment; $r$ indicates whether the target treatment was received ($r = 1$ if received, $r = 0$ if not); $x_1, x_2, \dots, x_s$ are the gene expression data of the respective genes; $b_1, b_2, \dots, b_s$ are the gene effect parameters of the respective genes; and $i_1, i_2, \dots, i_s$ are the interaction effect parameters between the target treatment and the respective gene expression data, reflecting the extent to which the effect of the target treatment on survival depends on the expression level of the corresponding gene.
For any patient, the hazard ratio (HR) is $\exp(r\lambda + x_j b_j + r x_j i_j)$. If the interaction effect parameter is negative, the HR may be less than 1; in that case, patients sensitive to the target treatment have a higher survival rate when they receive the target treatment than when they do not. If some patients carry genes sensitive to the target treatment, their overall hazard ratio (nHR) will tend to be very small, below a preset threshold. These patients, whose expected survival rate is relatively high, are the first patients sensitive to the target treatment, and the subset of gene expression data input to the gene discovery model is the gene expression data of the sensitive genes.
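To make the role of the interaction term concrete, the sketch below evaluates the exponential part of the model for one patient and compares the treated case (r = 1) with the untreated case (r = 0). The function names and the example coefficient values are hypothetical and chosen only for illustration; the ratio of the two cases reduces to exp(λ + Σ x_j i_j), because the gene main effects x_j b_j cancel.

```python
import math


def relative_hazard(r, x, lam, b, i):
    """exp(r*lam + sum_j x_j*b_j + r*sum_j x_j*i_j): hazard relative to the baseline h0(t)."""
    linear = r * lam
    for x_j, b_j, i_j in zip(x, b, i):
        linear += x_j * b_j + r * x_j * i_j
    return math.exp(linear)


def treatment_hazard_ratio(x, lam, b, i):
    """Hazard with treatment (r=1) divided by hazard without treatment (r=0), same patient."""
    return relative_hazard(1, x, lam, b, i) / relative_hazard(0, x, lam, b, i)
```

With negative interaction parameters and positive expression values, the ratio falls below 1, which matches the statement above that such patients do better with the target treatment.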
Based on the above principle, in this embodiment the genes sensitive to the target treatment can be obtained by training the gene discovery model.
Step 103: based on the K-Fold cross-validation algorithm, use the patient data set to fit the model parameters of the gene discovery model, so as to determine, among the multiple gene expression data, the genes sensitive to the target treatment.
In one example, the electronic device may use all of the patient's gene expression data to fit the model parameters of the gene discovery model.
In another example, the electronic device may first screen all of the patient's gene expression data to remove genes that are clearly insensitive to the target treatment. In this case, before this step, for each set of patient data, the gene expression data in the patient data are input one by one into a single-gene discovery model; for each gene's expression data, with the survival data of the patient data as the output of the single-gene discovery model, the interaction effect parameter and the interaction significance probability corresponding to that gene are determined; and genes whose interaction effect parameter is greater than or equal to 0 and whose interaction significance probability is greater than a preset significance value are screened out.
The single-gene discovery model is the model corresponding to each gene; its model parameters are of the same types as those of the gene discovery model. The target treatment effect parameter in the gene discovery model is the average of the target treatment effect parameters of the single-gene discovery models; the gene effect parameter of each gene in the gene discovery model is the gene effect parameter of that gene's single-gene discovery model; and the interaction effect parameter of each gene in the gene discovery model is the interaction effect parameter of that gene's single-gene discovery model.
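The assembly rule just described can be sketched as follows. The dictionary keys `lam`, `b`, and `i` and the function name are hypothetical; the fitting of each single-gene model is assumed to have been done elsewhere.

```python
from statistics import mean


def assemble_gene_discovery_model(single_gene_fits):
    """Combine single-gene fits (one dict per gene with keys 'lam', 'b', 'i').

    The treatment effect is the average of the single-gene treatment effects;
    the gene and interaction effects are taken over unchanged, gene by gene.
    """
    lam = mean([fit["lam"] for fit in single_gene_fits])
    b = [fit["b"] for fit in single_gene_fits]
    i = [fit["i"] for fit in single_gene_fits]
    return {"lam": lam, "b": b, "i": i}
```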
Here, the electronic device multiplies each of the patient's gene expression data by the indicator of whether the target treatment was used, obtaining the interaction term between the target treatment and each gene's expression data. For each gene, a single-gene discovery model is constructed from the survival data, the target treatment, the gene expression data, and the interaction term between the target treatment and the gene expression data. Taking the gene expression data in Table 1 as an example, there are 14343 gene expression data, so 14343 single-gene discovery models are constructed; each yields the effect parameter of each term, the standard error of each effect parameter, and the interaction significance probability. The preliminary screening determines as sensitive genes those whose interaction effect parameter is less than 0 and whose interaction significance probability is less than the preset significance value (for example, 0.05); that is, genes whose interaction effect parameter is greater than or equal to 0 and whose interaction significance probability is greater than the preset significance value are screened out. Suppose that screening the gene expression data of Table 1 yields 268 sensitive genes that meet these requirements.
The preset significance value is described here taking 0.05 as an example; in practice it may be another value, and this embodiment does not limit it.
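The preliminary screening can be sketched as below, assuming each single-gene fit has already produced an interaction coefficient and its standard error. Deriving the P value from a two-sided Wald z-test is an assumption of this example (the application does not state which test produces the interaction significance probability), and the names `wald_p_value` and `screen_genes` are hypothetical.

```python
import math


def wald_p_value(coef, std_err):
    """Two-sided P value of the Wald z-statistic coef / std_err (assumed test)."""
    z = abs(coef / std_err)
    return math.erfc(z / math.sqrt(2.0))


def screen_genes(fits, alpha=0.05):
    """Keep genes with a negative interaction coefficient and a P value below alpha.

    fits maps gene name -> (interaction_coef, std_err).
    Returns the kept gene names sorted by ascending P value.
    """
    kept = []
    for gene, (coef, std_err) in fits.items():
        p = wald_p_value(coef, std_err)
        if coef < 0 and p < alpha:
            kept.append((p, gene))
    return [gene for _, gene in sorted(kept)]
```

The ascending sort by P value is the ordering used later when the first g genes are fed into the gene discovery model.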
Fitting the model parameters of the gene discovery model with the patient data set based on the K-Fold cross-validation algorithm, so as to determine the genes sensitive to the target treatment among the multiple gene expression data, includes:
1. Divide the patient data set into N first data sets, where the u-th first data set is the validation set for the u-th round of training with the first data sets and the other first data sets are the training set for that round; u takes the integers 1 to N in turn, and N is an integer greater than 1.
The N first data sets may be of the same or different sizes. Taking 371 patients as an example, the patient data set is divided into 10 (N = 10) samples of equal size for 10-fold cross-validation. In the first round of training, 9 parts (the patient data of 334 patients) serve as the training set for fitting the model parameters of the gene discovery model, and the remaining part (the patient data of 37 patients) serves as the validation set, on which nHR values are calculated with the fitted gene discovery model. In the second round, one of the 9 parts used for training in the first round becomes the validation set, the 37 patients that formed the first validation set are added to the training set, the parameters of the gene discovery model are fitted again, and nHR values are calculated. Training is performed 10 times in total, with the validation set rotated 10 times so that all patient data are covered, yielding nHR values for all 371 patients.
Of course, N may also take other values; this embodiment does not limit the value of N.
2. For each round of training with the first data sets, train the single-gene discovery models on the corresponding training set, again based on the K-Fold cross-validation algorithm, to obtain the trained single-gene discovery models and the interaction significance probability of each gene.
In this application, a K-Fold cross-validation algorithm is nested inside each round of cross-validation training. Illustratively, the training set of the v-th round is divided into M second data sets; based on the K-Fold cross-validation algorithm, the second data sets are used to perform M rounds of parameter fitting on the preset single-gene discovery model, obtaining the fitted parameters.
The fitted parameters include, for each gene, the target treatment effect parameter, the gene effect parameter, the interaction effect parameter between the target treatment and the gene's expression data, and the interaction significance probability. The p-th second data set is the validation set for the p-th round of fitting with the second data sets, and the other second data sets are the training set for that round; v is a positive integer less than or equal to N, p takes the integers 1 to M in turn, and M is an integer greater than 1. The M second data sets may be of the same or different sizes.
For example, continuing the above example, for each round of training with the first data sets, the 334 samples are divided into 10 samples of equal size, and 10-fold cross-validation is carried out 10 times (M = 10). In this embodiment, the purpose of the inner nested training is to find the optimal combination of the number of genes and the decision threshold. As before, each round uses 9 parts of the patient data for training and 1 part for validation, only with different patient data: 9 parts are 301 patients and 1 part is 33 patients. For each gene (268 in total after screening), a single-gene discovery model is constructed from the survival data, the target treatment, the gene expression data, and the interaction term between the target treatment and the gene expression data. The genes with a negative interaction effect value are selected and sorted by interaction significance probability in ascending order.
Of course, M may also take other values; this embodiment does not limit the value of M.
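The nesting of an inner M-fold split inside each outer N-fold round can be sketched as follows. This is a minimal illustration assuming consecutive, unshuffled folds; the function names are hypothetical and no model fitting is performed. Note that 371 is not divisible by 10, so in this sketch one outer fold holds 38 patients and the rest hold 37.

```python
def split_folds(items, n_folds):
    """Split items into n_folds consecutive parts whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_folds)
    folds, start = [], 0
    for k in range(n_folds):
        size = base + (1 if k < extra else 0)
        folds.append(items[start:start + size])
        start += size
    return folds


def nested_cross_validation(patients, n_outer, n_inner):
    """Yield (outer_validation, inner_train, inner_validation) for every round."""
    outer_folds = split_folds(patients, n_outer)
    for u, outer_validation in enumerate(outer_folds):
        # Outer training set: every outer fold except the u-th one.
        outer_train = [p for k, fold in enumerate(outer_folds) if k != u for p in fold]
        inner_folds = split_folds(outer_train, n_inner)
        for v, inner_validation in enumerate(inner_folds):
            # Inner training set: every inner fold except the v-th one.
            inner_train = [q for k, fold in enumerate(inner_folds) if k != v for q in fold]
            yield outer_validation, inner_train, inner_validation
```

With N = M = 10, the generator produces 100 rounds, and in every round the three index sets are disjoint and together cover all patients.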
3. Based on the trained single-gene discovery models, determine a training decision threshold and the number of training genes to input to the gene discovery model; the training decision threshold is used to determine whether a patient in the corresponding validation set is a first patient sensitive to the target treatment.
Sort the genes whose interaction effect parameter is less than 0 by interaction significance probability in ascending order. Obtain multiple parameter combinations, each including a number of genes and a decision threshold. For each parameter combination, input the gene expression data of that number of genes, taken in the sorted order, into the gene discovery model to obtain a model output; the gene discovery model includes the trained single-gene discovery model of each gene. Compare the model output with the decision threshold of the parameter combination to determine, in the corresponding training set, the first patients sensitive to the target treatment and the second patients not sensitive to it. For the first patients, perform a log-rank test between the first survival curve of those treated with the target treatment and the second survival curve of those not treated with it, obtaining a log-rank test result. Among the log-rank test results of the parameter combinations, select the parameter combination with the smallest result, which gives the training decision threshold and the number of training genes.
For example, the number of genes ranges over [2, 150] and the decision threshold ranges over [0.01, 0.5]. Choosing a number of genes from the first range and a decision threshold from the second gives 4750 parameter combinations. The effect parameters of each single-gene discovery model are obtained from the nested training of step 2 and applied to the gene discovery model to calculate, for the current outermost round of training (that is, training with the first data sets), each patient's nHR value under every parameter combination. Comparing the nHR with the corresponding decision threshold determines whether the patient is a first patient sensitive to the target treatment or a second patient not sensitive to it. For the first patients, a log-rank test compares the survival curves of those who received the target treatment with those who did not. The number of genes and decision threshold of the combination with the smallest log-rank test result form the optimal parameter combination. Referring to Fig. 2, which shows the number of training genes and the training decision threshold automatically selected by the electronic device, the number of training genes is 33 and the training decision threshold is 0.01.
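The search over parameter combinations can be sketched as a plain grid search. The callback `logrank_p_value(g, R)` is a hypothetical, caller-supplied function that classifies the patients using the first g genes and threshold R and returns the log-rank test result for the resulting first patients; the grid spacing is not stated in the text, so any grid passed in is an assumption of the caller.

```python
from itertools import product


def select_best_combination(gene_counts, thresholds, logrank_p_value):
    """Return (p_value, gene_count, threshold) for the smallest log-rank result.

    logrank_p_value(g, R) is supplied by the caller (hypothetical helper).
    Ties are resolved in favor of the combination encountered first.
    """
    best = None
    for g, R in product(gene_counts, thresholds):
        p = logrank_p_value(g, R)
        if best is None or p < best[0]:
            best = (p, g, R)
    return best
```

For instance, calling it with `range(2, 151)` and a list of thresholds between 0.01 and 0.5 scans every pair once and keeps the pair with the smallest test result.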
4. Input the gene expression data of the training number of genes from the corresponding validation set, in ascending order of interaction significance probability, into the gene discovery model to obtain the model output of the gene discovery model.
5. Compare the model output with the training decision threshold to determine whether each patient in the validation set is a first patient.
Using the number of training genes g and the training decision threshold R determined in step 3, the model output HR (nHR) for the patient's first g genes is calculated on the validation set of the nested training as follows:
Figure PCTCN2020126658-appb-000002
where λ is the effect parameter of the target treatment, taken as the average of the estimates from the g single-gene models (33 in Fig. 2). If the nHR is less than 0.01, the patient is classified as a first patient sensitive to the target treatment; if the nHR is greater than or equal to 0.01, the patient is classified as a second patient not sensitive to it. Because each patient appears in a validation set exactly once, all patients are classified as first or second patients after the training process of steps 1 to 4.
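The classification rule above amounts to a single threshold comparison. The sketch below takes precomputed nHR values as input (it does not reproduce the nHR formula); the function name and the patient identifiers in the usage note are hypothetical.

```python
def classify_patients(nhr_by_patient, threshold):
    """Split patients into first (nHR < threshold) and second (nHR >= threshold) patients."""
    first, second = [], []
    for patient, nhr in nhr_by_patient.items():
        if nhr < threshold:
            first.append(patient)
        else:
            second.append(patient)
    return first, second
```

With a threshold of 0.01, a patient whose nHR is exactly 0.01 falls into the second group, consistent with the "greater than or equal to" wording.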
6. For the first patients obtained after the N rounds of training with the first data sets, determine the k genes of each first patient as genes sensitive to the target treatment, where the k genes are determined from the genes input to the gene discovery model for that first patient.
Optionally, for the first patients obtained after the N rounds of training with the first data sets, perform a log-rank test between the first survival curve of the first patients treated with the target treatment and the second survival curve of those not treated with it, obtaining a log-rank test result; the step of determining the k genes of each first patient as genes sensitive to the target treatment is executed only when the log-rank test result is less than a preset significance level.
Optionally, the process ends when the log-rank test result is greater than or equal to the preset significance level.
For example, for the predicted first patients, the survival curves of patients who received the target treatment and patients who did not are compared by a log-rank test at a specified significance level (the preset significance level). If survival is significantly improved (that is, the log-rank test result is less than the significance level), the target treatment benefits the first patients, meaning that the gene signature is valid and the prediction of first patients is accurate. In this example, 371 patients are predicted, of whom 174 are first patients and 197 are second patients.
The preset significance level may be 0.05 or another value; this embodiment does not limit it.
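The survival curves compared in this step can be estimated in several ways, and this application does not name an estimator. A minimal sketch using the Kaplan-Meier product-limit estimator, chosen here purely as an assumption, is given below; events are coded 1 for an observed death and 0 for censoring.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct observed event time."""
    curve, survival = [], 1.0
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve
```

Plotting the returned pairs as a step function, with follow-up time on the horizontal axis and survival rate on the vertical axis, gives one curve per group of first patients (treated and untreated).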
In one example, the value of k equals the number of training genes obtained in the first round of training with the first data sets.
Of course, k may also be the average of the numbers of training genes obtained over the N rounds of training with the first data sets. This embodiment does not limit the value of k.
Optionally, for a clearer understanding of the sensitive gene discovery method provided by this application, an example is described below. Referring to the sensitive gene discovery process shown in Fig. 3, the process includes at least steps 31 to 41:
Step 31: divide the patient data set into N parts, with N-1 parts as the training set and 1 part as the validation set;
Step 32: divide the N-1 parts of the training set into M parts, with M-1 parts as the training set and 1 part as the validation set;
Step 33: use the M-1 parts as the training set to construct single-gene discovery models from the survival data, the target treatment, the gene expression data, and the interaction terms between the target treatment and each gene's expression data, obtaining the fitted parameters; the fitted parameters include, for each gene, the target treatment effect parameter, the gene effect parameter, the interaction effect parameter between the target treatment and the gene's expression data, and the interaction significance probability;
Step 34: use the validation set obtained in step 32 to validate the model output of the gene discovery model;
Step 35: determine whether the current number of training rounds has reached M; if so, go to step 36; if not, return to step 32;
Note that the validation set obtained by each division in step 32 is different.
Step 36: use the training set obtained in step 31 to calculate the model output of the gene discovery model;
Step 37: divide the patients into first patients and second patients under each parameter combination of the number of genes g and the decision threshold R; perform a log-rank test on the first patients, comparing the difference in survival rate between first patients who received the target treatment and those who did not; and determine the number of training genes g and the training decision threshold R corresponding to the smallest log-rank test result;
Step 38: use the validation set obtained in step 31, together with the number of training genes g and the training decision threshold R determined in step 37, to calculate the model output of the gene discovery model again;
步骤39,确定当前训练次数是否达到N次;若是,则执行步骤40;若否,则再次执行步骤31;Step 39: Determine whether the current number of training times reaches N times; if yes, go to step 40; if not, go to step 31 again;
需要补充说明的是,步骤31中每次划分得到的验证集不同。It needs to be supplemented that the verification set obtained for each division in step 31 is different.
Step 40: compare the model outputs obtained after the N training rounds with the corresponding training decision threshold, and output the gene signature (gene label) consisting of g genes together with the first patients;
Here, the training decision threshold may be the one determined in the first round of outer training.
Step 41: perform a log-rank test on the first patients to compare the survival rates of those who received the target treatment with those who did not; a significant difference indicates that the target treatment benefits the first patients, that is, the gene signature is valid and the prediction for the first patients is accurate.
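The parameter search in step 37 and the check in step 41 can be sketched in plain Python. The log-rank p-value below uses the standard two-group chi-square statistic with one degree of freedom. The scoring rule is an assumption, because this passage does not define the model output: each patient's score is taken as the sum of interaction coefficient times expression over the top g ranked genes, and a patient counts as a first patient when the score is below R. The names `logrank_p`, `select_g_and_R`, `expr`, and `gammas` are hypothetical.

```python
import math
from itertools import product


def logrank_p(times, events, groups):
    """Two-group log-rank test p-value (chi-square, 1 degree of freedom).

    times: follow-up times; events: 1 if the event was observed, 0 if censored;
    groups: 1 for patients who received the target treatment, 0 otherwise.
    """
    obs_minus_exp, var = 0.0, 0.0
    for t in sorted({ti for ti, e in zip(times, events) if e}):
        at_risk = [g for ti, g in zip(times, groups) if ti >= t]
        n, n1 = len(at_risk), sum(at_risk)
        d = sum(1 for ti, e in zip(times, events) if ti == t and e)
        d1 = sum(1 for ti, e, g in zip(times, events, groups) if ti == t and e and g)
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if var == 0:
        return 1.0  # no information to separate the two arms
    return math.erfc(math.sqrt(obs_minus_exp ** 2 / var / 2))


def select_g_and_R(expr, gammas, times, events, treated, g_grid, r_grid):
    """Return (p, g, R) for the combination with the smallest log-rank p-value.

    expr[i][j] is the expression of ranked gene j for patient i; genes are
    assumed to be pre-sorted by interaction significance probability.
    """
    best = None
    for g, R in product(g_grid, r_grid):
        scores = [sum(gammas[j] * row[j] for j in range(g)) for row in expr]
        idx = [i for i, s in enumerate(scores) if s < R]  # assumed rule for first patients
        if len({treated[i] for i in idx}) < 2:
            continue  # both treated and untreated patients are needed
        p = logrank_p([times[i] for i in idx],
                      [events[i] for i in idx],
                      [treated[i] for i in idx])
        if best is None or p < best[0]:
            best = (p, g, R)
    return best
```

The same `logrank_p` function can serve for step 41 by passing only the first patients from the final output.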
In summary, the sensitive gene discovery method provided in this embodiment acquires a patient data set that includes at least two sets of patient data, where each set includes multiple gene expression data, clinical data, and survival data of the corresponding patient, and the clinical data indicates whether the target treatment was used. It acquires a gene discovery model whose model parameters include the target-treatment effect parameter, the gene effect parameter of each gene expression data, and the interaction effect parameter between the target treatment and each gene expression data; the model is used to predict the genes that make a patient sensitive to the target treatment. It then fits the model parameters on the patient data set using the K-Fold cross-validation algorithm to determine which genes among the multiple gene expression data are sensitive to the target treatment. This addresses the problem that, when the genes sensitive to the target treatment cannot be identified, it cannot be determined whether a patient is sensitive to the treatment, and the treatment may work poorly. The method can select the sensitive genes whose interaction effect with the target treatment is statistically significant and build a gene signature from them, and the gene signature can then be used to accurately predict which patients are sensitive to the target treatment.
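For illustration, the three effect parameters can be written out for a single gene j, assuming the proportional hazards form that the claims list as one option for the model. The symbols are introduced here and are not taken from the original text: T_i is 1 if patient i received the target treatment and 0 otherwise, x_ij is the expression of gene j, and h_0(t) is the baseline hazard.

```latex
h(t \mid T_i, x_{ij}) = h_0(t)\,\exp\!\left(\alpha_j T_i + \beta_j x_{ij} + \gamma_j T_i x_{ij}\right),
\qquad
\frac{h(t \mid T_i = 1, x_{ij})}{h(t \mid T_i = 0, x_{ij})} = \exp\!\left(\alpha_j + \gamma_j x_{ij}\right)
```

Under this form, \alpha_j is the target-treatment effect parameter, \beta_j the gene effect parameter, and \gamma_j the interaction effect parameter. A negative \gamma_j means the hazard ratio of treated to untreated patients falls as expression of gene j rises, which is consistent with the claims ranking only genes whose interaction effect parameter is less than 0.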
In addition, the cross-validation scheme allows sensitive genes to be discovered from a relatively small number of samples, which addresses the problem of small sample sizes in practice.
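The two-level splitting behind this scheme can be sketched as follows: an outer N-fold loop in which each fold serves once as the validation set, and an inner M-fold loop applied to each outer training set. This is a minimal sketch on patient indices only; the function names are hypothetical, and the strided fold assignment stands in for whatever shuffling a real implementation would use.

```python
def k_fold(indices, k):
    """Yield (train, validation) index lists; each fold is the validation set once."""
    folds = [indices[i::k] for i in range(k)]  # strided assignment, no shuffling
    for u in range(k):
        train = [i for f, fold in enumerate(folds) if f != u for i in fold]
        yield train, folds[u]


def nested_splits(n_patients, n_outer, m_inner):
    """Outer N-fold loop; each outer training set is split again into M inner folds."""
    for outer_train, outer_val in k_fold(list(range(n_patients)), n_outer):
        inner = list(k_fold(outer_train, m_inner))
        yield outer_train, outer_val, inner
```

Every patient appears in exactly one outer validation set, so all samples contribute to both fitting and validation even when the data set is small.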
Fig. 4 is a block diagram of a sensitive gene discovery device provided by an embodiment of the present application. The device includes at least the following modules: a data acquisition module 410, a model acquisition module 420, and a gene discovery module 430.
The data acquisition module 410 is configured to acquire a patient data set, the patient data set including at least two sets of patient data, each set of patient data including multiple gene expression data, clinical data, and survival data of the corresponding patient, and the clinical data including whether the target treatment was used;
The model acquisition module 420 is configured to acquire a gene discovery model, the model parameters of the gene discovery model including the target-treatment effect parameter, the gene effect parameter of each gene expression data, and the interaction effect parameter between the target treatment and each gene expression data, the gene discovery model being used to predict the genes that make a patient sensitive to the target treatment;
The gene discovery module 430 is configured to fit the model parameters of the gene discovery model on the patient data set based on the K-Fold cross-validation algorithm, so as to determine the genes among the multiple gene expression data that are sensitive to the target treatment.
For related details, refer to the method embodiments above.
It should be noted that the division into functional modules described above is only an example of how the sensitive gene discovery device may perform sensitive gene discovery. In practice, the functions may be assigned to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to perform all or part of the functions described above. In addition, the sensitive gene discovery device provided in the above embodiment is based on the same concept as the sensitive gene discovery method embodiments; for its specific implementation, see the method embodiments, which are not repeated here.
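One possible way to express the three-module division in code is sketched below. The class and field names are hypothetical; in particular, representing survival data as a follow-up time plus an event flag is an assumption, and the modules are modelled as interchangeable callables to reflect that the division of functions is not fixed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class PatientRecord:
    expression: Dict[str, float]  # gene name -> expression value
    treated: bool                 # clinical data: target treatment used or not
    time: float                   # survival data: follow-up time (assumed field)
    event: bool                   # survival data: event observed (assumed field)


@dataclass
class GeneDiscoveryDevice:
    acquire_data: Callable[[], List[PatientRecord]]               # module 410
    acquire_model: Callable[[], object]                           # module 420
    discover: Callable[[List[PatientRecord], object], List[str]]  # module 430

    def run(self) -> List[str]:
        """Acquire data and model, then return the names of the sensitive genes."""
        data = self.acquire_data()
        if len(data) < 2:
            raise ValueError("the patient data set needs at least two patient records")
        return self.discover(data, self.acquire_model())
```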
Fig. 5 is a block diagram of a sensitive gene discovery device provided by an embodiment of the present application. The device includes at least a processor 501 and a memory 502.
The processor 501 may include one or more processing cores, for example a 4-core or 8-core processor. The processor 501 may be implemented in at least one of the following hardware forms: DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array). The processor 501 may also include a main processor and a coprocessor. The main processor, also called a CPU (Central Processing Unit), processes data in the awake state; the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 501 may be integrated with a GPU (Graphics Processing Unit), which renders and draws the content to be shown on the display screen. In some embodiments, the processor 501 may further include an AI (Artificial Intelligence) processor, which handles computing operations related to machine learning.
The memory 502 may include one or more computer-readable storage media, which may be non-transitory. The memory 502 may also include high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In some embodiments, the non-transitory computer-readable storage medium in the memory 502 stores at least one instruction, and the at least one instruction is executed by the processor 501 to implement the sensitive gene discovery method provided by the method embodiments of this application.
In some embodiments, the sensitive gene discovery device may optionally further include a peripheral device interface and at least one peripheral device. The processor 501, the memory 502, and the peripheral device interface may be connected by a bus or a signal line. Each peripheral device may be connected to the peripheral device interface by a bus, a signal line, or a circuit board. Illustratively, the peripheral devices include but are not limited to a radio frequency circuit, a touch display screen, an audio circuit, and a power supply.
Of course, the sensitive gene discovery device may include fewer or more components, which is not limited in this embodiment.
Optionally, the present application further provides a computer-readable storage medium in which a program is stored, the program being loaded and executed by a processor to implement the sensitive gene discovery method of the above method embodiments.
Optionally, the present application further provides a computer product that includes a computer-readable storage medium in which a program is stored, the program being loaded and executed by a processor to implement the sensitive gene discovery method of the above method embodiments.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not every possible combination of these technical features is described; however, any combination that involves no contradiction should be considered within the scope of this specification.
The above embodiments describe only several implementations of the present application. Although the description is relatively specific and detailed, it should not be understood as limiting the scope of the patent. It should be noted that those of ordinary skill in the art can make modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

  1. A sensitive gene discovery method, characterized in that the method comprises:
    acquiring a patient data set, the patient data set including at least two sets of patient data, each set of patient data including multiple gene expression data, clinical data, and survival data of the corresponding patient, and the clinical data including whether a target treatment was used;
    acquiring a gene discovery model, the model parameters of the gene discovery model including a target-treatment effect parameter, a gene effect parameter of each gene expression data, and an interaction effect parameter between the target treatment and each gene expression data, the gene discovery model being used to predict the genes that make a patient sensitive to the target treatment;
    fitting the model parameters of the gene discovery model on the patient data set based on a K-Fold cross-validation algorithm, so as to determine the genes among the multiple gene expression data that are sensitive to the target treatment.
  2. The method according to claim 1, characterized in that fitting the model parameters of the gene discovery model on the patient data set based on the K-Fold cross-validation algorithm, so as to determine the genes among the multiple gene expression data that are sensitive to the target treatment, comprises:
    dividing the patient data set into N first data sets, wherein the u-th first data set is the validation set for the u-th round of training with the first data sets, the other first data sets form the training set for that round, u takes each integer from 1 to N in turn, and N is an integer greater than 1;
    for each round of training with the first data sets, training a single-gene discovery model on the corresponding training set, again based on the K-Fold cross-validation algorithm, to obtain the trained single-gene discovery model and the interaction significance probability of each gene;
    determining, based on the trained single-gene discovery model, a training decision threshold and the number of training genes to be input into the gene discovery model, the training decision threshold being used to determine whether a patient in the corresponding validation set is a first patient who is sensitive to the target treatment;
    inputting, in ascending order of interaction significance probability, the gene expression data of that number of training genes from the corresponding validation set into the gene discovery model to obtain the model output of the gene discovery model;
    comparing the model output with the training decision threshold to determine whether each patient in the validation set is a first patient;
    for the first patients obtained after the N rounds of training with the first data sets, determining k genes of each first patient as genes sensitive to the target treatment, the k genes being determined from the genes input into the gene discovery model for that first patient.
  3. The method according to claim 2, characterized in that the method further comprises:
    performing a log-rank test between a first survival curve of the first patients who were treated with the target treatment and a second survival curve of the first patients who were not, to obtain a log-rank test result;
    when the log-rank test result is less than a preset significance level, triggering the step of determining the k genes of each first patient as genes sensitive to the target treatment.
  4. The method according to claim 2, characterized in that, for each round of training with the first data sets, training the single-gene discovery model on the corresponding training set, again based on the K-Fold cross-validation algorithm, to obtain the trained single-gene discovery model and the interaction significance probability of each gene, comprises:
    dividing the training set of the v-th round into M second data sets;
    performing M rounds of parameter fitting on a preset single-gene discovery model with the second data sets, based on the K-Fold cross-validation algorithm, to obtain the fitted parameters;
    wherein the single-gene discovery model is the gene discovery model corresponding to each individual gene;
    the fitted parameters include, for each gene, the target-treatment effect parameter, the gene effect parameter, the interaction effect parameter between the target treatment and the gene expression data of that gene, and the interaction significance probability;
    the p-th second data set is the validation set for the p-th round of fitting with the second data sets, the other second data sets form the training set for that round, v is a positive integer less than or equal to N, p takes each integer from 1 to M in turn, and M is an integer greater than 1.
  5. The method according to claim 2, characterized in that determining, based on the trained single-gene discovery model, the training decision threshold and the number of training genes to be input into the gene discovery model comprises:
    sorting the genes whose interaction effect parameter is less than 0 in ascending order of interaction significance probability;
    obtaining multiple parameter combinations, each parameter combination including a number of genes and a decision threshold;
    for each parameter combination, inputting, in the sorted order, the gene expression data of the number of genes given in the parameter combination into the gene discovery model to obtain a model output, the gene discovery model including the trained single-gene discovery model of each gene;
    comparing the model output with the decision threshold in the parameter combination to determine the first patients in the corresponding training set who are sensitive to the target treatment and the second patients who are not sensitive to the target treatment;
    performing a log-rank test between a first survival curve of the first patients who were treated with the target treatment and a second survival curve of the first patients who were not, to obtain a log-rank test result;
    determining, from the log-rank test results of the multiple parameter combinations, the parameter combination with the smallest log-rank test result, to obtain the training decision threshold and the number of training genes.
  6. The method according to claim 2, characterized in that the value of k equals the number of training genes obtained in the first round of training with the first data sets.
  7. The method according to any one of claims 1 to 6, characterized in that, before fitting the model parameters of the gene discovery model on the patient data set based on the K-Fold cross-validation algorithm, the method further comprises:
    for each set of patient data, inputting the gene expression data in the patient data into a single-gene discovery model one at a time;
    for each gene expression data, taking the survival data of the patient data as the output of the single-gene discovery model, and determining the interaction effect parameter and the interaction significance probability corresponding to the gene expression data;
    removing the genes whose interaction effect parameter is greater than or equal to 0 and whose interaction significance probability is greater than a preset significance value.
  8. The method according to any one of claims 1 to 6, characterized in that the gene discovery model is a proportional hazards regression model or a logistic model.
  9. A sensitive gene discovery device, characterized in that the device comprises:
    a data acquisition module, configured to acquire a patient data set, the patient data set including at least two sets of patient data, each set of patient data including multiple gene expression data, clinical data, and survival data of the corresponding patient, and the clinical data including whether a target treatment was used;
    a model acquisition module, configured to acquire a gene discovery model, the model parameters of the gene discovery model including a target-treatment effect parameter, a gene effect parameter of each gene expression data, and an interaction effect parameter between the target treatment and each gene expression data, the gene discovery model being used to predict the genes that make a patient sensitive to the target treatment;
    a gene discovery module, configured to fit the model parameters of the gene discovery model on the patient data set based on a K-Fold cross-validation algorithm, so as to determine the genes among the multiple gene expression data that are sensitive to the target treatment.
  10. A computer-readable storage medium, characterized in that a program is stored in the storage medium, and the program, when executed by a processor, implements the sensitive gene discovery method according to any one of claims 1 to 7.
PCT/CN2020/126658 2019-11-26 2020-11-05 Sensitive gene discovery method, device and storage medium WO2021103973A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911170131.3 2019-11-26
CN201911170131.3A CN110880355B (en) 2019-11-26 2019-11-26 Sensitivity gene discovery method, device and storage medium

Publications (1)

Publication Number Publication Date
WO2021103973A1 true WO2021103973A1 (en) 2021-06-03

Family

ID=69729151

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/126658 WO2021103973A1 (en) 2019-11-26 2020-11-05 Sensitive gene discovery method, device and storage medium

Country Status (2)

Country Link
CN (1) CN110880355B (en)
WO (1) WO2021103973A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115346656A (en) * 2022-06-10 2022-11-15 江门市中心医院 Three-group chemistry IDC (internet data center) prognosis model establishing method and prognosis model system based on CAFs (computer aided design), WSIs (wireless sensors and information systems) and clinical information

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110880355B (en) * 2019-11-26 2023-08-01 苏州大学 Sensitivity gene discovery method, device and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104573410A (en) * 2015-01-20 2015-04-29 合肥工业大学 Cancer chemosensitivity prediction technique based on molecular subnet and random forest classifier
US20160224723A1 (en) * 2015-01-29 2016-08-04 The Trustees Of Columbia University In The City Of New York Method for predicting drug response based on genomic and transcriptomic data
CN107609326A (en) * 2017-07-26 2018-01-19 同济大学 Drug sensitivity prediction method in the accurate medical treatment of cancer
CN110880355A (en) * 2019-11-26 2020-03-13 苏州大学 Sensitive gene discovery method, device and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101215602B (en) * 2007-12-28 2013-01-23 上海敏芯信息科技有限公司 Method for screening gene chip difference expression gene
WO2017198685A1 (en) * 2016-05-18 2017-11-23 Université Libre de Bruxelles Method for determining sensitivity to a cdk4/6 inhibitor
CN109346181B (en) * 2018-08-15 2021-08-17 上海长海医院 Radiotherapy sensitivity marker gene screening method for balancing clinical confounding factors

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JIANG QINGHUA , WANG YAMIN ,TANG ZAIXIANG: "Study on the relationship between HSPB1 gene expression and radiosensitivity sensitivity based on survival analysis model", ACTA UNIVERSITATIS MEDICINALIS ANHUI, vol. 54, no. 2, 11 January 2019 (2019-01-11), pages 261 - 266, XP055817548, ISSN: 1000-1492, DOI: 10.19405/j.cnki.issn1000-1492.2019.02.020 *
JIANG QINGHUA: "Study on the Relationship Between HSPB1 Gene Expression and Radiosensitivity Sensitivity based on Survival Analysis Model", CHINESE JOURNAL OF CANCER PREVENTION AND TREATMENT, vol. 26, no. 17, 11 January 2019 (2019-01-11), pages 261 - 266, XP009528320, ISSN: 1673-5269 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115346656A (en) * 2022-06-10 2022-11-15 江门市中心医院 Three-group chemistry IDC (internet data center) prognosis model establishing method and prognosis model system based on CAFs (computer aided design), WSIs (wireless sensors and information systems) and clinical information
CN115346656B (en) * 2022-06-10 2023-10-27 江门市中心医院 Three-group IDC prognosis model building method and prognosis model system based on CAFs, WSIs and clinical information

Also Published As

Publication number Publication date
CN110880355A (en) 2020-03-13
CN110880355B (en) 2023-08-01

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20893332

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20893332

Country of ref document: EP

Kind code of ref document: A1