WO2021103973A1

WO2021103973A1 - Sensitive gene discovery method, device and storage medium

Info

Publication number: WO2021103973A1
Application number: PCT/CN2020/126658
Authority: WO
Inventors: 汤在祥; 顾金成; 曹建平; 杨巍; 聂继华; 焦旸
Original assignee: 苏州大学
Priority date: 2019-11-26
Filing date: 2020-11-05
Publication date: 2021-06-03
Also published as: CN110880355A; CN110880355B

Abstract

A sensitive gene discovery method, a device and a storage medium, belonging to the field of biotechnology. The method comprises: acquiring a patient data set, the patient data set comprising at least two groups of patient data, each group of patient data comprising multiple gene expression data, clinical data and survival data corresponding to a patient, and the clinical data comprising whether a target treatment means is adopted (101); acquiring a gene discovery model, model parameters of the gene discovery model comprising an effect parameter of the target treatment means, a gene effect parameter of each gene expression data and an interaction effect parameter between the target treatment means and each gene expression data, and the gene discovery model being used for predicting genes that make a patient sensitive to the target treatment means (102); and, according to a K-Fold cross-validation algorithm, fitting the model parameters of the gene discovery model by using the patient data set to determine genes that are sensitive to the target treatment means among the multiple gene expression data (103). The method addresses the problem that, when genes sensitive to the target treatment means cannot be determined, it is impossible to determine whether a patient is sensitive to the target treatment means; it can screen out the genes sensitive to the target treatment means.

Description

Sensitive gene discovery method, device and storage medium

This application claims the priority of the Chinese patent application whose application date is November 26, 2019 and the application number is 201911170131.3, the entire content of which is incorporated into this application by reference.

Technical field

This application relates to a method, device and storage medium for discovering sensitive genes, and belongs to the field of biotechnology.

Background

High-throughput sequencing technology, also known as "next generation" sequencing technology, is marked by the ability to sequence hundreds of thousands to millions of DNA molecules at a time and the read length is relatively short. The rapid development of high-throughput sequencing technology has made it possible to predict the risk of disease.

Radiotherapy is a local treatment method that uses radiation of different energy to treat tumors. Its role and position in tumor treatment have become increasingly prominent, and it has become the main method for treating malignant tumors.

However, some patients may not be sensitive to radiotherapy, and applying radiotherapy to these patients does not achieve the desired therapeutic effect. Therefore, there is an urgent need for a method of discovering genes that are sensitive to a target treatment (such as radiotherapy).

Summary of the invention

This application provides a sensitive gene discovery method, device and storage medium, which can solve the problem that genes that are sensitive to the target treatment cannot be determined, which leads to the inability to determine whether the patient is sensitive to the target treatment, and the treatment effect may be poor. This application provides the following technical solutions:

In the first aspect, a method for discovering sensitive genes is provided, and the method includes:

Acquiring a patient data set, the patient data set including at least two sets of patient data, each set of patient data includes multiple gene expression data, clinical data, and survival data of the corresponding patient, and the clinical data includes whether a target treatment method is adopted;

Acquiring a gene discovery model, the model parameters of the gene discovery model including an effect parameter of the target treatment method, a gene effect parameter of each gene expression data, and an interaction effect parameter between the target treatment method and each gene expression data, the gene discovery model being used to predict genes that make the patient sensitive to the target treatment method;

Based on the K-Fold cross-validation algorithm, the patient data set is used to fit the model parameters of the gene discovery model to determine genes that are sensitive to the target treatment method in the multiple gene expression data.

Optionally, fitting the model parameters of the gene discovery model using the patient data set based on the K-Fold cross-validation algorithm, to determine genes that are sensitive to the target treatment method in the multiple gene expression data, includes:

Dividing the patient data set into N first data sets, where the u-th first data set is the verification set for the u-th training performed using the first data sets, and the other first data sets are the training set for the u-th training; u takes each integer from 1 to N in turn, and N is an integer greater than 1;

For each training performed using the first data sets, training the single-gene discovery model on the corresponding training set, again based on the K-Fold cross-validation algorithm, to obtain the trained single-gene discovery model and the interaction significance probability of each gene;

Determining, based on the trained single-gene discovery model, a training determination threshold and the number of training genes input to the gene discovery model, the training determination threshold being used to determine whether a patient in the corresponding verification set is a first patient, that is, a patient sensitive to the target treatment method;

According to the descending order of the significance probability of the interaction, inputting the gene expression data of several training genes in the corresponding verification set into the gene discovery model to obtain the model output result of the gene discovery model;

Comparing the output result with the training determination threshold to determine whether each patient in the verification set belongs to the first patient;

For the first patients obtained after the N trainings using the first data sets, determining the k genes of each first patient as genes sensitive to the target treatment method; the k genes are determined based on the genes that were input to the gene discovery model for the first patients.

Optionally, the method further includes:

Perform a log-rank test on the first survival curve of the patients treated with the target treatment method and the second survival curve of the patients who do not use the target treatment method among the first patients, and obtain the log-rank test result;

When the log-rank test result is less than the preset significance level, trigger execution of the step of determining the k genes of each first patient as genes sensitive to the target treatment method.

Optionally, for each training performed using the first data sets, training the single-gene discovery model on the corresponding training set, again based on the K-Fold cross-validation algorithm, to obtain the trained single-gene discovery model and the interaction significance probability of each gene, includes:

Divide the training set corresponding to the vth time into M second data sets;

Use the second data set to perform M parameter fitting on the preset single-gene discovery model based on the K-Fold cross-validation algorithm to obtain the fitted parameters;

Wherein, the single-gene discovery model is a gene discovery model corresponding to each gene;

The parameters after fitting include the target treatment means effect parameter corresponding to each gene, the gene effect parameter, the interaction effect parameter between the target treatment means and the gene expression data of the gene, and the significance probability of the interaction;

The p-th second data set is the verification set for the p-th fitting performed using the second data sets, and the other second data sets are the training set for the p-th fitting; v is a positive integer less than or equal to N, p takes each integer from 1 to M in turn, and M is an integer greater than 1.

Optionally, determining the training determination threshold and the number of training genes input to the gene discovery model based on the trained single-gene discovery model includes:

For each gene whose interaction effect parameter is less than 0, sort the genes in descending order of the significance probability of the interaction;

Obtain multiple sets of parameter combinations, each of which includes the number of genes and the threshold value;

For each group of parameter combinations, input the gene expression data of the number of genes given in the parameter combination into the gene discovery model in the sorted order to obtain the model output result; the gene discovery model includes the trained single-gene discovery model corresponding to each gene;

The model output result is compared with the determination threshold in the parameter combination to determine, in the corresponding training set, the first patients who are sensitive to the target treatment method and the second patients who are not sensitive to the target treatment method;

Perform a log-rank test on the first survival curve of the patients who will be treated with the target treatment method and the second survival curve of the patients who will not be treated with the target treatment method among the first patients, and obtain the log-rank test result;

From the log-rank test results corresponding to the multiple sets of parameter combinations, determine the parameter combination corresponding to the minimum value of the log-rank test result to obtain the training determination threshold and the number of training genes.

Optionally, the value of k is the same as the number of training genes obtained when the first data set is used for training for the first time.

Optionally, before fitting the model parameters of the gene discovery model using the patient data set based on the K-Fold cross-validation algorithm, the method further includes:

For each group of patient data, sequentially input the gene expression data in the patient data into the single gene discovery model;

For each gene expression data, using the survival data of the patient data as the output result of the single gene discovery model, determine the interaction effect parameter and the interaction significance probability corresponding to the gene expression data;

The genes whose interaction effect parameter is greater than or equal to 0 and the interaction significance probability is greater than a preset significance value are screened out.

Optionally, the gene discovery model is a proportional hazard regression model or a logistic model.

In a second aspect, a sensitive gene discovery device is provided, the device comprising:

The data acquisition module is used to acquire a patient data set, the patient data set including at least two sets of patient data, each set of patient data including multiple gene expression data, clinical data, and survival data of the corresponding patient, and the clinical data including whether a target treatment method is adopted;

The model acquisition module is used to acquire a gene discovery model, the model parameters of the gene discovery model including the effect parameter of the target treatment method, the gene effect parameter of each gene expression data, and the interaction effect parameter between the target treatment method and each gene expression data, the gene discovery model being used to predict genes that make the patient sensitive to the target treatment method;

The gene discovery module is configured to use the patient data set to fit the model parameters of the gene discovery model based on the K-Fold cross-validation algorithm to determine genes that are sensitive to the target treatment method in the multiple gene expression data.

In a third aspect, a sensitive gene discovery device is provided. The device includes a processor and a memory; the memory stores a program, and the program is loaded and executed by the processor to implement the sensitive gene discovery method described in the first aspect.

In a fourth aspect, a computer-readable storage medium is provided, and a program is stored in the storage medium, and the program is loaded and executed by the processor to realize the sensitive gene discovery method described in the first aspect.

The beneficial effects of this application are as follows. A patient data set is acquired, the patient data set including at least two sets of patient data, each set of patient data including multiple gene expression data, clinical data, and survival data of the corresponding patient, and the clinical data including whether the target treatment method is adopted. A gene discovery model is acquired, whose model parameters include the effect parameter of the target treatment method, the gene effect parameter of each gene expression data, and the interaction effect parameter between the target treatment method and each gene expression data; the gene discovery model is used to predict genes that make patients sensitive to the target treatment. Based on the K-Fold cross-validation algorithm, the patient data set is used to fit the model parameters of the gene discovery model to determine the genes that are sensitive to the target treatment among the multiple gene expression data. This solves the problem that, because the genes sensitive to the target treatment cannot be determined, it is impossible to determine whether a patient is sensitive to the target treatment and the treatment effect may be poor. Sensitive genes with statistically significant interaction effects with the target treatment can be screened out to construct a gene signature, and the gene signature can be used to accurately predict which patients are sensitive to the target treatment.

The above description is only an overview of the technical solution of the present application. In order to understand the technical means of the present application more clearly and implement it in accordance with the content of the description, the following detailed descriptions are given below with the preferred embodiments of the present application in conjunction with the accompanying drawings.

Description of the drawings

Fig. 1 is a flowchart of a method for discovering sensitive genes provided by an embodiment of the present application;

Fig. 2 is a schematic diagram of determining the number of training genes and the training determination threshold provided by an embodiment of the present application;

Fig. 3 is a flowchart of a method for discovering sensitive genes provided by another embodiment of the present application;

Fig. 4 is a block diagram of a sensitive gene discovery device provided by an embodiment of the present application;

Fig. 5 is a block diagram of a sensitive gene discovery device provided by an embodiment of the present application.

Detailed description

The specific implementation of the present application will be described in further detail below in conjunction with the accompanying drawings and embodiments. The following examples are used to illustrate the application, but are not used to limit the scope of the application.

First, introduce several terms involved in this application:

Sensitivity to a target treatment means: the characteristic that a patient obtains a better survival benefit when treated with the target treatment means. For example, radiosensitivity: when receiving radiotherapy, some patients obtain a better survival benefit. Radiotherapy is one of the treatment methods for malignant tumors, in which the tumor is irradiated with rays of different energies to inhibit and kill cancer cells.

Gene signature: A collection of genes, usually related to disease diagnosis and prognosis, and can be used to gain a deeper understanding of diseases from the perspective of genetic factors.

Cross-validation: Mainly used in modeling applications. Most of the samples are selected as the training set from the given modeling samples, and the remaining part is used as the validation set. The training set is used to train the classifier. Then use the validation set to test the trained model to evaluate the pros and cons of the classifier.

Hazard ratio: The ratio of the variable's risk rate at the exposure level to the risk rate at the non-exposure level. It can not only record the result of an event, but also reflect the time it takes for the event to occur.

Precision medicine: It is based on the patient's internal biological information and clinical symptoms and signs to implement tailor-made plans for the patient's health care and clinical decision-making.

Interaction significance probability: the P value of the test whose null hypothesis is that the interaction has no significant effect on survival and whose alternative hypothesis is that the interaction has a significant effect on survival. That is, it is the probability that rejecting the null hypothesis and concluding that the interaction significantly affects survival is a mistake. The more significant the interaction, the smaller the P value; the smaller the P value, the less likely it is that this conclusion is wrong. Usually, the interaction terms between the gene expression data and the target treatment are added to the gene discovery model, and the significance probability of each effect, including the interaction terms, can then be obtained.

Proportional hazards model: Also known as Cox regression model, it is a semiparametric regression model proposed by British statistician D.R.Cox in 1972. This model can be used to describe the impact of multiple characteristics that do not change over time on the mortality rate at a certain moment. It is an important model in survival analysis.

Logistic regression model: also known as logistic regression analysis, mainly used in epidemiology. For example, if you want to explore the risk factors of gastric cancer, you can choose two groups of people, one group is gastric cancer group, the other group is non-gastric cancer group, the two groups of people must have different physical signs and lifestyles.

Survival curve: Take the follow-up time as the horizontal axis and the survival rate as the vertical axis, connecting the points to form a curve.

Log-rank test: used to compare survival data between groups. The test statistic is a chi-square statistic.
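As an illustration of the log-rank test defined above, the following is a minimal pure-Python sketch for two groups. The function name `logrank_test`, its arguments, and the toy follow-up times are assumptions introduced for this example and are not part of the application; the sketch uses the standard two-group log-rank statistic with one degree of freedom.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, P value).

    times_*  : follow-up time of each patient in the group
    events_* : 1 if the death was observed, 0 if the patient was censored
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    # Distinct times at which at least one death was observed.
    event_times = sorted({t for t, e, _ in data if e == 1})

    observed_a = expected_a = variance = 0.0
    for t in event_times:
        # Patients still at risk just before time t, per group.
        n_a = sum(1 for ti, _, g in data if ti >= t and g == 0)
        n_b = sum(1 for ti, _, g in data if ti >= t and g == 1)
        # Deaths observed exactly at time t, per group.
        d_a = sum(1 for ti, e, g in data if ti == t and e == 1 and g == 0)
        d_b = sum(1 for ti, e, g in data if ti == t and e == 1 and g == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)

    if variance == 0:
        return 0.0, 1.0
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Toy data (hypothetical): group A dies early, group B dies late.
chi2, p_value = logrank_test([1, 2, 3, 4, 5], [1, 1, 1, 1, 1],
                             [10, 11, 12, 13, 14], [1, 1, 1, 1, 1])
```

A small P value suggests that the two survival curves differ; identical groups give a statistic of 0.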

Optionally, this application takes an electronic device as an example of the execution subject of each embodiment. The electronic device may be a terminal device such as a computer, a tablet, or a mobile phone; alternatively, it may be a server, etc. This embodiment does not limit the type of the electronic device.

Fig. 1 is a flowchart of a method for discovering sensitive genes provided by an embodiment of the present application. The method includes at least the following steps:

Step 101: Obtain a patient data set. The patient data set includes at least two sets of patient data. Each set of patient data includes multiple gene expression data, clinical data, and survival data of the corresponding patient, and the clinical data includes whether a target treatment method is adopted.

Among them, the target treatment method may be radiotherapy; of course, it may also be other gene-related treatment methods, which is not limited in this embodiment.

Gene expression data reflects the abundance of gene transcript mRNA in cells, measured directly or indirectly. These data can be used to analyze which genes have changed expression, how genes are correlated, and how gene activity is affected under different conditions. They have important applications in clinical diagnosis, judging drug efficacy, and revealing disease mechanisms. At present, the main methods for high-throughput detection of genomic mRNA abundance are cDNA microarrays and oligonucleotide chips. With the development of these high-throughput detection technologies, the mRNA transcripts of genes can be detected quantitatively or qualitatively at the whole-genome level. For example, referring to the gene expression data shown in Table 1, the gene expression data in each group of patient data includes a two-dimensional data set of the patients' standardized multi-gene mRNA expression values; the row variables of the two-dimensional data set represent different patients (Table 1 takes 371 patients as an example), and the column variables represent the expression values of different genes (Table 1 takes the expression values of 14343 genes as an example).

Table 1:

Patient	Gene 1	Gene 2	Gene 3	…	Gene 14343
Patient 1	-0.1419	-0.5401	-0.5733	…	-0.5251
Patient 2	-0.4113	-0.5112	-0.5161	…	-0.5347
…	…	…	…	…	…
Patient 371	-0.5173	-0.5173	-0.4237	…	-0.5754

Clinical data is used to indicate the patient's clinical presentation. Illustratively, the clinical data includes whether the target treatment is used. Clinical data may also include, but is not limited to: the patient's age at diagnosis of a certain disease, histological classification, pathological classification, and/or medication use. For example, referring to the clinical data in each group of patient data shown in Table 2, the clinical data includes a two-dimensional clinical data set covering various types of clinical data; the row variables of the two-dimensional clinical data set represent different patients (Table 2 takes 371 patients as an example), and the column variables represent different types of clinical data (Table 2 takes 5 types of clinical data as an example). The histological classification differs between diseases. For example, for thyroid adenoma, the histological classification includes follicular adenoma (including simple adenoma and eosinophilic adenoma), papillary adenoma, medullary tumors, and undifferentiated tumors; for lung cancer, the histological classification includes squamous cell carcinoma, adenocarcinoma, small cell carcinoma, and large cell carcinoma. This embodiment does not limit the histological classification. The pathological classification also differs between diseases. For example, the pathological classification of cirrhosis includes four types: small nodular cirrhosis, large nodular cirrhosis, mixed large and small nodular cirrhosis, and incomplete septal cirrhosis; the pathological classification of fatty liver includes simple fatty liver, steatohepatitis, fatty liver fibrosis, and fatty liver cirrhosis. This embodiment does not limit the pathological classification.

Table 2:

Survival data is used to indicate the survival of patients. Survival data includes the patient's survival outcome and the length of time that has elapsed. Among them, the survival outcome can be a survival risk value or a binary value (that is, survival or death). Based on the number of patients in Table 1 and Table 2, there are a total of 371 corresponding survival outcomes.

Step 102: Obtain a gene discovery model. The model parameters of the gene discovery model include the effect parameter of the target treatment method, the gene effect parameter of each gene expression data, and the interaction effect parameter between the target treatment method and each gene expression data. The gene discovery model is used to predict genes that make the patient sensitive to the target treatment.

Optionally, the gene discovery model is a proportional hazard regression model or a logistic model. When the gene discovery model is a proportional hazard regression model, the model output is the survival risk value; when the gene discovery model is a logistic model, the model output is a binary value.

Schematically, the gene discovery model is represented by the following formula:

$$h(t \mid X) = h_0(t)\exp\left(r\lambda + x_1 b_1 + x_2 b_2 + \dots + x_s b_s + r x_1 i_1 + r x_2 i_2 + \dots + r x_s i_s\right)$$

Here, $h_0(t)$ is the baseline hazard function; $\lambda$ is the effect parameter of the target treatment method; $r$ indicates whether the target treatment method is received: if the target treatment method is received, the value of $r$ is 1; if it is not received, the value of $r$ is 0; $x_1, x_2, \dots, x_s$ are the gene expression data corresponding to each gene; $b_1, b_2, \dots, b_s$ are the gene effect parameters of each gene; $i_1, i_2, \dots, i_s$ are the interaction effect parameters between the target treatment method and each gene expression data, which reflect how the effect of the target treatment method on survival is influenced by the expression level of the corresponding gene.

For any patient, the hazard ratio (HR) is $\exp(r\lambda + x_j b_j + r x_j i_j)$. If the interaction effect parameter is negative, the HR may be less than 1. In this case, the survival rate of patients who are sensitive to the target treatment is higher than the survival rate of patients who do not receive the target treatment. If some patients have genes that are sensitive to the target treatment, their total hazard ratio (nHR) will tend to be very small, less than the preset threshold. These patients, who have a relatively high expected survival rate, are the first patients who are sensitive to the target treatment, and the subset of gene expression data input into the gene discovery model is the gene expression data of the sensitive genes that are sensitive to the target treatment.
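The hazard ratio expression above can be sketched as follows. The function name `total_hazard_ratio` and all numeric values are hypothetical, and the sketch assumes that the total hazard ratio (nHR) of a patient is the exponential of the full linear predictor of the gene discovery model, summed over the genes input to the model.

```python
import math


def total_hazard_ratio(r, x, lam, b, i):
    """Return exp(r*lam + sum_j x_j*b_j + r * sum_j x_j*i_j) for one patient.

    r   : 1 if the patient received the target treatment, else 0
    x   : gene expression values of the genes input to the model
    lam : effect parameter of the target treatment
    b   : gene effect parameters, one per gene
    i   : interaction effect parameters, one per gene
    """
    linear = r * lam
    for x_j, b_j, i_j in zip(x, b, i):
        linear += x_j * b_j + r * x_j * i_j
    return math.exp(linear)


# Hypothetical parameters for two genes with negative interaction effects.
lam, b, i = -0.2, [0.1, -0.3], [-0.5, -0.4]
x = [0.8, 1.2]

nhr_treated = total_hazard_ratio(1, x, lam, b, i)
nhr_untreated = total_hazard_ratio(0, x, lam, b, i)
# A patient would be classed as a "first patient" when the value under
# treatment falls below a threshold; 0.5 is an arbitrary choice here.
is_first_patient = nhr_treated < 0.5
```

With negative interaction effects and positive expression values, the treated value is smaller than the untreated one, which is the pattern the text associates with sensitivity.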

Based on the foregoing principles, it can be known that in this embodiment, the gene discovery model can be trained to obtain sensitive genes that are sensitive to the target treatment.

Step 103: Use the patient data set to fit the model parameters of the gene discovery model based on the K-Fold cross-validation algorithm, so as to determine genes that are sensitive to the target treatment method in the multiple gene expression data.

In one example, the electronic device can use all the gene expression data of the patient to fit the model parameters of the gene discovery model.

In another example, the electronic device can perform a preliminary screening of all gene expression data of the patients to screen out genes that are obviously insensitive to the target treatment. In this case, before this step, for each group of patient data, the gene expression data in the patient data are sequentially input into the single-gene discovery model; for each gene expression data, the survival data of the patient data is used as the output result of the single-gene discovery model to determine the interaction effect parameter and the interaction significance probability corresponding to the gene expression data; the genes whose interaction effect parameter is greater than or equal to 0 and whose interaction significance probability is greater than the preset significance value are screened out.

Here, the single-gene discovery model refers to the model corresponding to each gene, and its model parameters are of the same types as those of the gene discovery model. The effect parameter of the target treatment in the gene discovery model is the average of the effect parameters of the target treatment in the single-gene discovery models; the gene effect parameter of each gene in the gene discovery model is the gene effect parameter of the single-gene discovery model corresponding to that gene; the interaction effect parameter of each gene in the gene discovery model is the interaction effect parameter of the single-gene discovery model corresponding to that gene.
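The relationship between the single-gene discovery models and the gene discovery model described above can be sketched as follows. The function name `assemble_gene_discovery_model`, the dictionary keys, and the fitted values are hypothetical; the fitting of each single-gene model is assumed to have been done elsewhere.

```python
def assemble_gene_discovery_model(single_gene_fits):
    """Combine per-gene fits into the parameters of the gene discovery model.

    single_gene_fits maps a gene name to a dict with the keys
    'treatment', 'gene' and 'interaction' (the three fitted effects).
    """
    genes = list(single_gene_fits)
    # Treatment effect: average of the treatment effects of all single-gene models.
    lam = sum(single_gene_fits[g]["treatment"] for g in genes) / len(genes)
    # Gene and interaction effects: taken directly from each gene's own model.
    b = {g: single_gene_fits[g]["gene"] for g in genes}
    i = {g: single_gene_fits[g]["interaction"] for g in genes}
    return lam, b, i


# Hypothetical fitted values for two genes.
fits = {
    "gene_1": {"treatment": -0.30, "gene": 0.10, "interaction": -0.50},
    "gene_2": {"treatment": -0.10, "gene": -0.20, "interaction": -0.40},
}
lam, b, i = assemble_gene_discovery_model(fits)
```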

In this case, the electronic device multiplies each patient's gene expression data by the indicator of whether the target treatment method is used to obtain the interaction term between the target treatment method and each gene expression data; for each gene, it constructs a single-gene discovery model relating the survival data to the target treatment method, the gene expression data, and the interaction term between the target treatment method and that gene's expression data. Take the gene expression data in Table 1 as an example. Since there are expression data for 14343 genes, 14343 single-gene discovery models need to be constructed. Each single-gene discovery model yields the effect parameters, the standard errors of the effect parameters, and the interaction significance probability. The preliminary screening of sensitive genes that are sensitive to the target treatment includes: determining whether the interaction effect parameter is less than 0 and whether the interaction significance probability of the interaction term is less than the preset significance value (such as 0.05); a gene that satisfies both conditions is determined to be a sensitive gene. That is, the genes whose interaction effect parameter is greater than or equal to 0 and whose interaction significance probability is greater than the preset significance value are screened out. Suppose that, after screening the gene expression data described in Table 1, 268 sensitive genes that meet the above requirements are retained.

Wherein, the preset significant value is described by taking 0.05 as an example. In actual implementation, the preset significant value may also be other values, and this embodiment does not limit the value of the preset significant value.
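The preliminary screening described above can be sketched as follows. The names `interaction_terms` and `screen_genes` and the fitted values are hypothetical; the sketch assumes each single-gene discovery model has already been fitted and reports an interaction effect parameter and its significance probability, and it retains a gene only when the interaction effect is negative and the significance probability is below the preset value.

```python
def interaction_terms(treatment, expression):
    """Multiply the 0/1 treatment indicator by each gene expression value."""
    return [treatment * value for value in expression]


def screen_genes(single_gene_results, alpha=0.05):
    """Keep genes with a negative, significant interaction effect.

    single_gene_results maps a gene name to a tuple
    (interaction effect parameter, interaction significance probability).
    """
    kept = []
    for gene, (interaction_effect, p_value) in single_gene_results.items():
        if interaction_effect < 0 and p_value < alpha:
            kept.append(gene)
    return kept


# Hypothetical results of three single-gene discovery models.
results = {
    "gene_1": (-0.42, 0.003),  # negative and significant: retained
    "gene_2": (0.15, 0.010),   # positive interaction: removed
    "gene_3": (-0.08, 0.400),  # not significant: removed
}
selected = screen_genes(results)
terms_treated = interaction_terms(1, [-0.1419, -0.5401])
```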

Among them, based on the K-Fold cross-validation algorithm, the patient data set is used to fit the model parameters of the gene discovery model to determine the genes that are sensitive to the target treatment in multiple gene expression data, including:

1. Divide the patient data set into N first data sets, where the u-th first data set is the verification set for the u-th training performed using the first data sets, and the other first data sets are the training set for the u-th training; u takes each integer from 1 to N in turn, and N is an integer greater than 1.

The sizes of the N first data sets may be the same or different. Take 371 patients as an example: divide the patient data set into 10 (that is, N=10) parts of the same size and perform 10-fold cross-validation. For the first training, nine parts, that is, the patient data of 334 patients, are used as the training set to fit the model parameters of the gene discovery model, and the remaining part, that is, the patient data of 37 patients, is used as the validation set, on which the fitted gene discovery model is used to calculate the nHR values. In the second training, one of the nine parts used for training the first time is used as the validation set, the 37 patients that served as the validation set the first time are added to the training set, and the gene discovery model is re-fitted to calculate the nHR values. After 10 trainings with 10 different validation sets, all patient data are covered, yielding nHR values for all 371 patients.

Of course, N may also take another value, and this embodiment does not limit the value of N.
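The outer split described above can be sketched as follows. The function name `k_fold_indices` and the fixed random seed are assumptions for illustration. Because 371 is not divisible by 10, this sketch produces one fold of 38 patients and nine folds of 37; the 334/37 split in the text corresponds to a round whose validation fold has 37 patients.

```python
import random


def k_fold_indices(n_samples, n_folds, seed=0):
    """Shuffle sample indices and deal them into n_folds nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[k::n_folds] for k in range(n_folds)]


folds = k_fold_indices(371, 10)

split_sizes = []
for u, validation in enumerate(folds):
    # The u-th fold is the validation set; all other folds form the training set.
    training = [idx for v, fold in enumerate(folds) if v != u for idx in fold]
    split_sizes.append((len(training), len(validation)))
    # Here the model would be fitted on `training` and the nHR values
    # computed for the patients in `validation`.
```

Each patient appears in exactly one validation fold, so after the 10 rounds every patient has been scored once.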

2. For each training using the first data sets, train the single-gene discovery model on the corresponding training set, again based on the K-Fold cross-validation algorithm, to obtain the trained single-gene discovery model and the interaction significance probability of each gene.

In this application, the K-Fold cross-validation algorithm is nested again within each cross-validation training. Schematically, the training set corresponding to the v-th training is divided into M second data sets; based on the K-Fold cross-validation algorithm, the second data sets are used to perform M rounds of parameter fitting on the preset single-gene discovery model to obtain the fitted parameters.

The fitted parameters include, for each gene, the target treatment effect parameter, the gene effect parameter, the interaction effect parameter between the target treatment method and the gene expression data of the gene, and the interaction significance probability. The p-th second data set is the validation set for the p-th fitting using the second data sets, and the other second data sets are the training set for the p-th fitting; v is a positive integer less than or equal to N, p takes each integer from 1 to M in turn, and M is an integer greater than 1. The sizes of the M second data sets may be the same or different.

For example, continuing the above example, for each of the 10 trainings using the first data sets, the 334 samples are divided into 10 parts of the same size and 10-fold cross-validation is performed (i.e., M=10). In this embodiment, the purpose of the inner nested training is to find the optimal parameter combination of the number of genes and the decision threshold. Similarly, each inner training uses nine parts of patient data for training and one part for validation, but the amount of patient data differs: the nine parts contain the data of 301 patients, and the one part contains the data of 33 patients. For each gene (268 in total after screening), a single-gene discovery model is constructed from the survival data, the target treatment method, the gene expression data, and the interaction term between the target treatment method and the gene expression data. The genes with a negative interaction effect are selected and arranged in descending order of the interaction significance probability.

Of course, M may also take another value, and this embodiment does not limit the value of M.
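The nesting of the inner K-Fold loop inside the outer one can be sketched as below. This is an illustrative sketch only: the helper names `split_parts` and `nested_folds` are hypothetical, the parts are taken in their given order without shuffling, and the model fitting itself is left out. For an outer training set of 334 patients and M=10, some inner splits come out as 301 training patients and 33 validation patients, as in the example above.

```python
def split_parts(items, n_parts):
    """Split a list into n_parts consecutive parts whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        size = base + (1 if i < extra else 0)
        parts.append(items[start:start + size])
        start += size
    return parts


def nested_folds(patient_ids, n_outer, n_inner):
    """Yield (u, p, inner_train, inner_val, outer_val) for every inner fitting."""
    outer_parts = split_parts(list(patient_ids), n_outer)
    for u, outer_val in enumerate(outer_parts, start=1):
        # Outer training set: every first data set except the u-th one.
        outer_train = [pid for part in outer_parts if part is not outer_val for pid in part]
        inner_parts = split_parts(outer_train, n_inner)
        for p, inner_val in enumerate(inner_parts, start=1):
            # Inner training set: every second data set except the p-th one.
            inner_train = [pid for part in inner_parts if part is not inner_val for pid in part]
            yield u, p, inner_train, inner_val, outer_val


rounds = list(nested_folds(range(371), n_outer=10, n_inner=10))
print(len(rounds))  # number of inner fittings, N * M
```

The outer validation set is never seen by any inner fitting, which is what allows the nHR values computed on it to serve as an unbiased check.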

3. Based on the trained single-gene discovery models, determine the training decision threshold and the number of training genes to be input into the gene discovery model. The training decision threshold is used to determine whether a patient in the corresponding validation set is a first patient, that is, a patient sensitive to the target treatment.

Specifically, the genes whose interaction effect parameter is less than 0 are sorted in descending order of the interaction significance probability. Multiple parameter combinations are obtained, each including a number of genes and a decision threshold. For each parameter combination, the gene expression data of that number of genes, taken in the sorted order, are input into the gene discovery model to obtain the model output; the gene discovery model includes the trained single-gene discovery model corresponding to each gene. The model output is compared with the decision threshold in the parameter combination to determine, in the corresponding training set, the first patients who are sensitive to the target treatment method and the second patients who are not. For the first patients, a log-rank test is performed on the first survival curve of those treated with the target treatment method and the second survival curve of those not treated with it, giving a log-rank test result. Among the log-rank test results corresponding to the multiple parameter combinations, the parameter combination corresponding to the minimum log-rank test result gives the training decision threshold and the number of training genes.

For example, the value range of the number of genes is [2, 150], and the value range of the decision threshold is [0.01, 0.5]. A number of genes is taken from the first range and a decision threshold from the second to form a parameter combination, resulting in 4750 parameter combinations. Following the nested training of step 2, the effect parameters of each single-gene discovery model are obtained. These effect parameters are applied to the gene discovery model, and the nHR value of each patient is calculated under every parameter combination within the current outer training (that is, the training using the first data sets). By comparing the nHR with the corresponding decision threshold, each patient is judged to be a first patient who is sensitive to the target treatment method or a second patient who is not. For the first patients, a log-rank test compares the survival curves of those who received the target treatment method and those who did not. The parameter combination with the smallest log-rank test result gives the optimal number of genes and the optimal threshold. Referring to the number of training genes and the training decision threshold automatically selected by the electronic device, shown in FIG. 2, the number of training genes is 33 and the training decision threshold is 0.01.
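The search over parameter combinations in step 3 can be sketched as a plain grid search. This is a hedged sketch, not the patented implementation: `select_parameters` and its two callables are hypothetical placeholders, the toy data are invented for illustration, and the way the 4750 combinations are enumerated in the embodiment is not specified in the text, so the grid here is arbitrary.

```python
from itertools import product


def select_parameters(gene_counts, thresholds, nhr_for_gene_count, log_rank_result):
    """Return (g, R, result) for the combination with the smallest log-rank result.

    nhr_for_gene_count(g) -> {patient_id: nHR} computed from the top g ranked genes.
    log_rank_result(first_patients) -> log-rank test result for that subgroup.
    Both callables are hypothetical placeholders supplied by the caller.
    """
    best = None
    for g, r in product(gene_counts, thresholds):
        nhr = nhr_for_gene_count(g)
        # A patient is a first patient (sensitive) when the nHR is below the threshold.
        first_patients = {pid for pid, value in nhr.items() if value < r}
        result = log_rank_result(first_patients)
        if best is None or result < best[2]:
            best = (g, r, result)
    return best


# Toy stand-ins, for illustration only; they are not data from the embodiment.
toy_nhr = {
    2: {"a": 0.005, "b": 0.2, "c": 0.02},
    3: {"a": 0.3, "b": 0.004, "c": 0.008},
}


def toy_result(first_patients):
    # Pretend the test result shrinks as more patients are selected.
    return 1.0 / (1 + len(first_patients))


best = select_parameters([2, 3], [0.01, 0.05], toy_nhr.__getitem__, toy_result)
print(best)
```

On ties the first combination encountered is kept, which is one possible tie-breaking rule; the text does not state how ties are resolved.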

4. In descending order of the interaction significance probability, input the gene expression data of the training genes (as many genes as the number of training genes) from the corresponding validation set into the gene discovery model to obtain the model output of the gene discovery model.

5. Compare the model output with the training decision threshold to determine whether each patient in the validation set is a first patient.

Using the number of training genes g and the training decision threshold R determined in step 3, the model output nHR of the first g genes is calculated for each patient in the validation set of the nested training as follows:

Here, λ is the effect parameter of the target treatment, taken as the average of the values estimated by the g (33 in Figure 2) single-gene models. If the nHR is less than 0.01, the patient is classified as a first patient who is sensitive to the target treatment; if the nHR is greater than or equal to 0.01, the patient is classified as a second patient who is not sensitive to the target treatment. Since each patient appears exactly once in a validation set, all patients will have been classified as first patients or second patients after the training process of steps 1 to 5.
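The classification rule can be sketched as follows, taking the nHR values as given. The function name `classify_across_folds` and the patient identifiers are hypothetical; the only rule taken from the text is that an nHR below the training decision threshold marks a first patient and an nHR greater than or equal to it marks a second patient.

```python
def classify_across_folds(validation_nhr_per_fold, threshold=0.01):
    """Merge per-fold validation results into first (sensitive) and second patients.

    validation_nhr_per_fold: one {patient_id: nHR} mapping per outer training.
    """
    first, second = set(), set()
    for fold in validation_nhr_per_fold:
        for pid, nhr in fold.items():
            # Strictly below the threshold means sensitive to the target treatment.
            (first if nhr < threshold else second).add(pid)
    return first, second


first, second = classify_across_folds([{"p1": 0.005, "p2": 0.01}, {"p3": 0.4}])
print(sorted(first), sorted(second))
```

Because every patient appears in exactly one validation set, the two returned sets together cover the whole data set once all outer trainings have run.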

6. For the first patients obtained after the N trainings using the first data sets, determine the k genes of each first patient as the genes that are sensitive to the target treatment, where the k genes are determined from the genes input into the gene discovery model for the first patients.

Optionally, for the first patients obtained after the N trainings using the first data sets, a log-rank test is performed on the first survival curve of the first patients treated with the target treatment method and the second survival curve of the first patients not treated with it, giving a log-rank test result; when the log-rank test result is less than the preset significance level, the step of determining the k genes of each first patient as genes that are sensitive to the target treatment is executed.

Optionally, the process ends when the log-rank test result is greater than or equal to the preset significance level.

For example, for the patients predicted to be first patients, the survival curves of those who received the target treatment and those who did not are compared by a log-rank test at a specified significance level (i.e., the preset significance level). If survival is significantly improved (that is, the log-rank test result is less than the significance level), the target treatment method is beneficial to the first patients; that is, the gene signature is effective and the prediction of the first patients is accurate. In this example, 371 patients are predicted, of which 174 are first patients and 197 are second patients.

The preset significance level may be 0.05 or another value; this embodiment does not limit the value of the preset significance level.
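The comparison of the two survival curves can be sketched with a standard two-group log-rank test. This is a minimal sketch under assumptions: the text does not say how the test is implemented, so the function name `logrank_p_value`, the (time, event) input format, and the reading of the "log-rank test result" as a p-value from a chi-square distribution with one degree of freedom are all assumptions.

```python
import math


def logrank_p_value(group_a, group_b):
    """Two-sided log-rank p-value for two groups of (time, event) pairs.

    event is 1 for an observed death and 0 for a censored observation.
    """
    event_times = sorted({t for t, e in group_a + group_b if e})
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n_a = sum(1 for time, _ in group_a if time >= t)  # at risk in group A
        n_b = sum(1 for time, _ in group_b if time >= t)  # at risk in group B
        d_a = sum(1 for time, e in group_a if time == t and e)
        d_b = sum(1 for time, e in group_b if time == t and e)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            # Hypergeometric variance of the number of events in group A.
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        return 1.0
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


treated = [(10, 1), (11, 1), (12, 1), (13, 1), (14, 1)]
untreated = [(1, 1), (2, 1), (3, 1), (4, 1), (5, 1)]
significant = logrank_p_value(treated, untreated) < 0.05  # compare with the preset level
print(significant)
```

Note that the test itself is two-sided; checking that survival is improved rather than worsened would additionally require comparing the direction of the difference.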

In an example, the value of k is the same as the number of training genes obtained when the first data set is used for training for the first time.

Of course, the value of k can also be the average value of the number of training genes obtained when the first data set is used for training N times. This embodiment does not limit the value of k.

Optionally, to make the sensitive gene discovery method provided in this application easier to understand, the method is described below with an example. Refer to the sensitive gene discovery process shown in Figure 3, which includes at least steps 31 to 41:

Step 31: Divide the patient data set into N parts, where N-1 parts are the training set and 1 part is the validation set;

Step 32: Divide the training set of N-1 parts into M parts, where M-1 parts are the training set and 1 part is the validation set;

Step 33: Use the M-1 parts as the training set to construct single-gene discovery models from the survival data, the target treatment method, the gene expression data, and the interaction term between the target treatment method and each gene expression data, and obtain the fitted parameters; the fitted parameters include, for each gene, the target treatment effect parameter, the gene effect parameter, the interaction effect parameter between the target treatment method and the gene expression data of the gene, and the interaction significance probability;

Step 34: Use the validation set obtained in step 32 to verify the model output result of the gene discovery model;

Step 35: Determine whether the current number of trainings has reached M; if yes, go to step 36; if not, return to step 32;

It should be added that the validation set obtained at each execution of step 32 is different.

Step 36: Use the training set obtained in step 31 to calculate the model output result of the gene discovery model;

Step 37: According to the different parameter combinations of the number of genes g and the decision threshold R, divide the patients into first patients and second patients, perform a log-rank test on the first patients to compare the difference in survival rate between the first patients who received the target treatment and those who did not, and determine the number of training genes g and the training decision threshold R corresponding to the minimum log-rank test result;

Step 38: Use the validation set obtained in step 31, together with the number of training genes g and the training decision threshold R determined in step 37, to recalculate the model output result of the gene discovery model;

Step 39: Determine whether the current number of trainings has reached N; if yes, go to step 40; if not, return to step 31;

It should be added that the validation set obtained at each execution of step 31 is different.

Step 40: Compare the model output results obtained after the N trainings with the corresponding training decision threshold, and output the gene signature comprising g genes together with each first patient;

Here, the training decision threshold may be the training decision threshold determined by the first outer training.

Step 41: Perform a log-rank test on the first patients to compare the difference in survival rate between those receiving the target treatment and those not receiving it; if the difference is significant, the target treatment is beneficial to the first patients, that is, the gene signature is effective and the prediction of the first patients is accurate.

To sum up, the sensitive gene discovery method provided in this embodiment acquires a patient data set, where the patient data set includes at least two sets of patient data, each set of patient data includes multiple gene expression data, clinical data, and survival data of the corresponding patient, and the clinical data includes whether the target treatment method is used; acquires a gene discovery model, where the model parameters of the gene discovery model include the target treatment method effect parameter, the gene effect parameter of each gene expression data, and the interaction effect parameter between the target treatment method and each gene expression data, and the gene discovery model is used to predict the genes of a patient that are sensitive to the target treatment method; and, based on the K-Fold cross-validation algorithm, uses the patient data set to fit the model parameters of the gene discovery model to determine the genes that are sensitive to the target treatment method among the multiple gene expression data. This solves the problem that, because the genes sensitive to the target treatment method cannot be determined, it cannot be determined whether a patient is sensitive to the target treatment method, so that the treatment effect may be poor. Sensitive genes whose interaction effect with the target treatment method is statistically significant can be screened out to construct a gene signature, and the gene signature can be used to accurately predict the patients who are sensitive to the target treatment method.

In addition, by using a cross-validation scheme, a smaller sample size can be used to discover sensitive genes, which can solve the problem of a smaller actual sample size.

Fig. 4 is a block diagram of a sensitive gene discovery device provided by an embodiment of the present application. The device includes at least the following modules: a data acquisition module 410, a model acquisition module 420, and a gene discovery module 430.

The data acquisition module 410 is used to acquire a patient data set. The patient data set includes at least two sets of patient data. Each set of patient data includes multiple gene expression data, clinical data, and survival data of the corresponding patient. The clinical data includes whether the target treatment method is used;

The model acquisition module 420 is used to acquire a gene discovery model. The model parameters of the gene discovery model include the target treatment method effect parameter, the gene effect parameter of each gene expression data, and the interaction effect parameter between the target treatment method and each gene expression data; the gene discovery model is used to predict the genes of a patient that are sensitive to the target treatment method;

The gene discovery module 430 is configured to use the patient data set to fit the model parameters of the gene discovery model based on the K-Fold cross-validation algorithm to determine the genes that are sensitive to the target treatment method among the multiple gene expression data.

For related details, refer to the above method embodiment.

It should be noted that when the sensitive gene discovery device provided in the above embodiment performs sensitive gene discovery, the division into the above functional modules is used only as an example. In actual applications, the above functions can be assigned to different functional modules as needed; that is, the internal structure of the sensitive gene discovery device can be divided into different functional modules to complete all or part of the functions described above. In addition, the sensitive gene discovery device provided in the above embodiment and the embodiment of the sensitive gene discovery method belong to the same concept; the specific implementation process is detailed in the method embodiment and is not repeated here.

Fig. 5 is a block diagram of a sensitive gene discovery device provided by an embodiment of the present application. The device at least includes a processor 501 and a memory 502.

The processor 501 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so on. The processor 501 may be implemented in at least one hardware form among DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array). The processor 501 may also include a main processor and a coprocessor. The main processor is a processor used to process data in the awake state, also called a CPU (Central Processing Unit); the coprocessor is a low-power processor used to process data in the standby state. In some embodiments, the processor 501 may be integrated with a GPU (Graphics Processing Unit), and the GPU is used for rendering and drawing the content that needs to be displayed on the display screen. In some embodiments, the processor 501 may further include an AI (Artificial Intelligence) processor, which is used to process computing operations related to machine learning.

The memory 502 may include one or more computer-readable storage media, which may be non-transitory. The memory 502 may also include high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices and flash memory storage devices. In some embodiments, the non-transitory computer-readable storage medium in the memory 502 is used to store at least one instruction, and the at least one instruction is executed by the processor 501 to implement the sensitive gene discovery method provided by the method embodiments of this application.

In some embodiments, the sensitive gene discovery apparatus may optionally further include: a peripheral device interface and at least one peripheral device. The processor 501, the memory 502, and the peripheral device interface may be connected through a bus or a signal line. Each peripheral device can be connected to the peripheral device interface through a bus, a signal line or a circuit board. Illustratively, peripheral devices include but are not limited to: radio frequency circuits, touch screens, audio circuits, and power supplies.

Of course, the sensitive gene discovery device may also include fewer or more components, which is not limited in this embodiment.

Optionally, the present application also provides a computer-readable storage medium in which a program is stored, and the program is loaded and executed by a processor to implement the sensitive gene discovery method of the foregoing method embodiment.

Optionally, this application also provides a computer product including a computer-readable storage medium in which a program is stored, and the program is loaded and executed by a processor to implement the sensitive gene discovery method of the foregoing method embodiment.

The technical features of the above-mentioned embodiments can be combined arbitrarily. To keep the description concise, not all possible combinations of the technical features in the above-mentioned embodiments are described. However, as long as there is no contradiction in the combination of these technical features, all such combinations should be considered to be within the scope of this specification.

The above-mentioned embodiments only express several implementation manners of the present application, and the description is relatively specific and detailed, but it should not be understood as a limitation on the scope of the invention patent. It should be pointed out that for those of ordinary skill in the art, without departing from the concept of this application, several modifications and improvements can be made, and these all fall within the protection scope of this application. Therefore, the scope of protection of the patent of this application shall be subject to the appended claims.

Claims

A sensitive gene discovery method, characterized in that the method comprises:

Acquiring a patient data set, the patient data set including at least two sets of patient data, each set of patient data includes multiple gene expression data, clinical data, and survival data of the corresponding patient, and the clinical data includes whether a target treatment method is adopted;

Acquiring a gene discovery model, the model parameters of the gene discovery model including the target treatment method effect parameter, the gene effect parameter of each gene expression data, and the interaction effect parameter between the target treatment method and each gene expression data, the gene discovery model being used to predict the genes of a patient that are sensitive to the target treatment method;

Fitting the model parameters of the gene discovery model using the patient data set based on the K-Fold cross-validation algorithm, to determine the genes that are sensitive to the target treatment method among the multiple gene expression data.
The method of claim 1, wherein fitting the model parameters of the gene discovery model using the patient data set based on the K-Fold cross-validation algorithm, to determine the genes that are sensitive to the target treatment method among the multiple gene expression data, comprises:

Dividing the patient data set into N first data sets, where the u-th first data set is the validation set for the u-th training using the first data sets, and the other first data sets are the training set for the u-th training, the u takes each integer from 1 to N in turn, and the N is an integer greater than 1;

For each training performed using the first data sets, using the corresponding training set to train the single-gene discovery model, again based on the K-Fold cross-validation algorithm, to obtain the trained single-gene discovery model and the interaction significance probability of each gene;

Determining, based on the trained single-gene discovery model, a training decision threshold and the number of training genes input to the gene discovery model, the training decision threshold being used to determine whether a patient in the corresponding validation set is a first patient sensitive to the target treatment method;

In descending order of the interaction significance probability, inputting the gene expression data of the training genes (as many genes as the number of training genes) in the corresponding validation set into the gene discovery model to obtain the model output result of the gene discovery model;

Comparing the output result with the training decision threshold to determine whether each patient in the validation set is a first patient;

For each first patient obtained after training using the first data sets N times, determining the k genes of each first patient as genes sensitive to the target treatment method; the k genes are determined based on the genes input to the gene discovery model for the first patient.
The method according to claim 2, wherein the method further comprises:

Perform a log-rank test on the first survival curve of the patients treated with the target treatment method and the second survival curve of the patients who do not use the target treatment method among the first patients, and obtain the log-rank test result;

When the log-rank test result is less than the preset significance level, trigger execution of the step of determining the k genes of each first patient as genes sensitive to the target treatment method.
The method according to claim 2, wherein, for each training performed using the first data sets, using the corresponding training set to train the single-gene discovery model, again based on the K-Fold cross-validation algorithm, to obtain the trained single-gene discovery model and the interaction significance probability of each gene, comprises:

Dividing the training set corresponding to the v-th training into M second data sets;

Using the second data sets to perform M rounds of parameter fitting on the preset single-gene discovery model based on the K-Fold cross-validation algorithm to obtain the fitted parameters;

Wherein, the single-gene discovery model is a gene discovery model corresponding to each gene;

The fitted parameters include, for each gene, the target treatment method effect parameter, the gene effect parameter, the interaction effect parameter between the target treatment method and the gene expression data of the gene, and the interaction significance probability;

The p-th second data set is the validation set for the p-th fitting using the second data sets, and the other second data sets are the training set for the p-th fitting, the v is a positive integer less than or equal to the N, the p takes each integer from 1 to M in turn, and the M is an integer greater than 1.
The method according to claim 2, wherein determining, based on the trained single-gene discovery model, the training decision threshold and the number of training genes input to the gene discovery model comprises:

For each gene whose interaction effect parameter is less than 0, sort the genes in descending order of the significance probability of the interaction;

Obtain multiple sets of parameter combinations, each of which includes the number of genes and the threshold value;

For each parameter combination, inputting the gene expression data of the genes in the parameter combination into the gene discovery model in the sorted order to obtain the model output result; the gene discovery model includes the trained single-gene discovery model corresponding to each gene;

Comparing the model output result with the decision threshold in the parameter combination to determine, in the corresponding training set, the first patients who are sensitive to the target treatment method and the second patients who are not sensitive to the target treatment method;

Performing a log-rank test on the first survival curve of the first patients treated with the target treatment method and the second survival curve of the first patients not treated with the target treatment method, to obtain the log-rank test result;

From the log-rank test results corresponding to the multiple sets of parameter combinations, determine the parameter combination corresponding to the minimum value of the log-rank test result to obtain the training determination threshold and the number of training genes.
The method according to claim 2, wherein the value of k is the same as the number of training genes obtained when the first data set is used for training for the first time.
The method according to any one of claims 1 to 6, wherein before fitting the model parameters of the gene discovery model using the patient data set based on the K-Fold cross-validation algorithm, the method further comprises:

For each group of patient data, sequentially input the gene expression data in the patient data into the single gene discovery model;

For each gene expression data, using the survival data of the patient data as the output result of the single gene discovery model, determine the interaction effect parameter and the interaction significance probability corresponding to the gene expression data;

The genes whose interaction effect parameter is greater than or equal to 0 and whose interaction significance probability is greater than a preset significance value are filtered out.
The method according to any one of claims 1 to 6, wherein the gene discovery model is a proportional hazard regression model or a logistic model.
A sensitive gene discovery device, characterized in that the device comprises:

The data acquisition module is used to acquire a patient data set, the patient data set includes at least two sets of patient data, each set of patient data includes multiple gene expression data, clinical data, and survival data of the corresponding patient, and the clinical data includes whether the target treatment method is used;

The model acquisition module is used to acquire a gene discovery model, the model parameters of the gene discovery model include the target treatment method effect parameter, the gene effect parameter of each gene expression data, and the interaction effect parameter between the target treatment method and each gene expression data, and the gene discovery model is used to predict the genes of a patient that are sensitive to the target treatment method;

The gene discovery module is configured to use the patient data set to fit the model parameters of the gene discovery model based on the K-Fold cross-validation algorithm to determine the genes that are sensitive to the target treatment method among the multiple gene expression data.
A computer-readable storage medium, wherein a program is stored in the storage medium, and the program is used to implement the sensitive gene discovery method according to any one of claims 1 to 7 when the program is executed by a processor.