Multiple myeloma molecular typing and application
Technical Field
The invention belongs to the technical field of biology, and particularly relates to multiple myeloma molecular typing and application.
Background
Multiple myeloma (MM) is a tumor caused by malignant proliferation of plasma cells. It is the second most common hematological tumor, with an incidence of 1-2 per 100,000 in China. Multiple myeloma occurs predominantly in people older than 60 years, and its incidence is rising year by year as the population in China ages, making it a disease that seriously threatens the health of the elderly. Multiple myeloma is typically characterized by the presence of a large number of abnormally proliferating plasma cells in the bone marrow, which secrete an abnormal immunoglobulin or immunoglobulin fragment, the M protein.
With the use of proteasome inhibitors such as bortezomib and immunomodulatory drugs such as lenalidomide, the survival of multiple myeloma patients has improved significantly. However, multiple myeloma still cannot be completely cured. The disease is highly heterogeneous biologically and clinically, so the response to multi-drug combination therapy and the survival after treatment vary greatly among patients. The biological mechanisms responsible for these differences are not yet fully understood, which hinders personalized precision therapy to some extent. Therefore, to deepen understanding of the biology of multiple myeloma and to assist clinical treatment decisions, a simple and reliable molecular typing system is urgently needed. Several multiple myeloma molecular typing systems have been proposed internationally. For example, Bergsagel et al. identified 8 multiple myeloma subtypes with distinct cyclin D expression and chromosomal translocations. Using unbiased, hypothesis-free transcriptome analysis, Zhan et al. and Broyl et al. suggested that multiple myeloma has 7-10 molecular subtypes, which can be further simplified into high-risk and low-risk groups according to patient survival time. In addition, prognosis-related gene expression signatures such as UAMS-70, UAMS-17, UAMS-80, IFM-15, Millennium-100 and EMC-92, gene amplification indices such as GPI-5 and MRC-IX-6, and centrosome amplification indices have been proposed. However, the above molecular typing systems and expression signatures cannot predict the response to drug treatment and cannot be related to the developmental process of plasma cells, and the relationship between the genes used for molecular typing and the cause of multiple myeloma has not been elucidated.
Disclosure of Invention
One object of the present invention is to provide a use of a substance for obtaining or detecting the expression of 97 genes in a multiple myeloma patient to be tested.
The invention provides the use of a substance for obtaining or detecting the expression of 97 genes in a multiple myeloma patient to be tested in the preparation of a product for predicting the prognostic survival rate of the multiple myeloma patient to be tested.
The invention also provides the use of a substance for obtaining or detecting the expression of 97 genes in a multiple myeloma patient to be tested in the preparation of a product for predicting the prognostic survival time of the multiple myeloma patient.
The invention also provides the use of a substance for obtaining or detecting the expression of 97 genes in a multiple myeloma patient to be tested in the preparation of a product for predicting the prognostic survival risk of the multiple myeloma patient.
Another object of the invention is to provide a use of a substance for obtaining or detecting the expression of 97 genes in a multiple myeloma patient to be tested together with equipment for running a Bayesian classifier for multiple myeloma.
The invention provides the use of a substance for obtaining or detecting the expression of 97 genes in a multiple myeloma patient to be tested and equipment for running a Bayesian classifier for multiple myeloma in the preparation of a product for predicting the prognostic survival rate of the multiple myeloma patient to be tested.
The invention also provides the use of a substance for obtaining or detecting the expression of 97 genes in a multiple myeloma patient to be tested and equipment for running the Bayesian classifier for multiple myeloma in the preparation of a product for predicting the prognostic survival time of the multiple myeloma patient.
The invention also provides the use of a substance for obtaining or detecting the expression of 97 genes in a multiple myeloma patient to be tested and equipment for running the Bayesian classifier for multiple myeloma in the preparation of a product for predicting the prognostic survival risk of the multiple myeloma patient.
The 97 genes are as follows:
ACBD3, ADAR, ADSS, ALDH2, ANP32E, ANXA2, ATF3, ATP8B2, CACBBP, CAPN2, CCND1, CCT3, CDC42SE1, CERS2, CHSY3, CLIC1, CLMN, COPA, CSNK1G3, DAP3, DENND1B, ENSA, EPRS, EPSTI1, EVL, FAM13A, FAM49A, FLAD1, FRZB, GLRX2, HAX1, HDGF, HLA-A, HLA-B, HLA-C, HLA-C, HLA-C, HLA, TIP 20L C, HLA, UF C, HLA, MCKLF C, HLA, LAMTOR C, HLA, LDTP C, HLA, MOXD C, HLA, MRPL C, HLA, TOVPOPP C, HLA, TRYP C, HLA, PSTRYP C, HLA, TRYP C, HLA, PSTNP C, HLA, TRYP C, HLA, PSTNB C, HLA, TRYP C, HLA, PSTNT C, HLA, TRYP C, HLA, PST C, HLA, TRYP.
The Bayesian classifier for multiple myeloma is obtained by a method comprising the following steps:
1) obtaining expression level data of the 97 genes for n multiple myeloma samples;
2) dividing the n multiple myeloma samples into two subtypes, MCL1-M-High and MCL1-M-Low, based on the expression level data of the 97 genes, using a consensus clustering algorithm;
3) constructing a naive Bayes classifier by a naive Bayes method, based on the two subtypes from step 2), the expression level data of the 97 genes of the n multiple myeloma samples from step 1), and the prognostic survival time data of the n multiple myeloma samples.
In step 3), the n multiple myeloma samples are randomly divided into a training set and a validation set at a sample-number ratio greater than 1:1; then, using the expression level data of the 97 genes in the training set together with the MCL1-M-High or MCL1-M-Low subtype label of each sample obtained by the consensus clustering algorithm, a Bayesian classifier for predicting whether a single patient belongs to the MCL1-M-High or the MCL1-M-Low subtype is established with the naive Bayes algorithm in the R machine learning package klaR.
The expression level data of the 97 genes in each multiple myeloma sample are obtained by detection or from a database.
A third object of the present invention is to provide a product.
The product provided by the invention comprises a substance for obtaining or detecting the expression of 97 genes in a multiple myeloma patient to be tested and equipment for running the Bayesian classifier for multiple myeloma.
The above product has at least one of the following functions:
1) predicting the prognostic survival rate of a multiple myeloma patient to be tested;
2) predicting the prognostic survival time of a multiple myeloma patient;
3) predicting the prognostic survival risk of a multiple myeloma patient.
The product also comprises a carrier on which a detection method is recorded.
The detection method comprises the following steps: obtaining expression level data of the 97 genes of the multiple myeloma patient to be tested using the substance for obtaining or detecting the expression of the 97 genes; and typing the 97-gene expression level data of the patient using the Bayesian classifier for multiple myeloma, wherein the predicted prognostic survival time of a patient belonging to the MCL1-M-High subtype is significantly shorter than that of a patient belonging to the MCL1-M-Low subtype.
In the above product, the multiple myeloma patient to be tested is a single patient or a plurality of patients.
A fourth object of the invention is to provide a method for constructing a model for typing multiple myeloma patients.
The method provided by the invention comprises the following steps:
1) obtaining expression level data of the 97 genes for n multiple myeloma samples;
2) dividing the n multiple myeloma samples into two subtypes, MCL1-M-High and MCL1-M-Low, based on the expression level data of the 97 genes, using a consensus clustering algorithm;
3) constructing a naive Bayes classifier, which is the target model, by a naive Bayes method, based on the two subtypes from step 2), the expression level data of the 97 genes of the n multiple myeloma samples from step 1), and the prognostic survival time data of the n multiple myeloma samples.
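Step 2) can be illustrated with a simplified sketch of consensus clustering. It is written in Python using only the standard library and is not the inventors' implementation. The idea shown is the core one: cluster many random subsamples, record how often each pair of samples lands in the same cluster, and read the final grouping from that consensus matrix. The helper names `two_means` and `consensus_matrix`, the 2-means base clusterer, the 0.5 threshold, and the two-gene toy data are all assumptions made for illustration; published consensus clustering procedures usually apply hierarchical clustering to the consensus matrix instead of a threshold.

```python
import random

def two_means(points, iters=10):
    """Tiny 2-means clustering with a deterministic farthest-point start."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    first = points[0]
    second = max(points, key=lambda p: dist2(p, first))
    centers = [list(first), list(second)]
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [0 if dist2(p, centers[0]) <= dist2(p, centers[1]) else 1
                  for p in points]
        for k in (0, 1):
            members = [p for p, a in zip(points, assign) if a == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def consensus_matrix(data, n_resamples=50, fraction=0.75, seed=0):
    """Fraction of resamples in which each co-sampled pair shares a cluster."""
    rng = random.Random(seed)
    n = len(data)
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    size = max(2, int(n * fraction))
    for _ in range(n_resamples):
        idx = rng.sample(range(n), size)
        assign = two_means([data[i] for i in idx])
        for a, i in enumerate(idx):
            for b, j in enumerate(idx):
                sampled[i][j] += 1
                together[i][j] += assign[a] == assign[b]
    return [[together[i][j] / sampled[i][j] if sampled[i][j] else 0.0
             for j in range(n)] for i in range(n)]

# Hypothetical toy expression values: 8 samples x 2 genes, two clear groups.
toy = [[8.0, 7.5], [8.2, 7.9], [7.8, 8.1], [8.1, 7.7],
       [2.0, 2.5], [2.2, 1.9], [1.8, 2.1], [2.1, 2.3]]
matrix = consensus_matrix(toy)
# Samples that usually co-cluster with sample 0 form one group, the rest the other.
clusters = [0 if matrix[0][j] > 0.5 else 1 for j in range(len(toy))]
```

Which of the two resulting groups would be called MCL1-M-High and which MCL1-M-Low depends on the expression level of the module genes and is not decided by this sketch.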
The product also comprises equipment for running the Bayesian classifier for multiple myeloma (the equipment may be an optical disc, a computer, or the like).
In the above product, the n multiple myeloma samples are 551 samples.
The ratio greater than 1:1 may be 2:1; that is, the training set and the validation set are randomly divided at a ratio of 2:1.
The use of the substance for obtaining or detecting the expression of 97 genes in a multiple myeloma patient to be tested and/or the equipment for running the Bayesian classifier for multiple myeloma, or of the model obtained by the above method, in the preparation of a product for predicting the prognostic survival rate of the multiple myeloma patient to be tested also falls within the protection scope of the invention.
The use of the substance for obtaining or detecting the expression of 97 genes in a multiple myeloma patient to be tested and/or the equipment for running the Bayesian classifier for multiple myeloma, or of the model obtained by the above method, in the preparation of a product for predicting the prognostic survival time of the multiple myeloma patient also falls within the protection scope of the invention.
The use of the substance for obtaining or detecting the expression of 97 genes in a multiple myeloma patient to be tested and/or the equipment for running the Bayesian classifier for multiple myeloma, or of the model obtained by the above method, in the preparation of a product for predicting the prognostic survival risk of the multiple myeloma patient also falls within the protection scope of the invention.
The invention also provides a method for typing a multiple myeloma patient to be tested, comprising the following steps:
detecting or obtaining expression level data of the 97 genes of the multiple myeloma patient to be tested; and typing the expression level data of the 97 genes using the Bayesian classifier for multiple myeloma to determine whether the patient belongs to the MCL1-M-High subtype or the MCL1-M-Low subtype.
The invention also provides a method for predicting the prognostic survival rate of a multiple myeloma patient, comprising the following steps: detecting or obtaining expression level data of the 97 genes of the multiple myeloma patient to be tested; and then typing the expression level data of the 97 genes using the Bayesian classifier for multiple myeloma, wherein the predicted survival rate of a patient belonging to the MCL1-M-High subtype is lower than that of a patient belonging to the MCL1-M-Low subtype.
The invention also provides a method for predicting the prognostic survival risk of a multiple myeloma patient, comprising the following steps: detecting or obtaining expression level data of the 97 genes of the multiple myeloma patient to be tested; and then typing the expression level data of the 97 genes using the Bayesian classifier for multiple myeloma, wherein a patient belonging to the MCL1-M-High subtype has a high risk of a low prognostic survival rate, and a patient belonging to the MCL1-M-Low subtype has a low risk of a low prognostic survival rate.
The expression levels of the above genes are all gene expression levels in tumor cells.
To overcome the above drawbacks, the inventors explored whether a gene co-expression network surrounding key signaling pathways conserved during germinal center (GC) plasma cell development could help elucidate MM pathogenesis and be applied to molecular typing of MM. The inventors focused on searching multiple myeloma for dysregulated gene networks that control the differentiation of B cells into plasma cells, as such networks may play a key role in the development of multiple myeloma. Through this analysis, a gene module co-expressed with the MCL1 gene (MCL1-M for short) was identified and used to divide multiple myeloma into two main subtypes, the MCL1-M-High subtype and the MCL1-M-Low subtype. The two subtypes have significantly different prognoses and genetic characteristics. More importantly, the typing system can also predict a patient's response to bortezomib therapy and is associated with the developmental stage of plasma cells. These findings may pave the way for individualized precision treatment and improve understanding of the cause of multiple myeloma.
Drawings
FIG. 1 is the ROC curve of the Bayesian classifier typing results in the GSE2658 validation set.
FIG. 2 is the ROC curve of the Bayesian classifier typing results in the MMRF dataset.
FIG. 3 is the ROC curve of the Bayesian classifier typing results in the GSE19784 dataset.
FIG. 4 shows the overall survival curves of the multiple myeloma MCL1-M-High and MCL1-M-Low molecular subtypes in GSE2658.
FIG. 5 shows the overall survival curves of the multiple myeloma MCL1-M-High and MCL1-M-Low molecular subtypes in the MMRF dataset.
FIG. 6 shows the overall survival curves (upper panel) and progression-free survival curves (lower panel) of the multiple myeloma MCL1-M-High and MCL1-M-Low molecular subtypes in GSE19784.
FIG. 7 shows that patients of the MCL1-M-High and MCL1-M-Low subtypes in GSE19784 respond differently to bortezomib treatment.
Detailed Description
The experimental procedures used in the following examples are all conventional procedures unless otherwise specified.
Materials, reagents and the like used in the following examples are commercially available unless otherwise specified.
Example 1 screening of molecular diagnostic markers for multiple myeloma and implementation of molecular typing
Using the multiple myeloma expression dataset GSE2658 from the NCBI GEO public database, 87 genes co-expressed with MCL1 were obtained by Pearson correlation analysis. On this basis, 46 genes enriched in multiple myeloma samples with low expression of the MCL1-M genes were identified, giving 133 candidate genes in total. For more stable molecular typing, 36 of the 133 genes with low classification efficiency were screened out, and 97 classification genes with stable differential expression and relatively high abundance were finally retained.
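The co-expression screening step can be sketched as follows. This is an illustrative Python example using only the standard library, not the inventors' pipeline. The function names `pearson` and `coexpressed_genes`, the toy expression values, the placeholder gene names, and the 0.8 correlation cutoff are all assumptions; the source does not state which cutoff was used.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile carries no correlation information
    return sxy / math.sqrt(sxx * syy)

def coexpressed_genes(expr, anchor, threshold):
    """Genes whose correlation with the anchor gene is at least `threshold`."""
    ref = expr[anchor]
    return sorted(gene for gene, values in expr.items()
                  if gene != anchor and pearson(ref, values) >= threshold)

# Hypothetical toy expression profiles across five samples.
toy_expr = {
    "MCL1":   [1.0, 2.0, 3.0, 4.0, 5.0],
    "GENE_A": [2.1, 3.9, 6.2, 8.0, 9.9],  # rises with MCL1
    "GENE_B": [5.0, 4.0, 3.0, 2.0, 1.0],  # falls as MCL1 rises
    "GENE_C": [3.0, 1.0, 4.0, 1.0, 5.0],  # only loosely related
}
selected = coexpressed_genes(toy_expr, "MCL1", threshold=0.8)

# Gene counts reported in the text: 87 + 46 candidates, 36 removed.
retained = 87 + 46 - 36
```

In this toy data only `GENE_A` passes the cutoff, and the count arithmetic reproduces the 97 retained genes.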
The names of 97 genes are as follows:
ACBD3, ADAR, ADSS, ALDH2, ANP32E, ANXA2, ATF3, ATP8B2, CACBBP, CAPN2, CCND1, CCT3, CDC42SE1, CERS2, CHSY3, CLIC1, CLMN, COPA, CSNK1G3, DAP3, DENND1B, ENSA, EPRS, EPSTI1, EVL, FAM13A, FAM49A, FLAD1, FRZB, GLRX2, HAX1, HDGF, HLA-A, HLA-B, HLA-C, HLA-C, HLA-C, HLA, TIP 20L C, HLA, UF C, HLA, MCKLF C, HLA, LAMTOR C, HLA, LDTP C, HLA, MOXD C, HLA, MRPL C, HLA, TOVPOPP C, HLA, TRYP C, HLA, PSTRYP C, HLA, TRYP C, HLA, PSTNP C, HLA, TRYP C, HLA, PSTNB C, HLA, TRYP C, HLA, PSTNT C, HLA, TRYP C, HLA, PST C, HLA, TRYP.
These 97 genes then serve as the classifier genes for typing. Using the expression data of the 97 genes in the 551 multiple myeloma samples of the GSE2658 dataset, the 551 cases were divided in an unsupervised manner into two subtypes, MCL1-M-High and MCL1-M-Low, with the Consensus Clustering algorithm. However, a clustering-based classification method cannot perform molecular typing on an individual sample. To enable individualized diagnosis, the 551 samples were randomly divided at a ratio of 2:1 into a training set (369 cases) and a validation set (182 cases) for establishing and evaluating an individualized classifier. Stratified sampling was used to ensure that the proportions of the two subtypes MCL1-M-High and MCL1-M-Low in the training set and the validation set are consistent with the original proportions.
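A stratified split of this kind can be sketched as below. The example is in Python with the standard library only and is not taken from the original work. The function name `stratified_split`, the seed, and the toy label counts are assumptions, and the rounding used here is not claimed to reproduce the exact 369/182 division.

```python
import random

def stratified_split(labels, train_fraction=2 / 3, seed=0):
    """Split sample indices into training and validation sets per subtype."""
    rng = random.Random(seed)
    by_label = {}
    for index, label in enumerate(labels):
        by_label.setdefault(label, []).append(index)
    train, valid = [], []
    for label in sorted(by_label):
        indices = by_label[label][:]
        rng.shuffle(indices)
        # The same fraction is taken from every subtype, which preserves
        # the original subtype proportions in both sets.
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        valid.extend(indices[cut:])
    return sorted(train), sorted(valid)

# Hypothetical toy labels: 30 High and 60 Low samples.
labels = ["MCL1-M-High"] * 30 + ["MCL1-M-Low"] * 60
train_idx, valid_idx = stratified_split(labels)
train_high = sum(labels[i] == "MCL1-M-High" for i in train_idx)
```

Splitting within each subtype, not over the pooled samples, is what keeps the High to Low ratio the same in both sets.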
Based on the expression level data of the 97 classification genes of the 369 samples in the training set and the MCL1-M-High and MCL1-M-Low subtype labels assigned by the Consensus Clustering algorithm, a Bayesian classifier for multiple myeloma capable of predicting whether a single patient belongs to the MCL1-M-High or the MCL1-M-Low subtype was established using the naive Bayes algorithm provided by the R machine learning package klaR.
The classification accuracy of the classifier was then evaluated using the 182 samples in the validation set.
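The training and evaluation steps can be illustrated with a minimal naive Bayes sketch. The original work used the klaR package in R; the Python class below is a stand-in written with the standard library, and it assumes a Gaussian density per gene and subtype, which the source does not specify. The class name `GaussianNaiveBayes`, the `accuracy` helper, and the two-gene toy data are hypothetical.

```python
import math

class GaussianNaiveBayes:
    """Naive Bayes with one Gaussian per gene and subtype (assumed model)."""

    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.priors, self.means, self.variances = {}, {}, {}
        for c in self.classes:
            rows = [x for x, label in zip(X, y) if label == c]
            self.priors[c] = len(rows) / len(X)
            cols = list(zip(*rows))
            self.means[c] = [sum(col) / len(col) for col in cols]
            # A small constant keeps the variance positive for constant genes.
            self.variances[c] = [
                sum((v - m) ** 2 for v in col) / len(col) + 1e-9
                for col, m in zip(cols, self.means[c])
            ]
        return self

    def log_score(self, x, c):
        """Unnormalized log posterior: log prior plus per-gene log densities."""
        total = math.log(self.priors[c])
        for v, m, s2 in zip(x, self.means[c], self.variances[c]):
            total += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        return total

    def predict(self, x):
        return max(self.classes, key=lambda c: self.log_score(x, c))

def accuracy(model, X, y):
    """Fraction of samples whose predicted subtype matches the label."""
    return sum(model.predict(x) == label for x, label in zip(X, y)) / len(y)

# Hypothetical toy data: two genes per sample instead of 97.
train_X = [[8.0, 2.0], [7.5, 2.5], [8.5, 1.5],
           [3.0, 6.0], [2.5, 6.5], [3.5, 5.5]]
train_y = ["MCL1-M-High"] * 3 + ["MCL1-M-Low"] * 3
valid_X = [[7.8, 2.2], [3.1, 6.2]]
valid_y = ["MCL1-M-High", "MCL1-M-Low"]

model = GaussianNaiveBayes().fit(train_X, train_y)
valid_accuracy = accuracy(model, valid_X, valid_y)
```

Because the fitted model scores one expression vector at a time, it can type a single patient, which the clustering step alone cannot do.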
The model was optimized iteratively according to the returned accuracy until the classification accuracy exceeded 95%. The accuracy data of the classifier are shown in Table 1, and the receiver operating characteristic (ROC) curve is shown in FIG. 1.
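For readers unfamiliar with ROC analysis, the following Python sketch (standard library only) shows how ROC points and the area under the curve can be computed from classifier scores. The function names `roc_points` and `roc_auc`, the scores, and the labels are hypothetical and are not the data behind FIG. 1.

```python
def roc_points(scores, labels, positive):
    """(false positive rate, true positive rate) at each score threshold."""
    n_pos = sum(label == positive for label in labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(s >= threshold and l == positive
                 for s, l in zip(scores, labels))
        fp = sum(s >= threshold and l != positive
                 for s, l in zip(scores, labels))
        points.append((fp / n_neg, tp / n_pos))
    return points

def roc_auc(scores, labels, positive):
    """AUC as the chance that a positive sample outscores a negative one."""
    pos = [s for s, l in zip(scores, labels) if l == positive]
    neg = [s for s, l in zip(scores, labels) if l != positive]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

# Hypothetical posterior probabilities of the MCL1-M-High subtype.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = ["High", "High", "Low", "High", "Low", "Low"]
curve = roc_points(scores, labels, positive="High")
auc = roc_auc(scores, labels, positive="High")
```

An AUC of 1.0 means every High sample scores above every Low sample, and 0.5 corresponds to random guessing.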
TABLE 1 accuracy of classifiers in validation set established Using GSE2658 dataset
To determine whether the classifier built using the GSE2658 dataset can be generalized, the inventors subsequently used it to predict the molecular subtypes of the samples in the large multiple myeloma dataset MMRF released by the NCI and in the GEO multiple myeloma expression profile dataset GSE19784.
Unlike GSE2658, the MMRF dataset measures gene expression by RNA-seq rather than by microarray chip. The prediction results are shown in Table 2, and the ROC curve is shown in FIG. 2.
TABLE 2 accuracy in MMRF data set of classifiers built using GSE2658 data set
The results show that the classifier maintains high accuracy even across platforms, which indicates that it has high value for wider application.
GSE19784 is also a multiple myeloma expression dataset and, like GSE2658, measures gene expression with the U133 Plus 2.0 chip. However, the two datasets were generated by different experiments at different times, and the experimental conditions may differ, so the two datasets differ significantly in distribution and noise level. The subtype prediction results for this dataset are shown in Table 3, and the ROC curve is shown in FIG. 3.
TABLE 3 accuracy in the GSE19784 dataset of classifiers built using the GSE2658 dataset
The results show that the classifier copes well with these differences and still maintains high accuracy.
Example 2 use of Bayesian classifier for multiple myeloma in predicting patient prognostic survival
First, database GSE2658
Based on the expression level data of the 97 classifier genes in 551 samples from multiple myeloma patients (measured before treatment) in the GSE2658 database, the 551 samples were typed with the multiple myeloma Bayesian classifier obtained in Example 1, yielding 249 MCL1-M-High and 302 MCL1-M-Low multiple myeloma samples.
The 551 patients were followed up for 72 months after treatment, and survival analysis (Kaplan-Meier analysis and Cox regression analysis) was performed on the follow-up results. As shown in FIG. 4, the two multiple myeloma subtypes MCL1-M-High and MCL1-M-Low have significantly different prognoses, and the overall survival rate of the MCL1-M-High subtype is lower than that of the MCL1-M-Low subtype (log-rank test, p = 0.0201; likelihood ratio test, hazard ratio = 1.588, p = 0.0212).
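The two survival analyses named above can be sketched in Python with the standard library. The example computes a Kaplan-Meier curve and a two-group log-rank test on made-up follow-up data. The function names and all times and event flags are hypothetical, and the Cox regression that yields the hazard ratio is not reproduced here.

```python
import math

def kaplan_meier(times, events):
    """Survival probability after each distinct event time (1 = death)."""
    data = sorted(zip(times, events))
    at_risk, survival, curve = len(data), 1.0, []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = sum(1 for tt, e in data if tt == t and e)
        leaving = sum(1 for tt, _ in data if tt == t)
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # deaths and censored samples leave the risk set
        i += leaving
    return curve

def log_rank(times_a, events_a, times_b, events_b):
    """Two-group log-rank test: chi-square statistic (1 df) and p-value."""
    event_times = sorted({t for t, e in zip(times_a + times_b,
                                            events_a + events_b) if e})
    observed = expected = variance = 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)
        n_b = sum(1 for x in times_b if x >= t)
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e)
        n, d = n_a + n_b, d_a + d_b
        observed += d_a
        expected += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    chi2 = (observed - expected) ** 2 / variance
    return chi2, math.erfc(math.sqrt(chi2 / 2))

# Hypothetical follow-up in months; event 1 = death, 0 = censored.
high_t, high_e = [3, 5, 7, 9, 12], [1, 1, 1, 1, 0]
low_t, low_e = [10, 14, 18, 20, 24], [1, 0, 1, 0, 0]
km_high = kaplan_meier(high_t, high_e)
chi2, p_value = log_rank(high_t, high_e, low_t, low_e)
```

A small p-value indicates that the two survival curves are unlikely to differ by chance alone, which is how the log-rank results in the text are read.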
Therefore, typing with the 97 genes of the MCL1 gene module using the Bayesian classifier can be used to predict the prognosis of a patient to be tested.
Second, database MMRF
Based on the expression level data of the 97 classifier genes of the MCL1 gene module in 534 samples from multiple myeloma patients (measured before treatment) in the MMRF dataset, the 534 samples were divided into the two subtypes MCL1-M-High (231 samples) and MCL1-M-Low (303 samples) using the multiple myeloma Bayesian classifier obtained in Example 1.
The 534 patients were followed up for 48 months after treatment, and survival analysis (Kaplan-Meier analysis and Cox regression analysis) was performed on the follow-up results. As shown in FIG. 5, in the MMRF dataset the MCL1-M-High and MCL1-M-Low subtypes also have significantly different prognoses, and the overall survival rate of the MCL1-M-High subtype is lower than that of the MCL1-M-Low subtype (log-rank test, p = 0.006663; likelihood ratio test, hazard ratio = 1.838, p = 0.00706).
These results show that, regardless of the platform from which the gene expression data come, typing with the 97 genes of the MCL1 gene module using the Bayesian classifier can be used to predict the prognosis of a patient to be tested.
Third, database GSE19784
Based on the expression level data of the 97 classifier genes of the MCL1 gene module in the validation set of 304 samples from multiple myeloma patients (measured before treatment) in the GSE19784 database, the 304 samples were divided into the two subtypes MCL1-M-High (107 samples) and MCL1-M-Low (196 samples) using the multiple myeloma Bayesian classifier obtained in Example 1.
The 304 patients were followed up for 96 months after treatment, and survival analysis (Kaplan-Meier analysis and Cox regression analysis) was performed on the follow-up results. The results are shown in FIG. 6 (A, overall survival; B, progression-free survival). In the GSE19784 dataset, the MCL1-M-High and MCL1-M-Low subtypes also have significantly different prognoses, with the overall survival rate of the MCL1-M-High subtype lower than that of the MCL1-M-Low subtype (log-rank test, p < 0.0001; likelihood ratio test, hazard ratio = 1.91, p = 0.0002). The GSE19784 dataset also includes information on disease progression, so the difference in progression-free survival was analyzed as well. Similarly, the progression-free survival rate of the MCL1-M-High subtype is lower than that of the MCL1-M-Low subtype (log-rank test, p = 0.0282; likelihood ratio test, hazard ratio = 1.36, p = 0.031). This result again demonstrates that typing with the 97 genes of the MCL1 gene module using the Bayesian classifier can be used to predict the prognosis of a patient to be tested.
Example 3 Molecular diagnostic markers and typing of multiple myeloma for predicting whether a patient to be tested is suitable for treatment with bortezomib
The GSE19784 multiple myeloma expression dataset is derived from a phase III clinical drug trial (HOVON-65/GMMG-HD4) and is accompanied by the patients' treatment regimens. The trial randomly assigned patients to two groups receiving one of two drug combinations, VAD (155 cases) or PAD (148 cases); the difference between the two is that the PAD regimen includes bortezomib (trade name: Velcade). The gene expression data of the patients were collected before treatment.
The patients were first stratified by the MCL1-M molecular typing described above into the two subtypes MCL1-M-High (PAD 51, VAD 56) and MCL1-M-Low (PAD 104, VAD 92). Within each subtype, the patients were then grouped by drug treatment regimen, and survival analysis (Kaplan-Meier analysis and Cox regression analysis) was performed.
As shown in FIG. 7, A is the overall survival of the MCL1-M-High group, B is the overall survival of the MCL1-M-Low group, C is the progression-free survival of the MCL1-M-High group, and D is the progression-free survival of the MCL1-M-Low group (left, MCL1-M-High group; right, MCL1-M-Low group; upper, overall survival curves; lower, progression-free survival curves). Only in the MCL1-M-High group did the bortezomib-containing PAD regimen prolong patient survival, especially progression-free survival. This indicates that bortezomib can clinically delay relapse and disease progression in patients of the MCL1-M-High group but has no such effect in patients of the MCL1-M-Low group. In conclusion, the molecular typing of the invention can guide clinical medication: avoiding the use of bortezomib in patients of the MCL1-M-Low group can reduce the economic burden on patients and can also reduce the side effects caused by drug treatment.