CN116064822A - Lung adenocarcinoma diagnosis model and prognosis model based on machine learning and single cell transcription map - Google Patents

Publication number
CN116064822A
Authority
CN
China
Prior art keywords
lung adenocarcinoma
kit
subject
risk
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310215808.0A
Other languages
Chinese (zh)
Inventor
金腾川
程庆宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC
Priority to CN202310215808.0A
Publication of CN116064822A

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • C12Q1/6886: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
    • C12Q2600/00: Oligonucleotides characterized by their use
    • C12Q2600/106: Pharmacogenomics, i.e. genetic variability in individual responses to drugs and drug metabolism
    • C12Q2600/118: Prognosis of disease development
    • C12Q2600/158: Expression markers
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems

Abstract

The invention provides a lung adenocarcinoma diagnosis model and a prognosis model based on machine learning and single-cell transcription maps. In particular, the invention provides a kit for predicting or diagnosing lung adenocarcinoma in a subject, or for determining whether a subject is at risk of developing lung adenocarcinoma, comprising a label and reagents for detecting the expression level or amount of the biomarkers SCGB1A1, IGKC, ADIRF, SFTPC, FABP5, CD24, SLPI, CYB5A, TPPP3, FABP4, IGHG4, FOLR1 and CLDN18 in a sample. The invention also provides a kit comprising a label and reagents for detecting the expression level or amount of the biomarkers MRPS11, CD3EAP, EMC6, DMD, SIX5 and STK33 in a sample; this kit is used for predicting the prognostic risk of a lung adenocarcinoma subject, predicting the subject's 1-year, 3-year or 5-year survival probability, or predicting the probability that a lung adenocarcinoma patient responds to paclitaxel or pemetrexed.

Description

Lung adenocarcinoma diagnosis model and prognosis model based on machine learning and single cell transcription map
Technical Field
The invention belongs to the field of biotechnology (medical diagnosis/medical detection field), and particularly relates to a machine learning model and a regression model for diagnosing lung adenocarcinoma, evaluating prognosis of a patient and evaluating drug response.
Background
The incidence of lung cancer remains high, and lung cancer remains the leading cause of cancer death. Lung adenocarcinoma is the most common type of lung cancer; it is often discovered at an advanced stage, with correspondingly high mortality. Lung adenocarcinoma tissue shows a high degree of inter-tumor and intra-tumor heterogeneity, which leads to inaccurate diagnosis and poor therapeutic effect. With the development of single-cell sequencing, single-cell transcriptome sequencing can now delineate tumor tissue heterogeneity, changes in the tumor microenvironment, cell composition, marker changes and the like at single-cell resolution.
At present, lung cancer is diagnosed mainly by auxiliary imaging examination and pathological examination (including histological, cytological and serological examination). Auxiliary imaging mainly includes chest X-ray radiography, CT, magnetic resonance imaging (MRI), ultrasound and positron emission computed tomography (PET-CT); although these methods can reveal a nodule, they cannot confirm whether it is tumor tissue or whether it is malignant, and they sometimes lead to over-diagnosis and over-treatment. Cytological examination mainly includes sputum cytology, thoracentesis, transthoracic lung puncture, biopsy of superficial lymph nodes and subcutaneous metastatic nodules, bronchoscopy, mediastinoscopy and thoracoscopy, but these methods still have drawbacks such as a low positive diagnosis rate, a limited examination range, technical difficulty and a high risk of trauma. Laboratory serological examination can support the diagnosis of lung cancer, but its sensitivity and specificity remain insufficient: no single marker can serve as a diagnostic index, and markers can be affected by other factors and give false-positive results. For example, CYFRA21, a sensitive index for non-small cell lung cancer, can also be falsely elevated in renal failure. A new, accurate auxiliary diagnostic model and evaluation system could therefore reduce inaccurate diagnosis and over-diagnosis to some extent. In addition, patients with lung adenocarcinoma respond poorly to treatment and have a poor prognosis, yet no well-developed evaluation system exists, so an accurate and sensitive prognosis model is also needed to infer prognostic outcome and response to drug treatment.
Disclosure of Invention
The present invention relates generally to machine learning models for lung adenocarcinoma diagnosis and regression models for patient prognosis evaluation and drug (paclitaxel or pemetrexed) response assessment.
Regarding the diagnostic model, the invention uses single-cell transcriptome sequencing data and, by training a machine learning model, identifies 13 markers (genes) that distinguish tumor tissue from normal tissue: SCGB1A1, IGKC, ADIRF, SFTPC, FABP5, CD24, SLPI, CYB5A, TPPP3, FABP4, IGHG4, FOLR1 and CLDN18. A random forest model trained on these 13 markers distinguishes lung adenocarcinoma tissue from normal lung tissue more accurately. On the training set, the diagnostic model achieves an accuracy of about 96.2% and an AUC of about 0.993 in ROC analysis; on the TCGA test dataset (from a different source than the training set), an accuracy of about 93.8% and an AUC of about 0.987; and on the GSE31210 test dataset (also from a different source than the training set), an accuracy of about 91.9% and an AUC of about 0.896. The diagnostic model is therefore not only accurate: ROC analysis of the two test sets (AUC values of 0.987 and 0.896, respectively) shows that it can also achieve high sensitivity and high specificity.
In a specific embodiment of the diagnostic model, the kit is used to predict or diagnose lung adenocarcinoma in a subject, or to determine whether a subject is at risk of developing lung adenocarcinoma, by the following steps:
(1) determining the expression level or amount of each biomarker in a sample from the subject using the kit; and
(2) inputting the level of each biomarker in the sample into a random forest model to calculate the probability of lung adenocarcinoma, wherein the training set of the random forest model comprises the expression level or amount of each biomarker in tumor tissues from a plurality of subjects having lung adenocarcinoma and in a plurality of normal control tissues;
wherein a calculated probability of lung adenocarcinoma greater than a threshold value (e.g., 0.5) indicates that the subject has lung adenocarcinoma or is at risk of developing lung adenocarcinoma.
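The two steps and the threshold rule can be sketched as follows. This is only an illustration under stated assumptions, not the patented implementation: `predict_proba` is a hypothetical stand-in for a trained random forest, and dividing each marker by GAPDH is an assumed form of the GAPDH-referenced normalization described elsewhere in this document (the text does not say whether a ratio or a difference is used).

```python
from typing import Callable, Dict, List

# The 13 diagnostic markers named in the text, in a fixed input order.
DIAGNOSTIC_MARKERS: List[str] = [
    "SCGB1A1", "IGKC", "ADIRF", "SFTPC", "FABP5", "CD24", "SLPI",
    "CYB5A", "TPPP3", "FABP4", "IGHG4", "FOLR1", "CLDN18",
]


def normalize_to_gapdh(expression: Dict[str, float]) -> List[float]:
    """Return marker values relative to GAPDH (assumption: a simple ratio)."""
    gapdh = expression["GAPDH"]
    if gapdh <= 0:
        raise ValueError("GAPDH expression must be positive")
    return [expression[gene] / gapdh for gene in DIAGNOSTIC_MARKERS]


def diagnose(
    expression: Dict[str, float],
    predict_proba: Callable[[List[float]], float],  # hypothetical model wrapper
    threshold: float = 0.5,
) -> bool:
    """True when the predicted tumor probability is greater than the threshold."""
    features = normalize_to_gapdh(expression)
    return predict_proba(features) > threshold
```

Passing the model as a callable keeps the decision rule independent of how the random forest was trained or stored.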
Regarding the prognosis model, the invention likewise performs feature screening based on single-cell transcriptome sequencing, combined with survival analysis and regression analysis, and finally obtains 6 markers: MRPS11, CD3EAP, EMC6, DMD, SIX5 and STK33. A prognostic risk scoring model is constructed from the formula risk score = Σ (coefficient × normalized expression value), where the coefficients of the six markers are, in order, 0.098, 0.181, 0.123, -0.257, 0.114 and -0.288. The resulting prognostic risk score can also be used to estimate a patient's survival probability over different survival periods and to assess response to drugs (paclitaxel or pemetrexed).
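The scoring formula is a plain weighted sum, which can be sketched as below. The coefficients are those listed in the text; the function and variable names are illustrative, and the input is assumed to be already normalized.

```python
from typing import Dict

# Coefficients as listed in the text, keyed by the gene they belong to.
COEFFICIENTS: Dict[str, float] = {
    "MRPS11": 0.098,
    "CD3EAP": 0.181,
    "EMC6": 0.123,
    "DMD": -0.257,
    "SIX5": 0.114,
    "STK33": -0.288,
}


def risk_score(normalized: Dict[str, float]) -> float:
    """Sum of coefficient * normalized expression over the six markers."""
    return sum(coef * normalized[gene] for gene, coef in COEFFICIENTS.items())
```

Genes with a negative coefficient (DMD, STK33) lower the score as their normalized expression rises.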
The invention provides a diagnosis model suitable for lung adenocarcinoma, a construction method thereof and a combination mode of 13 characteristic genes.
The invention provides a prognosis risk model suitable for lung adenocarcinoma, a construction method thereof and a combined mode of 6 characteristic genes.
The invention provides a prognosis risk assessment, survival probability assessment of different survival periods and drug (paclitaxel or pemetrexed) response assessment applicable to lung adenocarcinoma.
The invention provides clinical guidance applications of the described diagnostic model and prognosis model, including but not limited to clinical applications involving the diagnostic model and prognosis model and related product development, such as clinical prognosis evaluation, clinical medication guidance, and detection kits.
Specifically, the invention provides the following technical scheme:
1. A kit comprising reagents for detecting the expression level or amount of a biomarker in a sample, and a label; wherein the biomarker is SCGB1A1, IGKC, ADIRF, SFTPC, FABP5, CD24, SLPI, CYB5A, TPPP3, FABP4, IGHG4, FOLR1, and CLDN18;
preferably, the kit is used to predict or diagnose lung adenocarcinoma in a subject, or to determine whether a subject is at risk of developing lung adenocarcinoma;
preferably, the kit further comprises additional reagents selected from the group consisting of reagents for processing the sample, and reagents for performing PCR amplification (e.g., polymerase, dNTPs, and amplification buffers).
2. The kit of item 1, wherein the reagent comprises a primer set capable of specifically amplifying SCGB1A1, IGKC, ADIRF, SFTPC, FABP5, CD24, SLPI, CYB5A, TPPP3, FABP4, IGHG4, FOLR1 and CLDN18.
3. The kit of item 1, wherein the expression level of the biomarker is input into a random forest model to calculate the probability of lung adenocarcinoma; and
wherein the label states that when the calculated probability of lung adenocarcinoma is greater than a threshold value (e.g., 0.5), the subject has lung adenocarcinoma or is at risk of developing lung adenocarcinoma.
4. The kit according to item 3, wherein the expression levels or the expression amounts of SCGB1A1, IGKC, ADIRF, SFTPC, FABP5, CD24, SLPI, CYB5A, TPPP3, FABP4, IGHG4, FOLR1, CLDN18 are normalized with GAPDH as a reference and then input into a random forest model to calculate the probability of lung adenocarcinoma.
5. The kit of item 1, wherein the sample is a percutaneous pulmonary biopsy of the subject.
6. Use of a reagent for determining the expression level or amount of a biomarker in the preparation of a kit for predicting or diagnosing lung adenocarcinoma in a subject or determining whether a subject is at risk of developing lung adenocarcinoma; wherein the biomarker is SCGB1A1, IGKC, ADIRF, SFTPC, FABP5, CD24, SLPI, CYB5A, TPPP3, FABP4, IGHG4, FOLR1, and CLDN18.
7. A kit comprising reagents for detecting the expression level or amount of a biomarker in a sample, and a label, wherein the biomarker is MRPS11, CD3EAP, EMC6, DMD, SIX5, and STK33;
preferably, the label gives the formula for calculating a prognostic risk score: risk score = 0.098 × MRPS11 normalized expression + 0.181 × CD3EAP normalized expression + 0.123 × EMC6 normalized expression - 0.257 × DMD normalized expression + 0.114 × SIX5 normalized expression - 0.288 × STK33 normalized expression; the prognostic risk score is used, in combination with the clinically determined tumor stage, to obtain the 1-year, 3-year or 5-year survival probability of a lung adenocarcinoma subject from a nomogram, or the prognostic risk score is used to obtain the probability that a lung adenocarcinoma patient responds to paclitaxel or pemetrexed from a nomogram;
preferably, the expression levels or amounts of MRPS11, CD3EAP, EMC6, DMD, SIX5 and STK33 are normalized using the average of the expression amounts of MRPS11, CD3EAP, EMC6, DMD, SIX5 and STK33 as a reference, to obtain the respective normalized expression amounts;
preferably, the kit further comprises additional reagents selected from the group consisting of reagents for processing the sample, and reagents for performing PCR amplification (e.g., polymerase, dNTPs, and amplification buffers);
preferably, the kit is for predicting the prognostic risk of a lung adenocarcinoma subject, for predicting the 1-year, 3-year or 5-year survival probability of a lung adenocarcinoma subject, or for predicting the probability that a lung adenocarcinoma patient responds to paclitaxel or pemetrexed.
8. The kit according to item 7, wherein the reagent comprises a primer set capable of specifically amplifying MRPS11, CD3EAP, EMC6, DMD, SIX5 and STK33.
9. The kit of item 7, wherein the sample is a percutaneous pulmonary biopsy of the subject.
10. Use of a reagent for determining the expression level of a biomarker in the preparation of a kit, wherein the biomarker is MRPS11, CD3EAP, EMC6, DMD, SIX5, and STK33;
preferably, the kit is for predicting the risk of prognosis of a lung adenocarcinoma subject, for predicting the survival probability of a lung adenocarcinoma subject for 1 year, 3 years or 5 years, or for predicting the probability of response of a lung adenocarcinoma patient to paclitaxel or pemetrexed.
Advantages and positive effects of the invention
(1) Both the diagnostic model and the prognostic model of the invention screen characteristic genes from a single-cell transcriptome atlas, which offers higher resolution and finer expression profile information than bulk transcriptome sequencing; combined with the machine learning model and the regression model, this supports the accuracy and sensitivity of model-based detection.
(2) Both the diagnostic model and the prognostic model of the invention are based on a specific, standardized characteristic gene expression profile and can accept input files from multiple platforms and methods, which supports broad applicability and convenient detection.
Drawings
FIG. 1. Expression patterns of the characteristic genes of the diagnostic model in normal tissues and tumor tissues.
FIG. 2. ROC analysis results for evaluation of the diagnostic model in the training dataset (A), the TCGA test dataset (B) and the GSE31210 test dataset (C).
FIG. 3. Regression analysis results for the characteristic genes of the prognostic model. Panel (A) shows the Cox regression analysis results for the 6 characteristic genes of the prognostic model; panel (B) shows the Cox regression analysis result for the prognostic risk score, indicating that the score is a significant risk factor; panel (C) shows that, after grouping by quartiles of the risk score, survival differs very significantly between the high-risk and low-risk groups, and the median survival time of high-risk patients is very significantly shorter than that of low-risk patients.
FIG. 4. Evaluation of the prognostic model in datasets from different platforms. Panel (A) shows the Cox regression analysis results of the prognostic model in 3 other datasets (GSE13213, GSE31210, GSE72094), indicating that the prognostic risk score is a significant risk factor in each; survival analysis results are shown for GSE13213 (B), GSE31210 (C) and GSE72094 (D), where the survival probability of high-risk patients at the same time point is significantly lower than that of low-risk patients.
FIG. 5. Prognostic risk model and drug response. Based on the prognostic risk model and clinical phenotype data, for both paclitaxel (A) and pemetrexed (B) the prognostic risk score of the non-responsive group was significantly higher than that of the responsive group.
FIG. 6. Clinical guidance uses of the prognostic model, including prediction of survival probability over different survival periods and prediction of drug (paclitaxel or pemetrexed) response. Panel (A) is the nomogram for predicting survival probability over different survival periods; the accuracy of its predictions can be seen from the survival probability calibration curves for 1 year (B), 3 years (C) and 5 years (D); panel (E) is the nomogram for predicting drug (paclitaxel or pemetrexed) response; and panel (F) is the ROC analysis result for the drug (paclitaxel or pemetrexed) response prediction.
Detailed Description
The present invention will be further described in detail below with reference to specific embodiments and with reference to the accompanying drawings, in order to make the objects, technical solutions and advantages of the present invention more apparent.
Example 1 Construction, evaluation and use of the lung adenocarcinoma diagnostic model
(1) Identification of tumor-specific and normal-tissue-specific markers
Based on publicly available single-cell transcriptome data for lung adenocarcinoma (GSE131907, obtained from the GEO database, containing 58 samples), a total of 83,429 single cells and their gene expression profiles were obtained after preliminary quality control and filtering. The tumor microenvironment composition of lung adenocarcinoma was delineated using the Seurat package, and the epithelial cell fraction, comprising normal cells and tumor cells of different stages, was extracted. A large number of markers specific to tumor and normal tissue were then identified with FindAllMarkers in the Seurat package; an example is shown in FIG. 1.
(2) Construction, evaluation and use of diagnostic models
Based on the large number of tissue-specific markers obtained in the previous step, we first calculated, for each gene at each stage, the difference in expression proportion between tumor cells and normal cells, sorted the genes in descending order of this difference, took the top 200 genes for each stage, and merged and de-duplicated them to form the feature set. The expression of every gene in this set was then normalized with GAPDH as a reference, and a model was trained on these values together with the sample labels (normal or tumor) using a random forest function in the R language environment, giving a first random forest model.
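The candidate-selection step can be sketched as below. This is an illustration only: the data layout (one table per stage mapping each gene to its expressing fraction in tumor and in normal cells) is hypothetical, and ranking by the absolute difference is an assumption, since the text does not say whether the difference is signed.

```python
from typing import Dict, List, Tuple

# gene -> (fraction of tumor cells expressing it, fraction of normal cells expressing it)
StageTable = Dict[str, Tuple[float, float]]


def top_genes(table: StageTable, n: int) -> List[str]:
    """Genes ranked by |tumor fraction - normal fraction|, largest first."""
    ranked = sorted(
        table,
        key=lambda gene: abs(table[gene][0] - table[gene][1]),
        reverse=True,
    )
    return ranked[:n]


def merge_candidates(stages: Dict[str, StageTable], n: int = 200) -> List[str]:
    """Union of the per-stage top-n lists, de-duplicated, first occurrence kept."""
    merged: List[str] = []
    seen = set()
    for table in stages.values():
        for gene in top_genes(table, n):
            if gene not in seen:
                seen.add(gene)
                merged.append(gene)
    return merged
```

Because the per-stage lists overlap, the merged set can hold fewer than 200 genes per stage after de-duplication.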
We then selected the 13 feature genes whose decrease in Gini (Gini importance) in the first random forest model exceeded 50, namely SCGB1A1, IGKC, ADIRF, SFTPC, FABP5, CD24, SLPI, CYB5A, TPPP3, FABP4, IGHG4, FOLR1 and CLDN18, and trained again on these 13 genes with the random forest function in the R language environment to construct a second random forest model (function training and random forest model construction both follow conventional methods). Although the number of characteristic genes was reduced to 13, the diagnostic model still achieved accuracy and AUC values consistent with the first random forest model (specifically, an accuracy of about 96.2% and an AUC of about 0.993 in ROC analysis).
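The importance cutoff amounts to a simple filter, sketched below. It assumes the importance measure is a per-gene decrease in Gini as reported by the fitted forest; the function name and the example values in the usage note are hypothetical and are not taken from the patent.

```python
from typing import Dict, List


def select_features(importance: Dict[str, float], cutoff: float = 50.0) -> List[str]:
    """Keep genes whose Gini-based importance exceeds the cutoff, most important first."""
    kept = [gene for gene, value in importance.items() if value > cutoff]
    return sorted(kept, key=lambda gene: importance[gene], reverse=True)
```

For instance, with made-up importances `{"SFTPC": 120.0, "GENE_X": 12.5, "SLPI": 75.0}`, the call returns `["SFTPC", "SLPI"]`; a gene sitting exactly at the cutoff is not kept, because the text says "more than 50".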
Besides performing accurately on the training set, the diagnostic model was evaluated and validated on large clinical cohorts from different platforms and quantified by different methods. In The Cancer Genome Atlas (TCGA) cohort (585 samples, from a different source than the training set), the accuracy of the diagnostic model reaches 93.8% and the AUC in ROC analysis reaches 0.987; similarly, in the GSE31210 cohort (246 samples, from a different source than the training set), the accuracy reaches about 91.9% and the AUC about 0.896, as shown in FIG. 2. The diagnostic model is therefore not only accurate: ROC analysis of the test sets (AUC values of 0.987 and 0.896, respectively) shows that it can also achieve high sensitivity and high specificity. The trained diagnostic model is saved as an RDS file that can be read directly into the R language environment; once the expression values of the 13 genes and the GAPDH gene have been measured and normalized to form the input file, a prediction function returns the probabilities that the sample is normal or tumor. The threshold is 0.5: if the predicted tumor probability exceeds the threshold, the tested sample is indicated to be a tumor sample and the subject has, or is at risk of, a tumor.
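The two reported metrics can be computed from predicted probabilities as sketched below. This is a generic illustration rather than the authors' evaluation code: accuracy uses the 0.5 threshold from the text, and AUC is computed in its rank form, as the fraction of tumor/normal pairs in which the tumor sample receives the higher probability (ties count one half).

```python
from typing import List, Sequence


def accuracy(labels: Sequence[int], probs: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of samples whose thresholded prediction matches the label (1 = tumor)."""
    correct = sum((p > threshold) == bool(y) for y, p in zip(labels, probs))
    return correct / len(labels)


def roc_auc(labels: Sequence[int], probs: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of tumor and normal scores."""
    pos: List[float] = [p for y, p in zip(labels, probs) if y == 1]
    neg: List[float] = [p for y, p in zip(labels, probs) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one tumor and one normal sample")
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))
```

Accuracy depends on the chosen threshold while AUC does not, which is why a model can show different rankings on the two metrics across datasets.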
Example 2 Construction and evaluation of the prognostic model
(1) Cox regression analysis of prognosis-related genes
We first performed a survival analysis across all genes on TCGA expression profile data (using publicly available R software packages) and obtained the characteristic genes that significantly affect patient survival (3272 genes, P value < 0.05). We then compared the large number of stage-specific tumor tissue markers obtained in step (1) of Example 1 with these survival-associated genes and, using a survival analysis P value below 0.01 as the threshold, obtained 355 markers specific to early-stage and late-stage tumor cells that significantly affect patient survival. Single-factor Cox regression analysis of these markers gave the hazard ratio of each; further screening for a hazard ratio greater than 1.1 or less than 0.8 gave 6 characteristic genes, MRPS11, CD3EAP, EMC6, DMD, SIX5 and STK33, with corresponding coefficients of 0.098, 0.181, 0.123, -0.257, 0.114 and -0.288, respectively, as shown in FIG. 3A.
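The hazard-ratio screen can be sketched as follows. In a Cox model the coefficient is the natural logarithm of the hazard ratio, and the listed coefficients are consistent with the stated cutoffs (for example, exp(0.098) ≈ 1.10 and exp(-0.257) ≈ 0.77). Whether the patent derived its coefficients exactly this way is an assumption, and the gene names and values in the usage note are hypothetical.

```python
import math
from typing import Dict


def screen_by_hazard_ratio(
    hazard_ratios: Dict[str, float],
    upper: float = 1.1,
    lower: float = 0.8,
) -> Dict[str, float]:
    """Keep genes with HR > upper or HR < lower and return their ln(HR) coefficients."""
    return {
        gene: math.log(hr)
        for gene, hr in hazard_ratios.items()
        if hr > upper or hr < lower
    }
```

With made-up inputs `{"GENE_A": 1.25, "GENE_B": 1.02, "GENE_C": 0.75}`, only `GENE_A` (positive coefficient, higher risk) and `GENE_C` (negative coefficient, protective) pass the screen.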
(2) Construction and evaluation of prognosis models
To construct the prognosis model, an expression matrix of the 6 genes (MRPS11, CD3EAP, EMC6, DMD, SIX5 and STK33) was built, and the expression of each gene in the matrix was normalized using the mean expression of the 6 genes in the same sample as the reference. The prognostic risk scoring model was then constructed from the coefficients obtained in step (1) of this example and the formula risk score = Σ (coefficient × normalized expression value), i.e., risk score = 0.098 × MRPS11 normalized expression + 0.181 × CD3EAP normalized expression + 0.123 × EMC6 normalized expression - 0.257 × DMD normalized expression + 0.114 × SIX5 normalized expression - 0.288 × STK33 normalized expression.
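The per-sample normalization can be sketched as below. It assumes that "taking the mean as a reference" means dividing each gene's expression by the mean of the six genes in that sample; the text does not state the operation explicitly, and the function name is illustrative.

```python
from typing import Dict

PROGNOSTIC_GENES = ("MRPS11", "CD3EAP", "EMC6", "DMD", "SIX5", "STK33")


def normalize_by_panel_mean(raw: Dict[str, float]) -> Dict[str, float]:
    """Divide each gene by the mean of the six-gene panel in the same sample (assumed)."""
    mean = sum(raw[gene] for gene in PROGNOSTIC_GENES) / len(PROGNOSTIC_GENES)
    if mean <= 0:
        raise ValueError("mean panel expression must be positive")
    return {gene: raw[gene] / mean for gene in PROGNOSTIC_GENES}
```

Under this assumption the normalized values of a sample always average to 1, so the risk score reflects the relative balance among the six genes rather than the overall signal level of the platform.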
The risk score from the risk model was then treated as a risk factor; its mean hazard ratio in the Cox regression analysis was 5.96, indicating that the higher the score, the greater the risk, as shown in FIG. 3B. When patients were divided into high-risk and low-risk groups by quartiles of the risk score, survival differed significantly between the groups, and the median survival time of high-risk patients was significantly shorter than that of low-risk patients, as shown in FIG. 3C.
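One way to form the two groups is sketched below. It assumes that "high risk" means scores at or above the upper quartile and "low risk" means scores at or below the lower quartile; the text says only that grouping follows the quartiles of the risk score, so the exact rule is an assumption.

```python
import statistics
from typing import Dict, List


def quartile_groups(scores: Dict[str, float]) -> Dict[str, List[str]]:
    """Split patients into low (<= Q1) and high (>= Q3) risk groups (assumed rule)."""
    q1, _median, q3 = statistics.quantiles(list(scores.values()), n=4, method="inclusive")
    return {
        "low": [patient for patient, s in scores.items() if s <= q1],
        "high": [patient for patient, s in scores.items() if s >= q3],
    }
```

Patients between the two quartiles belong to neither group under this rule; a median split would be a different, equally hypothetical choice.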
In addition, we evaluated and validated the prognostic risk model in other publicly available large-cohort datasets, GSE13213 (117 tumor samples), GSE31210 (226 tumor samples) and GSE72094 (420 tumor samples), and obtained results consistent with the TCGA dataset, as shown in FIG. 4. This indicates that the prognosis model constructed in this example can accurately predict prognostic risk.
Example 3 Clinical guidance uses of the prognosis model
Based on the prognostic score of the risk model (i.e., the risk score given by the prognostic risk scoring model of Example 2), combined with the clinically determined tumor stage, we can construct a nomogram (TCGA, GSE31210 and GSE72094 were used as training sets when constructing the nomogram) and use it to predict a patient's 1-year, 3-year and 5-year survival probability, as shown in FIG. 6A. The procedure is as follows. From the patient's value on the "prognostic risk score" line, draw a vertical line upward to the "scoring" line and read off the score value corresponding to the risk score; in the same way, read off the score value corresponding to the tumor stage. The sum of the two values is the total score. Then, from that value on the "total score" line, draw a vertical line downward to the survival probability lines; the values at the intersections are the survival probabilities for the different survival periods.
For example, for a subject clinically determined to have a stage II tumor and whose risk score from Example 2 is 0.2, a vertical line drawn upward from 0.2 on the "prognostic risk score" line in FIG. 6A meets the "scoring" line at 72.5, the score value corresponding to a risk score of 0.2. Likewise, a vertical line drawn upward from stage II on the "tumor stage" line meets the "scoring" line at 15, the score value corresponding to tumor stage II. Adding the two (72.5 and 15) gives a total score of 87.5. A vertical line drawn downward from 87.5 on the "total score" line then intersects the 1-year, 3-year and 5-year survival probability lines, and the values at those intersections are the survival probabilities for the respective periods. To evaluate the accuracy of the predictions, we checked them against 1-year, 3-year and 5-year survival probability calibration curves, which indicate that the predictions of this example are highly accurate, as shown in FIG. 6B-D.
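The arithmetic of the worked example can be written out as a small lookup-and-sum sketch. Only the two readings quoted in the text are filled in; every other axis value would have to be read from FIG. 6A, and the table and function names are hypothetical.

```python
from typing import Dict

# Readings quoted in the worked example; other ticks are not given in the text.
RISK_SCORE_POINTS: Dict[float, float] = {0.2: 72.5}
STAGE_POINTS: Dict[str, float] = {"II": 15.0}


def total_points(risk_score: float, stage: str) -> float:
    """Sum the score values read off the nomogram for the risk score and the tumor stage."""
    return RISK_SCORE_POINTS[risk_score] + STAGE_POINTS[stage]
```

Calling `total_points(0.2, "II")` reproduces the total score of 87.5 from the example; the survival probabilities themselves still have to be read from the survival probability lines of the nomogram.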
Furthermore, based on the clinical phenotype data and drug response data of patients in the TCGA database, we divided patient responses to the drug (paclitaxel or pemetrexed) into two categories: response/disease progression controlled, and no response. Comparing the prognostic risk score (i.e., the risk score obtained with the prognostic risk scoring model of example 2) between the two groups showed that the risk score of the no-response group was significantly higher than that of the response/disease progression controlled group, as shown in fig. 5. Further, we fitted a nomogram for predicting the probability of response to the drug (paclitaxel or pemetrexed) from the risk score alone: the prognostic risk score of the patient is calculated with the prognostic model (as described in example 2, step (2) above, i.e., risk score = Σ (coefficient × normalized expression value)), and a vertical line drawn downward from the risk score intersects the "total score" line and the "probability of response/disease progression controlled" line; the value at the latter intersection is the predicted probability of response/disease progression controlled for the drug (paclitaxel or pemetrexed), as shown in fig. 6E. For example, for a subject with a prognostic risk score of 0.5, a vertical line drawn downward at 0.5 on the "prognostic risk score" line in fig. 6E intersects the "probability of response/disease progression controlled" line at about 0.6, which is the subject's probability of response to the drug (paclitaxel or pemetrexed); for a subject with a prognostic risk score of 0.2, a vertical line drawn downward at 0.2 on the "prognostic risk score" line in fig. 6E intersects the "probability of response/disease progression controlled" line at a value greater than 0.9, so this subject is predicted to be more likely to respond to paclitaxel or pemetrexed. Further ROC analysis confirmed the accuracy of the predicted probability of response/disease progression controlled, with an AUC value of 0.881, as shown in fig. 6F. Therefore, risk scoring based on the prognostic model can guide clinical analysis and drug selection to some extent.
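For readers unfamiliar with the AUC figure quoted above, the following minimal sketch shows how an AUC can be computed from risk scores and response labels using the pairwise (Mann-Whitney) definition. The function name `roc_auc` and the eight data points are hypothetical toy values chosen for illustration; they are not the TCGA data and do not reproduce the reported AUC of 0.881.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs in which the positive
    case has the higher score; ties count as half a pair."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: 1 = no response, 0 = response/disease progression controlled.
# A higher risk score is expected to go with no response, as described in the text.
risk_scores = [0.15, 0.22, 0.30, 0.48, 0.55, 0.61, 0.70, 0.85]
no_response = [0, 0, 0, 1, 0, 1, 1, 1]

auc = roc_auc(risk_scores, no_response)
print(f"AUC = {auc:.4f}")
```

An AUC of 0.5 corresponds to a score with no discriminating power and 1.0 to perfect separation of the two groups.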
The foregoing description of the embodiments has been provided to illustrate the general principles of the invention and is not intended to limit the invention; any modifications, equivalent substitutions and improvements may be made without departing from the spirit and principles of the invention.

Claims (10)

1. A kit comprising reagents for detecting the expression level or amount of a biomarker in a sample, and a tag; wherein the biomarker is SCGB1A1, IGKC, ADIRF, SFTPC, FABP5, CD24, SLPI, CYB5A, TPPP3, FABP4, IGHG4, FOLR1, and CLDN18;
preferably, the kit is used to predict or diagnose lung adenocarcinoma in a subject, or to determine whether a subject is at risk of developing lung adenocarcinoma;
preferably, the kit further comprises additional reagents selected from the group consisting of reagents for processing the sample, and reagents for performing PCR amplification (e.g., polymerase, dNTPs, and amplification buffers).
2. The kit of claim 1, wherein the reagents comprise a primer set capable of specifically amplifying SCGB1A1, IGKC, ADIRF, SFTPC, FABP5, CD24, SLPI, CYB5A, TPPP3, FABP4, IGHG4, FOLR1 and CLDN18.
3. The kit of claim 1, wherein the expression level of the biomarker is input into a random forest model to calculate the probability of lung adenocarcinoma; and
wherein the tag states that when the calculated probability of lung adenocarcinoma is greater than a threshold value (e.g., 0.5), the subject is indicated to have lung adenocarcinoma or is at risk of developing lung adenocarcinoma.
4. The kit of claim 3, wherein the expression levels or amounts of SCGB1A1, IGKC, ADIRF, SFTPC, FABP5, CD24, SLPI, CYB5A, TPPP3, FABP4, IGHG4, FOLR1 and CLDN18 are normalized with GAPDH as a reference and then input into the random forest model to calculate the probability of lung adenocarcinoma.
5. The kit of claim 1, wherein the sample is a percutaneous pulmonary biopsy of a subject.
6. Use of a reagent for determining the expression level or amount of a biomarker in the preparation of a kit for predicting or diagnosing lung adenocarcinoma in a subject or determining whether a subject is at risk of developing lung adenocarcinoma; wherein the biomarker is SCGB1A1, IGKC, ADIRF, SFTPC, FABP5, CD24, SLPI, CYB5A, TPPP3, FABP4, IGHG4, FOLR1, and CLDN18.
7. A kit comprising reagents for detecting the expression level or amount of a biomarker in a sample, and a tag, wherein the biomarker is MRPS11, CD3EAP, EMC6, DMD, SIX5, and STK33;
preferably, the tag states the formula for calculating a prognostic risk score: risk score = 0.098 × MRPS11 normalized expression + 0.181 × CD3EAP normalized expression + 0.123 × EMC6 normalized expression + 0.257 × DMD normalized expression + 0.114 × SIX5 normalized expression - 0.288 × STK33 normalized expression; the prognostic risk score is used, in combination with the clinically determined tumor stage, to obtain the 1-year, 3-year or 5-year survival probability of a lung adenocarcinoma subject from a nomogram, or the prognostic risk score is used to obtain the probability of response of a lung adenocarcinoma patient to paclitaxel or pemetrexed from a nomogram;
preferably, the expression levels or amounts of MRPS11, CD3EAP, EMC6, DMD, SIX5 and STK33 are normalized using the average value of the expression amounts of MRPS11, CD3EAP, EMC6, DMD, SIX5 and STK33 as a reference, so as to obtain the respective normalized expression amounts;
preferably, the kit further comprises additional reagents selected from the group consisting of reagents for processing the sample, and reagents for performing PCR amplification (e.g., polymerase, dNTPs, and amplification buffers);
preferably, the kit is for predicting the risk of prognosis of a lung adenocarcinoma subject, for predicting the survival probability of a lung adenocarcinoma subject for 1 year, 3 years or 5 years, or for predicting the probability of response of a lung adenocarcinoma patient to paclitaxel or pemetrexed.
8. The kit of claim 7, wherein the reagents comprise a primer set capable of specifically amplifying MRPS11, CD3EAP, EMC6, DMD, SIX5, and STK33.
9. The kit of claim 7, wherein the sample is a percutaneous pulmonary biopsy of a subject.
10. Use of a reagent for determining the expression level of a biomarker in the preparation of a kit, wherein the biomarker is MRPS11, CD3EAP, EMC6, DMD, SIX5, and STK33;
preferably, the kit is for predicting the risk of prognosis of a lung adenocarcinoma subject, for predicting the survival probability of a lung adenocarcinoma subject for 1 year, 3 years or 5 years, or for predicting the probability of response of a lung adenocarcinoma patient to paclitaxel or pemetrexed.
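The two calculations recited in the claims above can be sketched in a few lines: the prognostic risk score formula of claim 7 and the threshold rule of claim 3. This is a minimal illustration, not the applicant's implementation. The function names are hypothetical, the inputs are assumed to be already normalized as the claims describe, and the random forest model that produces the probability is not reproduced here.

```python
# Coefficients as recited in claim 7, keyed by gene symbol.
COEFFICIENTS = {
    "MRPS11": 0.098,
    "CD3EAP": 0.181,
    "EMC6": 0.123,
    "DMD": 0.257,
    "SIX5": 0.114,
    "STK33": -0.288,
}


def prognostic_risk_score(normalized_expression):
    """Risk score = sum of coefficient x normalized expression over the six genes."""
    missing = sorted(set(COEFFICIENTS) - set(normalized_expression))
    if missing:
        raise ValueError(f"missing normalized expression for: {', '.join(missing)}")
    return sum(coef * normalized_expression[gene] for gene, coef in COEFFICIENTS.items())


def indicates_lung_adenocarcinoma(probability, threshold=0.5):
    """Threshold rule of claim 3: a probability greater than the threshold
    (0.5 is the example value given in the claim) indicates lung adenocarcinoma
    or a risk of developing it."""
    return probability > threshold


# Hypothetical input: every gene at a normalized expression of 1.0.
example = {gene: 1.0 for gene in COEFFICIENTS}
score = prognostic_risk_score(example)
print(round(score, 3))
print(indicates_lung_adenocarcinoma(0.73))
```

Only STK33 carries a negative coefficient, so higher normalized STK33 expression lowers the risk score while higher expression of the other five genes raises it.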
CN202310215808.0A 2023-03-08 2023-03-08 Lung adenocarcinoma diagnosis model and prognosis model based on machine learning and single cell transcription map Pending CN116064822A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310215808.0A CN116064822A (en) 2023-03-08 2023-03-08 Lung adenocarcinoma diagnosis model and prognosis model based on machine learning and single cell transcription map

Publications (1)

Publication Number Publication Date
CN116064822A true CN116064822A (en) 2023-05-05

Family

ID=86177005

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310215808.0A Pending CN116064822A (en) 2023-03-08 2023-03-08 Lung adenocarcinoma diagnosis model and prognosis model based on machine learning and single cell transcription map

Country Status (1)

Country Link
CN (1) CN116064822A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination