CN116246710A - Colorectal cancer prediction model based on cluster molecules and application - Google Patents

Info

Publication number
CN116246710A
Authority
CN
China
Prior art keywords
colorectal cancer
genes
screening
cluster
expression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211743182.2A
Other languages
Chinese (zh)
Inventor
陈炳坤
马宁芳
周桂清
齐玲
彭骞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
QINGYUAN PEOPLE'S HOSPITAL
Original Assignee
QINGYUAN PEOPLE'S HOSPITAL
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by QINGYUAN PEOPLE'S HOSPITAL
Priority to CN202211743182.2A
Publication of CN116246710A
Legal status: Pending

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10 Gene or protein expression profiling; Expression-ratio estimation or normalisation
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/10 Ontologies; Annotations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/40 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for data related to laboratory analysis, e.g. patient specimen analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Theoretical Computer Science (AREA)
  • Biotechnology (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioethics (AREA)
  • Genetics & Genomics (AREA)
  • Primary Health Care (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Pathology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention belongs to the technical field of biomedicine and discloses a colorectal cancer prediction model based on cluster molecules and an application thereof. Using an openly shared human protein database and the public TCGA database, the invention performs intersection analysis with machine learning methods to screen out a specific subset of molecules that can be detected in blood and that indicates colorectal cancer risk, and builds a logistic regression model on GEO colorectal cancer data sets. The model predicts colorectal cancer risk with an AUC of 0.962 and accurately distinguishes high-risk from low-risk people in the test set. By modeling the aggregate effect of multiple variables, the invention performs multivariate joint screening and greatly improves the prediction accuracy and sensitivity of the model; the blood-based screening method is simple and convenient and is suitable for clinical adoption.

Description

Colorectal cancer prediction model based on cluster molecules and application
Technical Field
The invention relates to the technical field of biomedicine, in particular to a colorectal cancer prediction model based on cluster molecules and application thereof.
Background
Colorectal cancer (CRC) is one of the most common malignant tumors of the digestive system, ranking third in incidence and second in mortality. Because the early diagnosis rate is low, most patients are already at a middle or late stage at initial diagnosis and have a poor prognosis. Current screening for colorectal cancer relies on colonoscopy, colon CT, serological tests, or fecal occult blood tests. Colonoscopy and colon CT have high diagnostic rates, but the examinations are relatively complex and costly and require bowel preparation in advance, so patient compliance is low and they are difficult to include in routine physical examinations; patients are often diagnosed only after typical symptoms such as hematochezia appear, by which time the disease has progressed and the optimal treatment window has been missed. The fecal occult blood test is easy to sample and gives a visible result that readily draws the patient's attention, but CRC is usually already advanced by the time fecal occult blood is positive. Blood-based testing is widely used in clinical practice, and common markers include CEA, CA199, CA242, and CA50. These molecules are expressed in a variety of tumors and are often used as early warning signals of cancer, but existing serological tests are limited by their reliance on a single marker, the low specificity of the detected molecules, and differences between individual patients, so the positive detection rate still needs improvement. Developing new techniques that raise the early detection rate of colorectal cancer is therefore a problem to be solved.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a colorectal cancer prediction model based on cluster molecules and application thereof.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
in a first aspect, the invention provides a method for constructing a colorectal cancer prediction model based on cluster molecules, which comprises the following steps:
(1) Collecting colorectal cancer transcriptome sequencing data and colorectal cancer data sets; extracting probe values and probe annotations from the colorectal cancer data sets and removing batch effects to obtain a merged data set;
(2) Collecting genes that can encode proteins found in blood;
(3) Screening the colorectal cancer transcriptome sequencing data for differentially expressed genes, and validating the differentially expressed genes with the merged data set;
(4) Screening, from the genes of step (2), colorectal cancer specifically expressed genes that can encode proteins found in blood, using weighted gene co-expression network analysis;
(5) Screening colorectal cancer specific protein-coding genes from the colorectal cancer specifically expressed genes with machine learning methods to obtain the colorectal cancer cluster molecules;
(6) Verifying the credibility of the colorectal cancer cluster molecules;
(7) Fitting a regression model on a training cohort drawn from the merged data set, and multiplying the expression value of each colorectal cancer cluster molecule by its regression coefficient to obtain a combined diagnosis score.
Using the openly shared human protein databases and the public TCGA database, the invention performs intersection analysis with machine learning methods to screen out a specific subset of genes that encode proteins found in blood and that indicate colorectal cancer risk, and then builds a logistic regression model (the prediction model) on the colorectal cancer data sets in the GEO database for use in blood-based colorectal cancer screening and risk assessment. The approach relies on the fact that genes abnormally expressed in tumor cells can be translated into the corresponding proteins, which are released into the blood through different routes; a blood test is chosen to improve subject compliance. The protein molecule group highly related to CRC is screened from large-sample CRC databases, which ensures the specificity of the detected molecules. In addition, the absolute detection values of the highly expressed CRC protein cluster molecules are entered into the regression equation, and the resulting value serves as a comprehensive evaluation index, which markedly improves the objectivity, accuracy, and sensitivity of CRC diagnosis and makes it more representative.
As a preferred embodiment of the construction method according to the present invention, in step (1), the transcriptome sequencing data are from The Cancer Genome Atlas (TCGA) database; the colorectal cancer data sets are the independent data sets GSE9348 and/or GSE41258 from the Gene Expression Omnibus (GEO) database; probe values and probe annotations are extracted from GSE9348 and/or GSE41258, and batch effects are removed with the sva software package to obtain a merged data set.
Preferably, in step (2), the gene data are from the human protein database HPA and/or the human body fluid protein database HBFP. Preferably, the blood comprises whole blood, serum, and plasma.
As a preferred embodiment of the construction method of the present invention, in step (3), the differentially expressed genes are screened with the limma software package, and the screening criteria are |log2 FC| > 1.5 and FDR < 0.05.
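The screening criterion in step (3) can be illustrated with a minimal sketch. It is not limma: it assumes each gene already has a log2 fold change and a raw p-value, assumes "FDR" means Benjamini-Hochberg adjusted p-values, and assumes the fold-change cutoff applies in both directions, since the description later reports both up-regulated and down-regulated genes. The gene names and numbers are hypothetical toy values.

```python
from typing import Dict, List, Tuple


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def filter_degs(
    stats: Dict[str, Tuple[float, float]],
    lfc_cut: float = 1.5,
    fdr_cut: float = 0.05,
) -> Tuple[List[str], List[str]]:
    """stats maps gene -> (log2 fold change, raw p-value).

    Returns (up-regulated genes, down-regulated genes).
    """
    genes = list(stats)
    fdr = benjamini_hochberg([stats[g][1] for g in genes])
    up, down = [], []
    for gene, q in zip(genes, fdr):
        lfc = stats[gene][0]
        if q < fdr_cut and lfc > lfc_cut:
            up.append(gene)
        elif q < fdr_cut and lfc < -lfc_cut:
            down.append(gene)
    return up, down


# Hypothetical toy statistics, not values from the patent.
toy_stats = {
    "GENE_A": (2.3, 0.0004),
    "GENE_B": (-1.9, 0.0010),
    "GENE_C": (0.4, 0.0200),
    "GENE_D": (1.8, 0.6000),
}
up_genes, down_genes = filter_degs(toy_stats)
```

Here GENE_C fails the fold-change cutoff and GENE_D fails the FDR cutoff, so only GENE_A and GENE_B pass.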
In step (4), weighted gene co-expression network analysis (WGCNA) clusters genes with highly coordinated variation into modules, analyzes the connections among genes within each module and the correlation between each module and the clinicopathological features of colorectal cancer, and identifies the core genes in the modules with the highest correlation; the core genes are taken from the MEblue and/or MEturquoise modules.
As a preferred embodiment of the construction method according to the present invention, in step (5), the screening is performed using a LASSO regression model, a random forest algorithm, and the SVM-RFE algorithm. The colorectal cancer specific protein-coding genes are specifically and highly expressed in colorectal cancer cells, and the encoded proteins enter blood or other body fluids through different routes; the colorectal cancer cluster molecules can serve as a signal for blood-based colorectal cancer screening and risk warning. Preferably, the colorectal cancer cluster molecules are obtained by gene-level screening, and their use involves detecting the expression level of the proteins encoded by the cluster molecules, which encompasses all immunological, biological, and chemical detection methods and other relevant protein detection means in the art.
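The intersection step can be sketched as follows, assuming each of the three feature-selection methods has already produced its own set of retained genes. The selector outputs below (`lasso_hits`, `rf_hits`, `svm_rfe_hits`) and the gene names are hypothetical placeholders, not results from the patent; only the set-intersection logic is illustrated.

```python
from functools import reduce
from typing import Iterable, List, Set


def intersect_selections(selections: Iterable[Set[str]]) -> List[str]:
    """Return the genes retained by every selector, sorted by name."""
    sets = list(selections)
    if not sets:
        return []
    common = reduce(lambda a, b: a & b, sets)
    return sorted(common)


# Hypothetical outputs of three independent feature selectors.
lasso_hits = {"GENE_1", "GENE_2", "GENE_3", "GENE_4"}
rf_hits = {"GENE_1", "GENE_2", "GENE_3", "GENE_5"}
svm_rfe_hits = {"GENE_1", "GENE_2", "GENE_3", "GENE_4", "GENE_6"}

consensus = intersect_selections([lasso_hits, rf_hits, svm_rfe_hits])
```

Requiring agreement across all three selectors trades recall for robustness: a gene favored by only one algorithm is dropped.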
As a preferred embodiment of the construction method according to the present invention, in step (6), the verification checks the expression of the colorectal cancer cluster molecules in CRC samples with the limma software package, and the area under the ROC curve (AUC) and its 95% confidence interval are calculated for each colorectal cancer cluster molecule with the pROC software package.
In step (7), the merged data set is randomly divided at a 1:1 ratio into a CRC training cohort and a validation cohort. A prediction model is built in the training cohort by logistic regression, and its stability is checked by 10-fold cross-validation. The expression value of each colorectal cancer cluster molecule is multiplied by its regression coefficient to obtain the combined diagnosis score; the AUC of the prediction model is calculated in the training cohort, and the accuracy of the prediction model is assessed in the validation cohort.
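The 1:1 random split and the 10-fold cross-validation partition can be sketched with the standard library alone. The sample identifiers, the seed, and the count of 460 samples (taken from the usable-sample figure quoted in the description) are used here only for illustration; model fitting is left out.

```python
import random
from typing import List, Sequence, Tuple


def split_half(
    sample_ids: Sequence[str], seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle the samples and split them 1:1 into training and validation."""
    shuffled = list(sample_ids)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


def k_fold_indices(
    n: int, k: int = 10, seed: int = 0
) -> List[Tuple[List[int], List[int]]]:
    """Return k (train indices, held-out indices) pairs covering range(n)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        held_out = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        splits.append((train, held_out))
    return splits


# Hypothetical sample identifiers.
ids = [f"sample_{i}" for i in range(460)]
training, validation = split_half(ids)
cv_splits = k_fold_indices(len(training), k=10)
```

Each of the ten folds holds out a different tenth of the training cohort, so every training sample is used for validation exactly once.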
Preferably, the combined diagnosis score is expressed as: Cd score = Σ (cluster molecule expression value × regression coefficient) + B, where Cd stands for combined diagnosis; the cluster molecule expression value is a protein expression value; and B is the constant term of the logistic regression, generated automatically by the regression analysis.
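A minimal sketch of the score computation follows. The molecule names, coefficients, and constant term are hypothetical placeholders, since the document does not list the fitted values. The conversion of the score to a probability with the logistic function is the standard reading of a logistic regression linear predictor and is an assumption about how the score would be used.

```python
import math
from typing import Mapping


def cd_score(
    expression: Mapping[str, float],
    coefficients: Mapping[str, float],
    intercept: float,
) -> float:
    """Cd score = sum(expression value * regression coefficient) + B."""
    missing = set(coefficients) - set(expression)
    if missing:
        raise KeyError(f"missing expression values for: {sorted(missing)}")
    linear = sum(coefficients[m] * expression[m] for m in coefficients)
    return linear + intercept


def predicted_probability(score: float) -> float:
    """Map the linear score to a probability with the logistic function."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    z = math.exp(score)  # avoids overflow for very negative scores
    return z / (1.0 + z)


# Hypothetical coefficients and measurements, for illustration only.
toy_coefficients = {"MOL_1": 0.5, "MOL_2": -0.25}
toy_expression = {"MOL_1": 2.0, "MOL_2": 4.0}
toy_intercept = 0.5

score = cd_score(toy_expression, toy_coefficients, toy_intercept)
probability = predicted_probability(score)
```

With these toy numbers the two weighted terms cancel, so the score equals the constant term.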
Preferably, the prediction model is evaluated on the actual combined diagnosis score, specifically on the area under the ROC curve (AUC) obtained for the subjects and on the prediction accuracy. The ROC curve is used to judge the accuracy and practical value of each cluster molecule in predicting colorectal cancer risk: when AUC < 0.5 the variable has no predictive value; when 0.5 < AUC < 0.7 its predictive accuracy is low; when 0.7 < AUC < 0.9 its predictive accuracy is medium; when AUC > 0.9 its accuracy is high; the ideal value is AUC = 1.
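The AUC bands can be written as a small lookup. This sketch assumes the medium band spans 0.7 to 0.9, and it assigns the boundary values 0.5, 0.7, and 0.9 to the lower band, which the text leaves unspecified.

```python
def interpret_auc(auc: float) -> str:
    """Classify an AUC value into the qualitative bands described above."""
    if not 0.0 <= auc <= 1.0:
        raise ValueError("AUC must lie between 0 and 1")
    if auc <= 0.5:  # boundary handling is an assumption
        return "no predictive value"
    if auc <= 0.7:
        return "low"
    if auc <= 0.9:
        return "medium"
    return "high"


# AUC values quoted in this document.
bands = {auc: interpret_auc(auc) for auc in (0.962, 0.76, 0.71)}
```

Under these bands the cluster molecule model (AUC 0.962) falls in the high band, while both CEA results quoted in the document (0.76 and 0.71) fall in the medium band.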
In a second aspect, the present invention provides a colorectal cancer prediction model based on cluster molecules, which is constructed by the method.
In a third aspect, the invention provides the use of colorectal cancer cluster molecules, including QSOX2, TGFBI, CD44, INHBA, S100A11, VEGFA and MET, in the preparation of colorectal cancer screening and/or prediction reagents. Preferably, the combined diagnosis score of the expression of the colorectal cancer cluster molecules is applied in the preparation of colorectal cancer screening and/or prediction reagents and can serve as an early warning signal and an early molecular screening tool for colorectal cancer. QSOX2 encodes a secreted protein associated with tumor proliferation; TGFBI is a tumor-associated secreted protein induced by transforming growth factor β; VEGFA is associated with angiogenesis; CD44 is associated with tumor stemness; MET is associated with the Wnt signaling pathway mutated in tumors; S100A11 is a member of the S100 protein family and is highly expressed in various tumors; INHBA is a member of the transforming growth factor β (TGF-β) superfamily and is involved in processes such as tumor angiogenesis.
In a fourth aspect, the invention applies the colorectal cancer prediction model based on cluster molecules in the preparation of colorectal cancer screening and/or prediction reagents.
Compared with the prior art, the invention has the beneficial effects that:
The invention performs blood-based screening with a CRC prediction model built on the Cd score of colorectal cancer specifically expressed cluster molecules. The detection markers are screened from CRC sequencing data in several databases, so the sample size is large and the samples are representative. In validation with clinical samples from patients diagnosed with colorectal cancer, the expression of the protein cluster molecules in patients' blood was measured and substituted into the prediction model to calculate the Cd score, giving an AUC of 0.962 and an accuracy of 91.9%. Compared with the CEA blood test commonly used in the clinic (AUC 0.71, accuracy 79.7%), the accuracy of the protein cluster molecule prediction model is 12.2 percentage points higher. Compared with colonoscopy and CT, the method is simple, economical, and easy to popularize, and is suitable for early screening of CRC.
Drawings
FIG. 1 is a flow chart of the screening of CRC-specific cluster molecules in blood and the construction of the prediction model.
FIG. 2 is the cluster analysis of differentially expressed genes based on the TCGA-CRC data set; a is the volcano plot of the differential genes; b is the heat map of the differentially expressed genes.
FIG. 3 is the weighted gene co-expression network analysis (WGCNA); a is the WGCNA soft threshold; b is the gene dendrogram and module classification; c is the module gene weight analysis.
FIG. 4 is the enrichment analysis of the differentially expressed molecules.
FIG. 5 shows the intersection analysis of the three algorithms and the screening of the differentially expressed molecules.
FIG. 6 is the validation of the differentially expressed molecules based on the merged GSE9348 and GSE41258 data set; a is the validation of the expression levels of the colorectal cancer differential genes; b is the ROC curve and AUC value of each gene.
FIG. 7 shows the prediction performance of the cluster molecule model and the CEA model on the merged GEO data set; a is the ROC curves of the cluster molecule model and the CEA control model; b is the confusion matrices of the two models.
FIG. 8 is the expression of the cluster molecules in clinical samples; a is the mRNA detection of the cluster molecules in colorectal cancer clinical samples; b is the protein-level detection of the cluster molecules in colorectal cancer clinical samples.
FIG. 9 is the application of the CRC cluster molecules and CEA to prediction in clinical serum samples from CRC patients; a is the ROC curves of the cluster molecule prediction model and the CEA control model; b is the confusion matrices of the cluster molecule prediction model and the CEA control model.
Detailed Description
For a better description of the objects, technical solutions and advantages of the present invention, the present invention will be further described with reference to the following specific examples. It will be appreciated by persons skilled in the art that the specific embodiments described herein are for purposes of illustration only and are not intended to be limiting.
The test methods used in the examples are conventional methods unless otherwise specified; the materials, reagents and the like used, unless otherwise specified, are all commercially available.
Examples: establishing a hematological prediction model based on colorectal cancer specific protein cluster molecule expression
Blood contains proteins that are synthesized and secreted by cells in various tissues and organs, or released into the blood in other ways; they mainly include proteins with important physiological functions such as cytokines, growth factors, complement, antibodies, peptide hormones, and immunoglobulins. Rapid advances in omics and bioinformatics have driven oncology research forward. The invention uses globally shared public databases, performs data mining on sequencing databases of various malignant tumors, and screens for specific tumor markers in blood so as to improve the clinical diagnosis rate.
The invention uses the colorectal cancer transcriptome sequencing data in the TCGA and GEO databases to screen for genes differentially expressed in colorectal cancer. These are compared against the genes encoding proteins found in blood, and intersection analysis is performed with machine learning methods including a LASSO regression model, a random forest algorithm, and the SVM-RFE algorithm, which screens out 7 intersecting CRC characteristic protein molecules (QSOX2, TGFBI, CD44, INHBA, S100A11, VEGFA and MET). After validation with the CRC data sets in the GEO database, an equation based on the blood test values of this "colorectal cancer characteristic protein cluster", that is, the prediction model, is constructed by logistic regression and can be used for blood-based colorectal cancer screening. The technical flow is shown in FIG. 1.
The method comprises the following steps:
1. Patient dataset screening
Three data sets were used as study subjects: the colorectal cancer (CRC) transcriptome sequencing data in The Cancer Genome Atlas (TCGA), and two independent colorectal cancer data sets, GSE9348 and GSE41258, in the Gene Expression Omnibus (GEO) database. The three independent data sets from the two databases cover Asian, European, and American populations. The TCGA CRC data set contains 699 samples in total, comprising 55 healthy samples and 644 tumor samples; after GSE9348 and GSE41258 were merged, the total number of samples was 472, of which 460 were usable. Sample summary information is detailed in Table 1.
Table 1: three independent data set clinical information summary
Figure BDA0004029452330000061
2. Determination of protein molecules that may be present in blood
The human protein database HPA (https://www.proteinatlas.org/) and the human body fluid protein database HBFP (https://bmbl.bmi.osumc.edu/HBFP) were searched, and 1524 genes capable of encoding blood proteins were screened out.
3. Gene expression data processing
Gene expression analysis was performed with the limma software package on the FPKM values of cancer and paracancerous tissues in the TCGA-CRC data set, and differentially expressed genes were screened using |log2 FC| > 1.5 and FDR < 0.05 as criteria (see FIG. 2 for details), yielding 325 up-regulated genes and 358 down-regulated genes in CRC. Two independent CRC data sets, GSE9348 and GSE41258, were selected from the GEO database; probe values and probe annotations were extracted from the original data, batch effects were removed with the sva software package, and the two data sets were merged into a new data set. The newly generated GEO data set was used to validate the CRC differentially expressed genes screened from the TCGA database.
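The idea behind batch-effect removal can be shown with a deliberately simplified stand-in. This sketch is not the sva/ComBat method used in the document: it only subtracts each batch's mean for one gene and restores the overall mean, which removes additive shifts between data sets but none of the variance differences that a full batch-correction model also handles. The expression values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def center_by_batch(values: List[float], batches: List[str]) -> List[float]:
    """Remove per-batch mean shifts for one gene, keeping the overall mean."""
    if len(values) != len(batches):
        raise ValueError("values and batches must have the same length")
    grand_mean = sum(values) / len(values)
    grouped: Dict[str, List[float]] = defaultdict(list)
    for value, batch in zip(values, batches):
        grouped[batch].append(value)
    batch_mean = {b: sum(vs) / len(vs) for b, vs in grouped.items()}
    return [v - batch_mean[b] + grand_mean for v, b in zip(values, batches)]


# Hypothetical expression values of one gene in four samples.
expr = [5.0, 7.0, 9.0, 11.0]
batch = ["GSE9348", "GSE9348", "GSE41258", "GSE41258"]
corrected = center_by_batch(expr, batch)
```

After centering, both data sets share the same mean for the gene. Note that plain centering also erases real biological differences if tumor and normal samples are unevenly distributed between batches, which is one reason dedicated packages model the batch and the biological group together.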
4. Screening of colorectal cancer specific expression genes encoding proteins related to blood by WGCNA assay
Gene expression patterns across multiple samples were analyzed by weighted gene co-expression network analysis (WGCNA). Genes with highly coordinated variation were clustered into modules (gene sets), and the connections among genes within each module and the correlation between each module and the clinicopathological features were analyzed to find the core genes in the most highly correlated modules (that is, the modules with high weight scores). The two modules with the highest weight scores (MEblue and MEturquoise) were selected as the candidate gene sets for the prediction model, yielding 125 up-regulated genes highly correlated with the clinicopathological features of CRC. Gene enrichment analysis of these up-regulated genes showed that most of them are enriched in colorectal tumor-related pathways (see FIG. 3 and FIG. 4 for details).
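The weighting at the core of WGCNA can be sketched in a few lines: in an unsigned network, the adjacency between two genes is the absolute Pearson correlation of their expression profiles raised to a soft-threshold power β. The value β = 6 and the toy profiles below are assumptions for illustration; the document selects its soft threshold from the analysis in FIG. 3 and does not state the value. Module detection and eigengene correlation are not covered.

```python
import math
from typing import Dict, List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant profile carries no co-expression signal
    return cov / math.sqrt(vx * vy)


def adjacency(
    expr: Dict[str, List[float]], beta: int = 6
) -> Dict[Tuple[str, str], float]:
    """Unsigned soft-threshold adjacency: |cor(g, h)| ** beta for g != h."""
    genes = list(expr)
    return {
        (g, h): abs(pearson(expr[g], expr[h])) ** beta
        for g in genes
        for h in genes
        if g != h
    }


def connectivity(adj: Dict[Tuple[str, str], float], gene: str) -> float:
    """Sum of a gene's adjacencies to all other genes."""
    return sum(w for (g, _), w in adj.items() if g == gene)


# Hypothetical expression profiles across four samples.
toy_expr = {
    "GENE_1": [1.0, 2.0, 3.0, 4.0],
    "GENE_2": [2.0, 4.0, 6.0, 8.0],
    "GENE_3": [1.0, -1.0, 1.0, -1.0],
}
adj = adjacency(toy_expr)
```

Raising the correlation to a power suppresses weak correlations much more than strong ones, so GENE_1 and GENE_2 (perfectly correlated) stay tightly linked while their links to GENE_3 shrink to nearly zero.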
5. Colorectal cancer specific protein coding gene further screened based on machine learning method
Using a LASSO regression model, a random forest algorithm, and the SVM-RFE algorithm, 7 intersecting molecules that are synchronously highly expressed in CRC, namely QSOX2, TGFBI, CD44, INHBA, S100A11, VEGFA and MET, were screened out and are collectively called the CRC cluster molecules (see FIG. 5 for details).
6. Credibility verification of the CRC cluster molecules based on the merged GSE9348 and GSE41258 data set
The limma package was used to verify the expression of the CRC cluster molecules in CRC samples; all of them were up-regulated. The area under the ROC curve (AUC) and its 95% confidence interval were calculated for the CRC cluster molecules with the pROC software package (see FIG. 6 for details).
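The AUC of a single marker can be computed directly from its definition as the probability that a randomly chosen tumor sample scores higher than a randomly chosen non-tumor sample, with ties counted as one half. This pairwise sketch is an illustration on hypothetical values; it does not reproduce the confidence-interval computation of the pROC package.

```python
from typing import List


def roc_auc(scores: List[float], labels: List[int]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Hypothetical marker values; label 1 = tumor, 0 = non-tumor.
toy_scores = [0.9, 0.8, 0.4, 0.35, 0.2]
toy_labels = [1, 1, 0, 1, 0]
auc = roc_auc(toy_scores, toy_labels)
```

In this toy case five of the six tumor/non-tumor pairs are ordered correctly, so the AUC is 5/6.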
7. Construction of the prediction model based on the CRC cluster molecules and the merged GEO data set
The merged GSE9348 and GSE41258 data set was randomly divided at a 1:1 ratio into a CRC training cohort and a validation cohort. A prediction model was built in the training cohort by logistic regression, and its stability was checked by 10-fold cross-validation. The cluster molecule expression values were multiplied by the regression coefficients to obtain the combined diagnosis score, expressed as: Cd score = Σ (cluster molecule expression value × regression coefficient) + B, where the cluster molecule expression value is a protein expression value; Cd stands for combined diagnosis; and B is the constant term of the logistic regression, generated automatically by the regression analysis. The AUC of the prediction model was calculated in the training cohort, and the accuracy of the prediction model was assessed in the validation cohort. The AUC was 0.97 in the training cohort, 0.93 in the validation cohort, and 0.95 in the merged data set, with an accuracy of 90.2%; for the control, which used CEA serological detection, the model AUC was 0.76 and the accuracy was 71.5%. These results indicate that the present prediction model distinguishes tumor from non-tumor populations more accurately (see FIG. 7 for details).
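How an accuracy figure follows from a confusion matrix can be sketched as below. The sketch assumes samples are called tumor when the Cd score is at or above 0, which corresponds to a predicted probability of 0.5 in a logistic model; the document does not state the cutoff it used, and the scores and labels here are hypothetical.

```python
from typing import Dict, List


def confusion_matrix(
    scores: List[float], labels: List[int], threshold: float = 0.0
) -> Dict[str, int]:
    """Count true/false positives and negatives at a score threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            counts["tp"] += 1
        elif predicted == 1 and label == 0:
            counts["fp"] += 1
        elif predicted == 0 and label == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


def accuracy(cm: Dict[str, int]) -> float:
    """Fraction of samples classified correctly."""
    total = sum(cm.values())
    return (cm["tp"] + cm["tn"]) / total


def sensitivity(cm: Dict[str, int]) -> float:
    """Fraction of true tumor samples that were called tumor."""
    return cm["tp"] / (cm["tp"] + cm["fn"])


# Hypothetical Cd scores; label 1 = tumor, 0 = non-tumor.
toy_scores = [2.1, 0.4, -0.3, -1.7, 0.9, -0.2]
toy_labels = [1, 1, 1, 0, 0, 0]
cm = confusion_matrix(toy_scores, toy_labels)
```

Moving the threshold trades sensitivity against specificity, which is why the AUC, which summarizes all thresholds, is reported alongside accuracy.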
8. Clinical CRC sample predictive analysis (model verification)
Eighty paired cancer and paracancerous tissue samples (see Table 2 for details), 15 normal human serum samples, and 60 serum samples from patients with confirmed CRC (see Table 3 for details) were collected in accordance with the regulations of the clinical trial review committee in Qingyuan City, with signed informed consent from the patients.
Table 2: clinical pathological information of colon cancer patient
Figure BDA0004029452330000081
Table 3: clinical pathology information of serum samples
Figure BDA0004029452330000082
The expression of the 7 cluster molecules in CRC tissues was verified at the mRNA and protein levels by quantitative PCR and western blotting. The expression level of each molecule was up-regulated in cancer tissues compared with paracancerous tissues, demonstrating that, at both the gene level and the protein level, the expression of the cluster molecules is heterogeneous and representative (see FIG. 8 for details). An enzyme-linked immunosorbent assay (ELISA) kit was used to measure the expression of the characteristic genes in serum, giving the corresponding expression data set (see FIG. 9 for details). When the prediction model was applied in logistic regression analysis, the cluster molecule prediction model reached an accuracy of 91.9% with AUC = 0.962, clearly higher than the single-factor CEA prediction (AUC = 0.71, accuracy 79.7%), indicating that the cluster molecule prediction model of the invention distinguishes CRC high-risk and low-risk groups more accurately.
The research result of the invention shows that the occurrence risk of colorectal cancer can be accurately predicted by carrying out hematological screening on the CRC prediction model constructed based on the colorectal cancer specific expression cluster molecule Cd score value, and a novel method is provided for improving the detection rate of colorectal cancer, so that the invention has wide application prospect.
Finally, it should be noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the scope of the present invention, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that the technical solution of the present invention may be modified or substituted equally without departing from the spirit and scope of the technical solution of the present invention.

Claims (10)

1. The method for constructing the colorectal cancer prediction model based on cluster molecules is characterized by comprising the following steps of:
(1) Collecting colorectal cancer transcriptome sequencing data and colorectal cancer data sets; extracting probe values and probe annotations from the colorectal cancer data sets and removing batch effects to obtain a merged data set;
(2) Collecting genes that can encode proteins found in blood;
(3) Screening the colorectal cancer transcriptome sequencing data for differentially expressed genes, and validating the differentially expressed genes with the merged data set;
(4) Screening, from the genes of step (2), colorectal cancer specifically expressed genes that can encode proteins found in blood, using weighted gene co-expression network analysis;
(5) Screening colorectal cancer specific protein-coding genes from the colorectal cancer specifically expressed genes with machine learning methods to obtain the colorectal cancer cluster molecules;
(6) Verifying the credibility of the colorectal cancer cluster molecules;
(7) Fitting a regression model on a training cohort drawn from the merged data set, and multiplying the expression value of each colorectal cancer cluster molecule by its regression coefficient to obtain a combined diagnosis score, thereby obtaining the colorectal cancer prediction model.
2. The method of claim 1, wherein in step (1), the transcript sequencing data is from a The Cancer Genome Atlas database; the colorectal cancer data set is from independent colorectal cancer data sets GSE9348 and/or GSE41258 in the GENE EXPRESSION OMNIBUS database; probe values and probe annotations were extracted from the colorectal cancer dataset GSE9348 and/or GSE41258, and batch effects were removed using the Sva software package, resulting in a merged dataset.
3. The construction method of claim 1, wherein in step (3), the differentially expressed genes are screened using the limma software package, with screening criteria of log2 FC > 1.5 and FDR < 0.05.
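The screening rule in claim 3 can be sketched in plain Python. The claim names the limma software package, which fits moderated linear models; the sketch below does not reproduce that and assumes that log2 fold changes and raw p-values are already available. The function names, the toy values, and the `two_sided` switch (which applies the cutoff to the absolute fold change) are illustrative assumptions, and the FDR is computed here with the Benjamini-Hochberg procedure.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (FDR) in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so each value is capped by its successors.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


def screen_degs(genes, log2fc, pvalues, fc_cutoff=1.5, fdr_cutoff=0.05,
                two_sided=False):
    """Keep genes that pass the fold-change and FDR cutoffs given in the claim."""
    fdr = benjamini_hochberg(pvalues)
    selected = []
    for gene, fc, q in zip(genes, log2fc, fdr):
        # Assumption: two_sided=True compares the absolute fold change instead.
        effect = abs(fc) if two_sided else fc
        if effect > fc_cutoff and q < fdr_cutoff:
            selected.append(gene)
    return selected


# Hypothetical toy input: gene labels, log2 fold changes and raw p-values.
genes = ["G1", "G2", "G3", "G4"]
log2fc = [2.1, -1.8, 0.4, 1.7]
pvalues = [0.001, 0.002, 0.03, 0.5]

up_only = screen_degs(genes, log2fc, pvalues)
up_and_down = screen_degs(genes, log2fc, pvalues, two_sided=True)
print(up_only, up_and_down)
```

With the default setting only up-regulated genes can pass; setting `two_sided=True` also admits strongly down-regulated genes, which is a common reading of such a cutoff but is not stated in the claim.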
4. The construction method of claim 1, wherein in step (4), weighted gene co-expression network analysis is used to cluster genes with highly coordinated expression changes into modules, and the interconnectivity of the genes within the modules and the correlation between the modules and the clinicopathological features of colorectal cancer are analyzed to identify the core genes in the modules with the highest correlation; the core genes are taken to be those of the MEblue and/or MEturquoise modules.
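Two quantities that weighted gene co-expression network analysis relies on can be illustrated with a short sketch: the soft-threshold adjacency between genes, and the correlation of each gene with a clinical trait. This is not the WGCNA package itself. Module detection by hierarchical clustering and module eigengenes are omitted, and the soft-threshold power of 6, the gene labels, and the expression values are assumptions chosen only for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def adjacency(expr, power=6):
    """Unsigned soft-threshold adjacency: a_ij = |cor(gene_i, gene_j)| ** power."""
    genes = list(expr)
    return {
        (g, h): abs(pearson(expr[g], expr[h])) ** power
        for g in genes for h in genes if g != h
    }


def connectivity(adj, gene):
    """Sum of one gene's adjacencies to all other genes."""
    return sum(a for (g, _), a in adj.items() if g == gene)


def gene_significance(expr, trait):
    """Absolute correlation of each gene with a clinical trait (e.g. tumor = 1)."""
    return {g: abs(pearson(values, trait)) for g, values in expr.items()}


# Hypothetical expression values for three genes across four samples.
expr = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [2.1, 3.9, 6.2, 8.0],
    "C": [4.0, 1.0, 3.0, 2.0],
}
trait = [0, 0, 1, 1]  # assumed coding: 0 = normal, 1 = colorectal cancer

adj = adjacency(expr)
hub = max(expr, key=lambda g: connectivity(adj, g))
print(hub, gene_significance(expr, trait))
```

Genes with both high connectivity inside a module and high trait correlation are the usual candidates for core genes.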
5. The construction method of claim 1, wherein in step (5), the screening is performed using a LASSO regression model, a random forest algorithm, and an SVM-RFE algorithm.
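Claim 5 names three feature-selection methods but does not state how their outputs are combined. One common choice, assumed here and not confirmed by the claim, is to keep only the genes selected by all three. The sketch below shows that intersection step alone; the three selections are hypothetical placeholder lists, and the selection methods themselves are not implemented.

```python
def consensus_genes(*selections):
    """Return the genes present in every selection, sorted for a stable order."""
    if not selections:
        return []
    common = set(selections[0])
    for selection in selections[1:]:
        common &= set(selection)
    return sorted(common)


# Hypothetical outputs of the three selection methods (placeholder gene labels).
lasso_selected = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
forest_selected = ["GENE_B", "GENE_C", "GENE_D", "GENE_E"]
svm_rfe_selected = ["GENE_C", "GENE_D", "GENE_F"]

shared = consensus_genes(lasso_selected, forest_selected, svm_rfe_selected)
print(shared)
```

A looser rule, such as keeping genes chosen by at least two of the three methods, would yield a larger panel at the cost of weaker agreement.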
6. The construction method of claim 1, wherein in step (6), the verifying comprises validating the expression of the colorectal cancer cluster molecules in CRC samples using the limma software package, and calculating the area under the ROC curve (AUC) of the colorectal cancer cluster molecules and its 95% confidence interval using the pROC software package.
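The AUC and a 95% confidence interval described in claim 6 can be sketched without any package. The AUC is computed as the proportion of case-control pairs in which the case has the higher value, with ties counted as one half. The interval uses a percentile bootstrap that resamples cases and controls separately; this is an assumption made for the sketch and may differ from the interval method used with pROC in the original work. The expression values are hypothetical.

```python
import random


def auc(case_scores, control_scores):
    """AUC as the share of case-control pairs where the case scores higher."""
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5  # ties count as half a win
    return wins / (len(case_scores) * len(control_scores))


def bootstrap_ci(case_scores, control_scores, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval, resampling cases and controls separately."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        cases = rng.choices(case_scores, k=len(case_scores))
        controls = rng.choices(control_scores, k=len(control_scores))
        stats.append(auc(cases, controls))
    stats.sort()
    tail = (1.0 - level) / 2.0
    lower = stats[int(tail * (n_boot - 1))]
    upper = stats[int(round((1.0 - tail) * (n_boot - 1)))]
    return lower, upper


# Hypothetical expression values of one cluster molecule.
crc_values = [0.9, 0.8, 0.4, 0.7, 0.6]
normal_values = [0.5, 0.3, 0.2, 0.45, 0.1]

estimate = auc(crc_values, normal_values)
lower, upper = bootstrap_ci(crc_values, normal_values)
print(estimate, lower, upper)
```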
7. The construction method of claim 1, wherein in step (7), the merged dataset is randomly divided at a 1:1 ratio into a CRC training cohort and a validation cohort; a prediction model is built in the training cohort by logistic regression, and the stability of the model is verified by 10-fold cross-validation; the joint diagnosis score is obtained by multiplying the expression values of the colorectal cancer cluster molecules by their regression coefficients; the AUC of the prediction model is calculated in the training cohort, and the accuracy of the prediction model is assessed in the validation cohort.
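The main steps of claim 7 can be sketched as follows: a random 1:1 split, a logistic regression fit, and a joint diagnosis score formed as the sum of each expression value multiplied by its regression coefficient. The fit uses plain batch gradient descent, which is an assumption of this sketch and not a statement about the software used in the original work. The 10-fold cross-validation and AUC steps are left out for brevity, the intercept is excluded from the score because the claim mentions only coefficients, and all sample values are hypothetical.

```python
import math
import random


def split_half(samples, seed=0):
    """Shuffle and split samples 1:1 into training and validation cohorts."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit logistic regression by batch gradient descent.

    Returns (intercept, coefficients).
    """
    n, p = len(X), len(X[0])
    b0, beta = 0.0, [0.0] * p
    for _ in range(epochs):
        g0, grad = 0.0, [0.0] * p
        for row, label in zip(X, y):
            z = b0 + sum(b * v for b, v in zip(beta, row))
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
            err = 1.0 / (1.0 + math.exp(-z)) - label
            g0 += err
            for j in range(p):
                grad[j] += err * row[j]
        b0 -= lr * g0 / n
        beta = [b - lr * g / n for b, g in zip(beta, grad)]
    return b0, beta


def joint_score(beta, row):
    """Joint diagnosis score: sum of expression value times coefficient."""
    return sum(b * v for b, v in zip(beta, row))


# Hypothetical samples: (expression of two cluster molecules, label 1 = CRC).
samples = [
    ([2.0, 1.1], 1), ([1.8, 0.9], 1), ([2.2, 1.3], 1), ([1.9, 1.0], 1),
    ([0.5, 1.0], 0), ([0.7, 1.2], 0), ([0.4, 0.8], 0), ([0.6, 1.1], 0),
]
train, valid = split_half(samples)

# The toy model is fitted on all samples because the cohorts are very small.
intercept, beta = fit_logistic([x for x, _ in samples], [y for _, y in samples])
scores = [(joint_score(beta, x), label) for x, label in samples]
print(beta, scores)
```

In practice the coefficients would be fitted on the training cohort only and the scores then evaluated in the validation cohort.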
8. A cluster molecule-based colorectal cancer prediction model constructed by the construction method of any one of claims 1 to 7.
9. Use of colorectal cancer cluster molecules in the preparation of a reagent for colorectal cancer screening and/or prediction, characterized in that the colorectal cancer cluster molecules comprise QSOX2, TGFBI, CD44, INHBA, S100A11, VEGFA and MET.
10. Use of a cluster molecule-based colorectal cancer prediction model according to claim 8 for the preparation of a reagent for colorectal cancer screening and/or prediction.
CN202211743182.2A 2022-12-30 2022-12-30 Colorectal cancer prediction model based on cluster molecules and application Pending CN116246710A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211743182.2A CN116246710A (en) 2022-12-30 2022-12-30 Colorectal cancer prediction model based on cluster molecules and application

Publications (1)

Publication Number Publication Date
CN116246710A (en) 2023-06-09

Family

ID=86625332

Country Status (1)

Country Link
CN (1) CN116246710A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination