CN116519954B

CN116519954B - Colorectal cancer detection model construction method, colorectal cancer detection model construction system and biomarker

Info

Publication number: CN116519954B
Application number: CN202310770060.0A
Authority: CN
Inventors: 高俊顺; 高俊莉; 李澜庆; 彭小军
Original assignee: Hangzhou Guangke Ander Biotechnology Co ltd
Current assignee: Hangzhou Guangke Ander Biotechnology Co ltd
Priority date: 2023-06-28
Filing date: 2023-06-28
Publication date: 2023-10-27
Anticipated expiration: 2043-06-28
Also published as: CN116519954A

Abstract

The invention provides a colorectal cancer detection model construction method, a colorectal cancer detection model construction system and a biomarker. Proteins that differ significantly between the blood of colorectal cancer patients and that of healthy people are analyzed by a proteomics method, and proteins that can be used, alone or in combination, as biomarkers for early prediction of the risk of colorectal cancer are screened out. The invention further provides products, models, systems, computer-readable storage media and information data processing terminals that comprise the biomarkers and are used for predicting whether an individual has colorectal cancer. These allow colorectal cancer to be predicted conveniently, noninvasively and efficiently, and can meet clinical requirements.

Description

Colorectal cancer detection model construction method, colorectal cancer detection model construction system and biomarker

Technical Field

The present invention relates to the field of medicine, in particular to the use of proteomics to screen biomarkers for colorectal cancer and to the application of these biomarkers in predicting whether an individual has colorectal cancer.

Background

Proteomics is the science of studying the composition, localization, variation and interaction rules of proteins in cells, tissues or organisms, including the study of protein expression patterns and proteome functional patterns. With the development of mass spectrometry, liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS) has become the dominant tool in proteomics research. Proteomics is of great significance in the search for diagnostic markers of disease, the screening of drug targets, toxicology research and the like, and is widely applied in medical research.

Colorectal cancer is one of the most common malignant tumors in clinical practice. About 60% of colorectal cancer patients are older than 65 years, and the incidence of colorectal cancer is increasing year by year under the influence of factors such as population aging and changes in dietary structure. The 2022 national cancer report showed that colorectal cancer ranks second only to lung cancer, with its mortality accounting for 9.5% of that of all cancers, and ranks second among cancers in women.

Notably, early detection is a key factor in reducing colorectal cancer mortality: the 5-year survival rate after radical surgery is about 90% when colorectal cancer is diagnosed as a localized disease, whereas only 5% of patients diagnosed with distant metastasis survive for 5 years. Among the various screening methods for colorectal cancer, the fecal occult blood test (FOBT) is considered the most effective non-invasive method, but it still has limitations that have not yet been overcome. With the development of immunology and molecular biology, tumor-associated protein markers show increasingly important clinical value in the diagnosis and treatment of colorectal cancer and have become indispensable biological indicators for assisting diagnosis, monitoring therapeutic effect and judging prognosis. A number of tumor markers usable for colorectal cancer diagnosis, pathological typing, clinical staging, and judgment of prognosis and therapeutic effect have been found clinically, but the diagnostic efficacy of the commonly used colorectal cancer markers (CEA and CA199) is not ideal, and a specific tumor marker with higher sensitivity and specificity for colorectal cancer diagnosis is still lacking.

Therefore, searching for new colorectal cancer diagnosis related markers and various marker combinations and constructing a colorectal cancer diagnosis prediction model have important clinical value and significance.

Disclosure of Invention

In view of the problems in the prior art, the invention provides a biomarker for colorectal cancer detection, as well as a model, a system, a computer-readable storage medium and an information data processing terminal for predicting whether an individual has colorectal cancer, which allow such prediction to be made conveniently, noninvasively and efficiently and meet clinical requirements.

In particular, in one aspect, the invention provides the use of a biomarker selected from one of ORM2, CD74, FBLN5, RNASE1, ITIH3, SPINK5 and B3GNT2 in the manufacture of a reagent for predicting whether an individual has colorectal cancer.

In the present invention, the research and development team analyzed blood samples from two groups, a healthy group and a colorectal cancer patient group, by TMT (tandem mass tag) labeling quantitative proteomics and ultra-high performance liquid chromatography-tandem mass spectrometry (LC-MS/MS), and used orthogonal partial least squares analysis to identify proteins that differ significantly between the colorectal cancer samples and the control samples. The colorectal cancer-related proteins obtained in this way can be used as biomarkers for efficiently predicting whether an individual has colorectal cancer.

In another aspect, the invention provides the use of a biomarker selected from a combination of at least two of ORM1, ORM2, CD74, FBLN5, RNASE1, ITIH3, SPINK5, B3GNT2, CEA and CA199 in the manufacture of a reagent for predicting whether an individual has colorectal cancer. The research and development team combined ORM1 and the currently common colorectal cancer markers CEA and CA199 with the proteins above and provides biomarkers consisting of at least 2 of these 10 proteins. A colorectal cancer diagnosis model constructed on the basis of these biomarkers has better diagnostic value, so that whether an individual has colorectal cancer can be predicted more accurately.

Further, the research and development team prefers certain combinations of the proteins comprised by the above biomarker: the biomarker is selected from a combination of at least two of ORM2, CD74, FBLN5, RNASE1, ITIH3, SPINK5 and B3GNT2, or from a combination of 1 or more of ORM1, CEA and CA199 with 1 or more of ORM2, CD74, FBLN5, RNASE1, ITIH3, SPINK5 and B3GNT2.

Still further, the biomarker is selected from a combination of at least two of FBLN5, RNASE1, ITIH3, SPINK5 and B3GNT2, or from a combination of any 1 or more of ORM1, ORM2, CD74, CEA and CA199 with any 1 or more of FBLN5, RNASE1, ITIH3, SPINK5 and B3GNT2.

Still further, the biomarker may be selected to comprise RNASE1 and SPINK5; or RNASE1, SPINK5 and B3GNT2; or ORM2, RNASE1, SPINK5 and B3GNT2; or ITIH3, FBLN5, RNASE1, SPINK5 and B3GNT2; or CA199, ORM2, RNASE1, ITIH3, SPINK5 and B3GNT2; or CA199, CEA, ORM1, FBLN5, RNASE1, SPINK5 and B3GNT2; or CA199, CEA, ORM2, CD74, RNASE1, ITIH3, SPINK5 and B3GNT2; or CA199, CEA, ORM1, ORM2, CD74, FBLN5, RNASE1, SPINK5 and B3GNT2; or CA199, CEA, ORM1, ORM2, CD74, FBLN5, RNASE1, ITIH3, SPINK5 and B3GNT2.

It should be noted that, in the present invention, orosomucoid 1 (ORM1) is the protein or amino acid sequence with UniProt database No. P02763, orosomucoid 2 (ORM2) is the protein or amino acid sequence with UniProt database No. P19652, CD74 molecule (CD74) is the protein or amino acid sequence with UniProt database No. P04233, fibulin-5 (FBLN5) is the protein or amino acid sequence with UniProt database No. Q9UBX5, ribonuclease A family member 1 (RNASE1) is the protein or amino acid sequence with UniProt database No. P07998, inter-alpha-trypsin inhibitor heavy chain 3 (ITIH3) is the protein or amino acid sequence with UniProt database No. Q06033, and serine peptidase inhibitor Kazal type 5 (SPINK5) is the protein or amino acid sequence with UniProt database No. Q9NQ38. Beta-1,3-N-acetylglucosaminyltransferase 2 (B3GNT2), carcinoembryonic antigen (CEA) and carbohydrate antigen 19-9 (CA199) likewise refer to the corresponding proteins or amino acid sequences in the UniProt database.

In another aspect, the invention provides a biomarker for predicting whether an individual has colorectal cancer, the biomarker being selected from a combination of at least two of ORM1, ORM2, CD74, FBLN5, RNASE1, ITIH3, SPINK5, B3GNT2, CEA and CA199. Further, the research and development team has selected preferred biomarkers for use in the reagent so as to achieve a better technical effect in predicting whether an individual has colorectal cancer.

In particular, in a further aspect, the invention provides a product for predicting whether an individual has colorectal cancer, the product comprising a kit or chip that comprises a biomarker for the use described above. In some embodiments, the biomarkers useful for predicting whether an individual has colorectal cancer can serve as detection targets for preparing detection reagents, such as sample pretreatment reagents and biological reagents and kits suitable for detecting the biomarkers, for example antigens or antibodies; standardized reagents or kits suitable for the biomarkers can also be developed. In some embodiments, the detection reagent is an antibody to a biomarker as described above, the antibody being a monoclonal antibody. Furthermore, the research and development team has optimized the biomarkers contained in the kit or chip of the product so as to improve the accuracy of product detection.

In another aspect, the invention provides a method of constructing a model for predicting whether an individual has colorectal cancer, the method comprising:

(1) Data acquisition: a model group is set, and the concentrations of the biomarkers in the serum of the model group samples are acquired; wherein the model group comprises colorectal cancer group samples and healthy control samples, and the detected biomarker is selected from a combination of at least two of ORM1, ORM2, CD74, FBLN5, RNASE1, ITIH3, SPINK5, B3GNT2, CEA and CA199;

(2) Model construction, comprising the following steps:

S201, taking the biomarker concentrations of the samples in the model group as the original training data set, dividing the original training data set into K subsets according to a K-fold cross-validation mechanism, selecting one subset as the validation set Ddev, and combining the unselected subsets to form the training data pool Dtrain;

S202, selecting the generalized linear model (glmnet) algorithm for constructing the prediction model, together with the grid search range used in the hyperparameter optimization of the algorithm, and determining the parameters for constructing the prediction model;

S203, based on the training data pool Dtrain obtained in S201, constructing the prediction model with the algorithm and hyperparameters selected in S202.
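As a rough illustration of steps S201 and S202, the following Python sketch splits a sample list into K subsets, takes one as Ddev and merges the rest into Dtrain, and enumerates a hyperparameter grid. It is only a sketch under stated assumptions: the sample identifiers, the value of K, the fold assignment scheme and the grid values are hypothetical, and the hyperparameter names alpha and lambda are those of the glmnet package rather than values stated in the text. Model fitting itself (S203) is not shown.

```python
import itertools
import random


def k_fold_split(samples, k, dev_index, seed=0):
    """Split samples into k subsets; return (d_train, d_dev) as in S201."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    # Deal samples round-robin so subset sizes differ by at most one.
    folds = [shuffled[i::k] for i in range(k)]
    d_dev = folds[dev_index]
    d_train = [s for i, fold in enumerate(folds) if i != dev_index for s in fold]
    return d_train, d_dev


def hyperparameter_grid(alphas, lambdas):
    """Enumerate every (alpha, lambda) pair of the search range, as in S202."""
    return list(itertools.product(alphas, lambdas))


# Hypothetical model group: 50 colorectal cancer (CRC) and 50 healthy control (HC) samples.
samples = [("CRC", i) for i in range(50)] + [("HC", i) for i in range(50)]
d_train, d_dev = k_fold_split(samples, k=5, dev_index=0)

# Hypothetical grid; the actual search range is not given in the text.
grid = hyperparameter_grid(alphas=[0.0, 0.5, 1.0], lambdas=[0.001, 0.01, 0.1])
print(len(d_train), len(d_dev), len(grid))
```

In a full pipeline, each grid point would be used to fit a model on Dtrain and the fitted model would be scored on Ddev.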

Furthermore, the research and development team has optimized the composition of the biomarker used in the model construction method; evaluating the model on the validation set Ddev yields an AUC value that can be used as the final performance evaluation value of the model.

In another aspect, the invention provides a system for predicting whether an individual has colorectal cancer, the system comprising:

a data acquisition module, used for acquiring the concentrations of the biomarkers in the serum of the model group samples, wherein the detected biomarker is selected from a combination of at least two of ORM1, ORM2, CD74, FBLN5, RNASE1, ITIH3, SPINK5, B3GNT2, CEA and CA199;

a model construction module, which constructs the model by the following steps:

S001, taking the biomarker concentrations of the samples in the model group as the original training data set, dividing the original training data set into K subsets according to a K-fold cross-validation mechanism, selecting one subset as the validation set Ddev, and combining the unselected subsets to form the training data pool Dtrain;

S002, selecting the generalized linear model (glmnet) algorithm for constructing the prediction model, together with the grid search range used in the hyperparameter optimization of the algorithm, and determining the parameters for constructing the prediction model;

S003, based on the training data pool Dtrain obtained in step S001, constructing the prediction model with the algorithm and hyperparameters selected in step S002;

a prediction module, which predicts the individual using the model constructed by the model construction module.

In another aspect, the present invention discloses a computer-readable storage medium having a computer program stored thereon; the computer program, when executed by a processor, implements the above-described method of constructing a model for predicting whether an individual has colorectal cancer.

Optionally, the storage medium includes various media that can store program code, such as ROM, RAM, a magnetic disk or an optical disk.

In another aspect, the invention discloses an information data processing terminal for implementing the above-described method of constructing a model for predicting whether an individual has colorectal cancer.

Optionally, the information data processing terminal includes a processor and a memory; the memory may include RAM and may also include non-volatile memory (NVRAM), such as at least one disk memory. The processor may be a general-purpose processor, including a CPU, a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

Optionally, the information data processing terminal includes a processor, a memory, and a communicator.

The invention uses proteomics to screen out the biomarkers ORM2, CD74, FBLN5, RNASE1, ITIH3, SPINK5 and B3GNT2, each of which can be used independently for early prediction of the risk of colorectal cancer, and proposes a biomarker selected from a combination of at least two of ORM1, ORM2, CD74, FBLN5, RNASE1, ITIH3, SPINK5, B3GNT2, CEA and CA199 for predicting whether an individual has colorectal cancer. It also provides a product, a model and a system comprising the biomarker, as well as a computer-readable storage medium and an information data processing terminal, for predicting whether an individual has colorectal cancer, so that such prediction can be made conveniently, noninvasively and efficiently in clinical practice.

Drawings

FIG. 1 is a graph of Wilcoxon results for the healthy control and colorectal cancer groups of example 1;

FIG. 2 is a graph of the results of ROC and OPLS-DA analyses of the healthy control and colorectal cancer groups of example 1;

FIG. 3 is a graph of AUC results of models constructed under different hyperparameter combinations of the glmnet algorithm in example 3;

FIG. 4 is a ROC curve of the colorectal cancer joint diagnosis model constructed in example 3 in the model group;

FIG. 5 is a ROC curve of the colorectal cancer combined diagnostic model constructed in example 3 in the test group;

FIG. 6 is a graph showing the results of performance evaluation of the colorectal cancer joint diagnosis model constructed in example 3 in a test group;

FIG. 7 is a comparison of the area under the ROC curve of diagnostic models constructed from different protein combination biomarkers in example 4;

FIG. 8 is a comparison of the area under the ROC curve of the colorectal cancer diagnostic model (10 MP) of example 4 with conventional markers and combinations thereof;

FIG. 9 is the system for predicting whether an individual has colorectal cancer shown in example 5.

Note that the "Log-transformed corrected P value" shown in the drawings denotes the -log10 of the adjusted P value; the "generalized linear model hyper-parameters" shown denote the hyperparameters of the glmnet model.

Detailed Description

(1) Diagnosis or detection

Diagnosis or detection herein refers to detecting or assaying a biomarker in a sample, or the level of the biomarker of interest (such as its absolute or relative content), and then indicating, by the presence or amount of the biomarker of interest, whether the individual providing the sample may have a disease or how likely the individual is to have it. Diagnosis and detection are used interchangeably herein. The result of such detection or diagnosis is not a direct confirmation of disease but an intermediate result; to obtain a direct result, the disease must additionally be confirmed by other auxiliary means such as pathology or anatomy. For example, the present invention provides a number of novel biomarkers that are relevant to colorectal cancer, and changes in the levels of these markers are directly related to whether or not colorectal cancer is present.

(2) Association of markers or biomarkers with colorectal cancer

Markers and biomarkers have the same meaning in the present invention. Association here means that the presence of a biomarker in a sample, or a change in its amount, is directly correlated with a particular disease; for example, a relative increase or decrease in the amount indicates a higher likelihood of the disease than in a healthy person.

If multiple different markers are simultaneously present in the sample or change in relative amount, this indicates a higher likelihood of the disease than in healthy persons. That is, among the marker species, some markers are strongly associated with a disease, some are weakly associated, and some are not associated with a particular disease at all. One or more strongly associated markers can be used as markers for diagnosing the disease, and weakly associated markers can be combined with strongly associated markers to diagnose a disease, thereby improving the accuracy of the detection result.

The numerous biomarkers found in serum in the present invention can be used to distinguish colorectal cancer patients from healthy persons. The markers herein may be used alone as individual markers for direct detection or diagnosis; selecting such a marker indicates that a relative change in its content is strongly correlated with colorectal cancer. Of course, one or more markers strongly associated with colorectal cancer may be selected for simultaneous detection. In some embodiments, detection or diagnosis with highly correlated biomarkers may reach a certain standard of accuracy, such as 60%, 65%, 70%, 80%, 85%, 90% or 95%; these markers yield intermediate values for diagnosing a disease but do not directly confirm it.

Of course, a differential protein with a larger ROC value may also be selected as a diagnostic marker. So-called strong and weak associations are typically confirmed by some algorithm, such as analysis of the contribution rate or weight of a marker with respect to colorectal cancer. Such calculation methods may be significance analysis (p values or FDR values) and fold change; multivariate statistical analysis, which mainly comprises principal component analysis (PCA), partial least squares discriminant analysis (PLS-DA) and orthogonal partial least squares discriminant analysis (OPLS-DA); or other methods such as ROC analysis. Other model prediction methods are also possible. When specifically selecting biomarkers, the differential proteins disclosed herein may be selected, or prediction may be made by model methods in combination with other known markers.

Detailed Description

The invention is described in further detail below with reference to the drawings and examples. The examples described below are intended to facilitate understanding of the invention and do not limit it in any way. The reagents used in the examples are all known, commercially available products.

Example 1 screening of biomarkers for colorectal cancer Using proteomics

1.1 Collection of samples

The study group collected samples from 50 colorectal cancer patients and 50 healthy controls between July 2021 and December 2021, and all enrolled subjects signed informed consent. All colorectal cancer patients were confirmed by biopsy pathology, and the healthy controls had normal physical examinations. Inclusion criteria for colorectal cancer patients: (a) no history of other malignancy; (b) surgical treatment within one month after blood collection, with colorectal cancer confirmed by postoperative pathology. Healthy persons in the control group were selected from the physical examination center; gastrointestinal examination confirmed that they had no gastrointestinal lesions, physical examination showed no other serious diseases, and their age and sex were matched to the cases. After informed consent, all collected serum samples were stored in a serum bank at -80 °C.

1.2 Sample processing and enzymolysis

First, the plasma samples were centrifuged for 15 minutes (15,000 × g), and the supernatant was collected and filtered, followed by immunoaffinity chromatography to remove 14 high-abundance proteins. The samples were then concentrated by centrifugation (4,000 × g, 1 hour) in a concentration tube with a molecular weight cut-off of 3 kDa. The concentrate was recovered and subjected to buffer exchange by centrifugation (1,000 × g, 2 minutes) on a desalting column with a molecular weight cut-off of 7 kDa, the exchange buffer being AEX-A (20 mM Tris, 4 M urea, 3% isopropanol, pH 8.0). Protein concentration in the samples was determined by the BCA method with AEX-A as the blank. According to the sample grouping in Table 1, TCEP was added to the samples and protein reduction was performed by incubation at 37 °C for 30 minutes. The corresponding 6-plex TMT reagent was then added and the samples were incubated at room temperature for 1 hour in the dark for TMT labeling. The samples were then buffer-exchanged on a Zeba column, the exchange buffer being AEX-A. After the 6-plex TMT-labeled samples were mixed, 2 mL of AEX-A was added to the mixed sample to a final volume of 5.5 mL. The sample was filtered through a 0.22 μm filter, and the 6-plex TMT-labeled sample was separated on a 2D-HPLC system. The collected fractions were freeze-dried; finally, a Trypsin/Lys-C enzyme mix was added, the samples were incubated at 37 °C for 5 hours for enzymolysis, and 5 μL of 10% TFA was added to terminate the enzymolysis reaction. A total of 60 digested 2D-HPLC fractions were used for nano-LC-MS/MS analysis.

Table 1: proteomics study sample grouping

1.3 LC-MS/MS data acquisition and search analysis

The LC-MS/MS system consisted of an EASY-nLC 1200 and a Q Exactive HF-X. Mobile phase A was an aqueous solution containing 0.1% formic acid and 2% acetonitrile; mobile phase B was an aqueous solution containing 0.1% formic acid and 80% acetonitrile. The homemade analytical column was 20 cm long and packed with ReproSil-Pur C18 1.9 μm particles from Dr. Maisch GmbH. 1 μg of peptides was dissolved in mobile phase A and separated on the EASY-nLC 1200 ultra-high performance liquid chromatography system. The gradient was set as follows: 0-26 min, 7%-22% B; 26-34 min, 22%-32% B; 34-37 min, 32%-80% B; 37-40 min, 80% B, with the flow rate maintained at 450 nL/min.
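To make the gradient program easier to check, the sketch below encodes the four segments from the text and computes the percentage of mobile phase B at a given time, together with the total solvent volume delivered at 450 nL/min. It assumes that %B changes linearly within each segment, which the text does not state explicitly; the names GRADIENT and percent_b are hypothetical.

```python
# (start_min, end_min, start_percent_b, end_percent_b), taken from the gradient in the text.
GRADIENT = [
    (0.0, 26.0, 7.0, 22.0),
    (26.0, 34.0, 22.0, 32.0),
    (34.0, 37.0, 32.0, 80.0),
    (37.0, 40.0, 80.0, 80.0),
]
FLOW_NL_PER_MIN = 450.0


def percent_b(t_min):
    """Return %B at time t_min, assuming linear ramps within each segment."""
    for start, end, b0, b1 in GRADIENT:
        if start <= t_min <= end:
            return b0 + (b1 - b0) * (t_min - start) / (end - start)
    raise ValueError("time outside the 0-40 min gradient")


# Total volume delivered over the whole run, in nanolitres.
total_volume_nl = FLOW_NL_PER_MIN * GRADIENT[-1][1]

print(percent_b(13.0), percent_b(35.5), total_volume_nl)
```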

The peptides separated by the liquid chromatography system were injected into a NanoFlex ion source for ionization and then entered the Q Exactive HF-X for mass spectrometry. The ion source voltage was set to 2.1 kV; the MS1 scan range was set to 400-1200 m/z with a resolution of 60,000 (MS resolution); the MS2 scan range started at 100 m/z with a resolution of 15,000 (MS2 resolution). In data-dependent acquisition (DDA) mode, the top 20 precursor ions sequentially entered the HCD collision cell for fragmentation and were then analyzed by MS2 in turn. The automatic gain control (AGC) was set to 5E4, the signal threshold to 1E4, and the maximum injection time to 22 ms. To avoid repeated scanning of high-abundance peptides, the dynamic exclusion time for tandem mass spectrometry was set to 30 seconds.

Mass spectrometry data obtained by LC-MS/MS were searched using MaxQuant (v1.6.15.0). The data type was TMT proteomics data based on MS2 reporter ion quantification, and MS2 spectra used for quantification were required to have a precursor ion fraction greater than 75% in the MS1 spectrum. The database was the UniProt database homo_sapiens_9606_protein (release: 2021-10-14, sequences: 20614); a common contaminant library was added to the database, and contaminant proteins were removed during data analysis. The digestion mode was set to Trypsin/P; the number of missed cleavage sites was set to 2; the precursor ion mass error tolerance was set to 20 ppm for the first search and 5 ppm for the main search, and the mass error tolerance of fragment ions was set to 20 ppm. The fixed modification was cysteine alkylation; the variable modifications were methionine oxidation and protein N-terminal acetylation. FDR was set to 1% for both protein identification and PSM identification.

1.4 Grouping samples by orthogonal partial least squares discriminant analysis and screening differential proteins in combination with significance analysis

Differential proteins were screened by combining univariate analysis and multivariate statistical analysis. Univariate analysis mainly comprises significance analysis (p values or FDR values) and fold change of the characteristic ions in the different groups; multivariate statistical analysis mainly comprises principal component analysis (PCA), partial least squares discriminant analysis (PLS-DA) and orthogonal partial least squares discriminant analysis (OPLS-DA).

In total we found 581 proteins, including some newly discovered markers associated with colorectal cancer and some markers already known and confirmed to be associated with colorectal cancer (e.g., carcinoembryonic antigen (CEA), carbohydrate antigen 19-9 (CA199), etc.).

For the 581 protein substances, protein substances with obvious content difference are obtained through analysis. All statistical analyses were performed using R, and specific R-related information is shown in table 2.

Table 2: R and related information used in the present invention

The variable importance for the projection (VIP) was calculated to measure how strongly the expression pattern of each protein influences and explains the classification and discrimination of the sample groups, and a Wilcoxon rank sum test was further performed to obtain corrected p values (FDR). The Wilcoxon test showed that, among the 581 proteins, the content of 90 proteins was significantly reduced in the serum of colorectal cancer patients and the content of 53 proteins was significantly increased (see FIG. 1 for details).

The results of the ROC and OPLS-DA analyses are shown in FIG. 2: the abscissa is the AUC obtained by ROC analysis, the ordinate is the VIP value obtained by OPLS-DA analysis, the size of each dot represents the p value calculated by the Wilcoxon test, and the color of each dot represents the significance evaluation of the VIP value.
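The text does not say which correction produced the FDR values. As an illustration only, the sketch below assumes the Benjamini-Hochberg procedure, a common choice for FDR correction, and also shows the -log10 transform of an adjusted p value. The raw p values are invented for the example and are not data from the study.

```python
import math


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def neg_log10(p):
    """-log10 transform used to display adjusted p values."""
    return -math.log10(p)


# Hypothetical raw p values for five proteins.
raw = [0.001, 0.008, 0.039, 0.041, 0.6]
adj = benjamini_hochberg(raw)
print(adj, [round(neg_log10(p), 3) for p in adj])
```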

The screening criteria for differential proteins were: (1) when FC > 1.2 and adj.P.Val < 0.01, the protein is a significantly down-regulated differential protein; (2) when FC < 0.83 and adj.P.Val < 0.01, the protein is a significantly up-regulated differential protein. Based on these criteria, a total of 8 more significant differential proteins were found as biomarkers: orosomucoid 1 (ORM1), orosomucoid 2 (ORM2), CD74 molecule (CD74), fibulin-5 (FBLN5), ribonuclease A family member 1 (RNASE1), inter-alpha-trypsin inhibitor heavy chain 3 (ITIH3), serine peptidase inhibitor Kazal type 5 (SPINK5) and beta-1,3-N-acetylglucosaminyltransferase 2 (B3GNT2).

The 8 differential proteins found in the present invention, which are predominantly significantly up-regulated, are shown in Table 3:

Table 3: Up-regulated markers differing between colorectal cancer and normal healthy samples

In Table 3, a larger LogFC value and/or a smaller adj.P.Val value indicates to some extent that the difference between the two groups is more pronounced, and also that the differential protein may have a higher diagnostic value.

As can be verified from Table 3, among the 1256 substances differing between the serum of colorectal cancer patients and that of normal healthy persons, 8 differential proteins differed more significantly between the colorectal cancer group and the non-colorectal cancer group and were used as markers for efficient prediction of colorectal cancer: orosomucoid 1 (ORM1), orosomucoid 2 (ORM2), CD74 molecule (CD74), fibulin-5 (FBLN5), ribonuclease A family member 1 (RNASE1), inter-alpha-trypsin inhibitor heavy chain 3 (ITIH3), serine peptidase inhibitor Kazal type 5 (SPINK5) and beta-1,3-N-acetylglucosaminyltransferase 2 (B3GNT2). Among these, the protein differing most significantly between colorectal cancer and health was orosomucoid 2 (ORM2), followed by CD74 molecule (CD74), fibulin-5 (FBLN5), serine peptidase inhibitor Kazal type 5 (SPINK5), orosomucoid 1 (ORM1), ribonuclease A family member 1 (RNASE1), beta-1,3-N-acetylglucosaminyltransferase 2 (B3GNT2) and inter-alpha-trypsin inhibitor heavy chain 3 (ITIH3).
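The two thresholds can be applied mechanically, as in the minimal sketch below. It follows the direction of regulation exactly as stated in the criteria (FC > 1.2 labeled down-regulated, FC < 0.83 labeled up-regulated); how the FC ratio is oriented is not defined in the text, and the function name and example values are hypothetical. Note that 0.83 is approximately 1/1.2, so the two cut-offs are symmetric on a ratio scale.

```python
FC_DOWN_CUTOFF = 1.2    # FC above this: down-regulated, as stated in the criteria
FC_UP_CUTOFF = 0.83     # FC below this: up-regulated, as stated in the criteria
ADJ_P_CUTOFF = 0.01     # adj.P.Val must be below this in both cases


def classify_protein(fc, adj_p):
    """Label a protein according to the fold-change and adjusted p value criteria."""
    if adj_p >= ADJ_P_CUTOFF:
        return "not significant"
    if fc > FC_DOWN_CUTOFF:
        return "down-regulated"
    if fc < FC_UP_CUTOFF:
        return "up-regulated"
    return "not significant"


# Hypothetical (FC, adj.P.Val) pairs, not values from Table 3.
examples = {"protein_A": (1.5, 0.001), "protein_B": (0.5, 0.001), "protein_C": (1.0, 0.001)}
labels = {name: classify_protein(fc, p) for name, (fc, p) in examples.items()}
print(labels)
```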

Example 2: single biomarker prediction of colorectal cancer

This example demonstrates the possibility of using a single biomarker screened in example 1 to distinguish colorectal cancer from non-colorectal cancer, to screen colorectal cancer patients from a population, or to predict whether an individual is a colorectal cancer patient.

Specifically, in this example, ROC curves were established for each of the 8 proteins obtained in example 1, and the results are shown in Table 4. In this example, the quality of the result is judged by the area under the curve (AUC). Specifically, an AUC of 0.5 indicates that a single protein has no diagnostic value; an AUC greater than 0.5 indicates that the single protein has diagnostic value; and the greater the AUC, the higher the diagnostic value of the single protein.
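For a single marker, the AUC can be read as the probability that a randomly chosen patient has a higher marker value than a randomly chosen control, with ties counted as one half. The sketch below computes it in that pairwise form; it assumes that higher values indicate disease, and the concentrations are invented for illustration rather than taken from Table 4.

```python
def roc_auc(case_values, control_values):
    """AUC as P(case > control) + 0.5 * P(case == control)."""
    wins = 0.0
    for c in case_values:
        for h in control_values:
            if c > h:
                wins += 1.0
            elif c == h:
                wins += 0.5
    return wins / (len(case_values) * len(control_values))


# Hypothetical marker concentrations (arbitrary units).
cases = [3.1, 2.4, 2.9, 1.8]
controls = [1.2, 2.0, 1.5, 2.5]
auc = roc_auc(cases, controls)
print(auc)
```

When the two groups have identical value distributions, this function returns 0.5, matching the "no diagnostic value" case described above.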

Table 4: ROC analysis of differential protein ROC values of colorectal cancer and normal healthy samples and related information

It is noted that the correlation of the concentration change of the biomarker with the presence or absence of colorectal cancer can be distinguished by the AUC values, sensitivity, specificity, etc. in the table, wherein the AUC values are most intuitive and obvious. The higher the AUC value, the more accurate the biomarker is to distinguish between colorectal cancer and non-colorectal cancer populations.

Table 4 confirms that the concentration changes of the above 8 biomarkers are clearly associated with the presence of colorectal cancer. Used alone, any one of these biomarkers distinguishes the colorectal cancer population from the non-colorectal cancer population with an AUC above 0.7, which indicates high accuracy. CD74 molecule (CD74) shows the strongest association, with an AUC of 0.838; the second-ranked marker reaches an AUC of 0.816; ribonuclease A family member 1 (RNASE1) reaches an AUC of 0.801; these are followed, in order, by beta-1,3-N-acetylglucosaminyltransferase 2 (B3GNT2), inter-alpha-trypsin inhibitor heavy chain 3 (ITIH3), serine peptidase inhibitor Kazal type 5 (SPINK5), orosomucoid 2 (ORM2), and orosomucoid 1 (ORM1).

Example 3: 10-protein combination biomarker for predicting whether an individual has colorectal cancer

The colorectal cancer differential biomarkers can be used individually as candidate biomarkers for differentiating colorectal cancer from health, and one or a combination of several of them can also be selected for the auxiliary diagnosis of colorectal cancer. In general, a single biomarker can distinguish serum samples of colorectal cancer from those of non-colorectal cancer, or predict colorectal cancer, and the accuracy of the distinction or prediction is greater when multiple biomarkers are combined.

Note that a single biomarker that predicts colorectal cancer with greater accuracy does not necessarily play a greater role when combined with other biomarker(s); furthermore, using more biomarkers does not necessarily give a combination with higher predictive accuracy (AUC value). Therefore, to obtain a combined biomarker with better prediction accuracy, the research team of the invention performed a large number of verification experiments.

This example describes a model for predicting colorectal cancer constructed from 10 protein markers (10MP): the 8 protein markers orosomucoid 1 (ORM1), orosomucoid 2 (ORM2), CD74 molecule (CD74), fibulin 5 (FBLN5), ribonuclease A family member 1 (RNASE1), inter-alpha-trypsin inhibitor heavy chain 3 (ITIH3), serine peptidase inhibitor Kazal type 5 (SPINK5), and beta-1,3-N-acetylglucosaminyltransferase 2 (B3GNT2), together with the 2 conventional markers carcinoembryonic antigen (CEA) and carbohydrate antigen 19-9 (CA199).

Acquiring data

Study population: 300 colorectal cancer patients and 650 healthy controls were enrolled from July 2021 to December 2021, and all enrolled subjects signed informed consent. All colorectal cancer patients were confirmed by pathological examination of biopsy tissue; healthy controls were people with normal physical examination results (with or without nodules, and without colorectal cancer). Enrolled subjects were divided in a 7:3 ratio into a model group (colorectal cancer n=210, healthy control n=450) and a test group (colorectal cancer n=90, healthy control n=200). The data are summarized in Table 5:

Table 5: modeling sample information

Inclusion criteria for colorectal cancer patients: (a) no history of other malignancy; (b) surgical treatment within one month after blood collection, with colorectal cancer confirmed by postoperative pathology. Healthy controls were selected from the physical examination center; gastrointestinal examination confirmed that they had no gastrointestinal lesions, physical examination found no other serious disease, and they were matched to the cases by age and sex. After informed consent, all collected serum samples were stored in a serum bank at -80 °C.

In this example, ELISA was performed on the collected serum samples to obtain the concentrations of the 10 protein markers: orosomucoid 1 (ORM1), orosomucoid 2 (ORM2), CD74 molecule (CD74), fibulin 5 (FBLN5), ribonuclease A family member 1 (RNASE1), inter-alpha-trypsin inhibitor heavy chain 3 (ITIH3), serine peptidase inhibitor Kazal type 5 (SPINK5), beta-1,3-N-acetylglucosaminyltransferase 2 (B3GNT2), carcinoembryonic antigen (CEA), and carbohydrate antigen 19-9 (CA199).

Statistical analysis of experimental data

The Shapiro-Wilk test was used to assess whether the data were normally distributed, and the non-parametric Wilcoxon test was used to analyze differences in blood marker concentrations between colorectal cancer patients and healthy controls in the model group and the test group, respectively.
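The two-group comparison can be sketched as follows. This is a minimal illustration, not the R routine used in this example: it assumes the two-sample (rank-sum) form of the Wilcoxon test with a normal approximation, a tie correction and no continuity correction, and it omits the Shapiro-Wilk normality check. The function names are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2.0 + 1.0
        for j in range(pos, end + 1):
            ranks[order[j]] = mean_rank
        pos = end + 1
    return ranks


def rank_sum_test(group_a, group_b):
    """Two-sided Wilcoxon rank-sum test; returns (U of group_a, z, p-value)."""
    n_a, n_b = len(group_a), len(group_b)
    n = n_a + n_b
    pooled = list(group_a) + list(group_b)
    ranks = average_ranks(pooled)
    u_a = sum(ranks[:n_a]) - n_a * (n_a + 1) / 2.0
    # Variance of U under the null hypothesis, corrected for ties.
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    var_u = n_a * n_b / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u_a - n_a * n_b / 2.0) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return u_a, z, p_value
```

With the 0.05 significance level stated in this example, a marker would be reported as differing between groups when the returned p-value is below 0.05. For small samples an exact test would be preferable to the normal approximation.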

In the model group, a combined diagnostic model of the 10 colorectal cancer markers was constructed using a combination of machine learning methods. The predicted probability values were used to estimate the area under the receiver operating characteristic (ROC) curve (AUC), with 95% confidence intervals (CI), to assess the discriminatory power of the multivariate diagnostic model.

Using the test set, the Youden index (YI) was calculated to determine the predicted-probability cut-off value for distinguishing colorectal cancer patients from normal controls. In addition, ROC curves of individual markers and of different subgroups were constructed and compared. Standard descriptive statistics, such as frequency, mean, median, positive predictive value (PPV), negative predictive value (NPV), and standard deviation (SD), were calculated to describe the experimental results for the study population. Statistical analysis was performed using R 3.6.1; p-values less than 0.05 were considered statistically significant.
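The cut-off selection by Youden index can be sketched as below. This is a minimal illustration that assumes a sample is called positive when its predicted value is strictly greater than the cut-off, and that candidate cut-offs are the observed predicted values; the function name and the sample scores are hypothetical.

```python
def youden_cutoff(scores, labels):
    """Return (cutoff, J, sensitivity, specificity) for the cut-off that
    maximizes the Youden index J = sensitivity + specificity - 1.
    labels: 1 = colorectal cancer, 0 = control."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s > cutoff)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s <= cutoff)
        sens = tp / n_pos
        spec = tn / n_neg
        j = sens + spec - 1.0
        if best is None or j > best[1]:  # keep the first maximum on ties
            best = (cutoff, j, sens, spec)
    return best


# Hypothetical predicted values: four controls followed by four cases.
scores = [0.1, 0.2, 0.3, 0.7, 0.4, 0.6, 0.8, 0.9]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
print(youden_cutoff(scores, labels))  # (0.3, 0.75, 1.0, 0.75)
```

When several cut-offs share the maximal index, this sketch keeps the lowest one; other tie-breaking rules are equally defensible.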

Construction of colorectal cancer diagnostic model

This example illustrates the construction of a diagnostic model for colorectal cancer using a biomarker (10 MP) comprising a combination of 10 proteins as an example.

S201, the concentration matrix of the 10 protein markers orosomucoid 1 (ORM1), orosomucoid 2 (ORM2), CD74 molecule (CD74), fibulin 5 (FBLN5), ribonuclease A family member 1 (RNASE1), inter-alpha-trypsin inhibitor heavy chain 3 (ITIH3), serine peptidase inhibitor Kazal type 5 (SPINK5), beta-1,3-N-acetylglucosaminyltransferase 2 (B3GNT2), carcinoembryonic antigen (CEA), and carbohydrate antigen 19-9 (CA199) of the samples in the model group was taken as the original training data set.

S201, dividing the original training data set into K subsets according to a K-fold cross-validation mechanism. To ensure that the ratio of majority-class to minority-class samples in each fold is the same as in the original data set, a stratified K-fold cross-validation (Stratified K-Folds cross validation) mechanism is used to divide the data. From the K training data subsets obtained, one subset is selected as the validation set Ddev, and the unselected subsets are combined to form the training data pool Dtrain.

S202, a generalized linear model (glmnet) algorithm is selected for constructing the prediction model, and a grid search range is set for the hyper-parameter optimization of the algorithm. The grid search range for the hyper-parameter optimization of the model in this step is shown in Table 6.
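The stratified split into Ddev and Dtrain can be sketched as follows. This is a minimal illustration under assumed names (`stratified_folds`, `train_dev_splits`); K = 10 is taken from the 10-fold cross-validation mentioned later in this example, and the label vector simply mirrors the model group sizes (210 colorectal cancer, 450 healthy control).

```python
import random


def stratified_folds(labels, k, seed=0):
    """Assign sample indices to k folds so that every fold keeps roughly
    the same class proportions as the full data set."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        for pos, i in enumerate(members):
            folds[pos % k].append(i)  # deal each class out round-robin
    return folds


def train_dev_splits(labels, k, seed=0):
    """Yield (d_train, d_dev) index lists; each fold is the validation
    set exactly once and the remaining folds form the training pool."""
    folds = stratified_folds(labels, k, seed)
    for dev in range(k):
        d_dev = folds[dev]
        d_train = [i for f in range(k) if f != dev for i in folds[f]]
        yield d_train, d_dev


# Label vector with the model group sizes: 1 = colorectal cancer, 0 = control.
labels = [1] * 210 + [0] * 450
for d_train, d_dev in train_dev_splits(labels, k=10):
    cases_in_dev = sum(labels[i] for i in d_dev)
    print(len(d_train), len(d_dev), cases_in_dev)
```

With these sizes every validation fold holds 21 cases and 45 controls, so the 210:450 ratio of the full model group is preserved in each fold.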

Table 6: parameter grid search range of glmnet algorithm

S203, selecting one hyper-parameter combination from the algorithm and hyper-parameter range set in step S202 as the parameters for constructing the prediction model, and constructing the prediction model on the training data pool Dtrain obtained in step S201 using the selected supervised classification algorithm and hyper-parameters.

In addition, the construction step of this embodiment further includes:

S204, evaluating the prediction model obtained in step S203 on the validation set Ddev to obtain an AUC value, and storing the current prediction model and its AUC value in a prediction model pool for later selection of a base prediction model. The evaluation in this step may use the AUC value or another reasonable index of model performance.

S205, determining whether every subset has been used as the validation set. If all subsets have served as the validation set and training is complete, proceed to step S206; if any subset has not yet served as the validation set, return to step S201 and select it as the validation set Ddev. In this way every sample in the original data set is used for validation, which improves the stability of the model and prevents it from overfitting to a particular subset.

S206, taking the mean AUC of all models in the prediction model pool as the final performance evaluation value of the model for the current hyper-parameter combination, and storing the model parameters and this final AUC value in an optimal model pool.

S207, determining whether a prediction model has been constructed for every hyper-parameter combination, that is, whether every algorithm and hyper-parameter combination set in step S202 has been used to construct a prediction model. If all combinations have been used, proceed to step S208; if any combination has not, return to step S203.

S208, selecting a prediction model with the highest AUC value for each algorithm from the optimal model Pool obtained after the iteration of the step S207, and storing the prediction model into a candidate prediction model set M.set for colorectal cancer diagnosis.

S209, selecting a model with the largest AUC value from the model set M.set obtained in the step S208 as a final prediction model for colorectal cancer diagnosis.
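Steps S201 to S209 can be sketched as a cross-validated grid search. This is a minimal sketch and not the glmnet implementation: it assumes a plain proximal-gradient fit of logistic regression with the glmnet-style penalty lambda * ((1 - alpha) / 2 * ||w||^2 + alpha * ||w||_1) as a stand-in, uses synthetic data, and uses a hypothetical 3 x 3 grid in which only alpha = 0.55 and lambda = 0.0551 come from this example. All function names are illustrative.

```python
import math
import random


def fit_logistic_elastic_net(X, y, alpha, lam, lr=0.2, epochs=200):
    """Stand-in for glmnet: proximal gradient descent; returns (w, b)."""
    n, m = len(X), len(X[0])
    w, b = [0.0] * m, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * m, 0.0
        for row, target in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, row))
            p = 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))
            grad_b += (p - target) / n
            for j in range(m):
                grad_w[j] += (p - target) * row[j] / n
        b -= lr * grad_b
        for j in range(m):
            # Gradient step on loss + ridge part, then soft-threshold (lasso part).
            step = w[j] - lr * (grad_w[j] + lam * (1.0 - alpha) * w[j])
            w[j] = math.copysign(max(abs(step) - lr * lam * alpha, 0.0), step)
    return w, b


def auc(scores, labels):
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def stratified_folds(labels, k):
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        members = [i for i, t in enumerate(labels) if t == cls]
        for pos, i in enumerate(members):
            folds[pos % k].append(i)
    return folds


def grid_search(X, y, grid, k):
    folds = stratified_folds(y, k)                      # S201
    mean_auc = {}                                       # "optimal model pool"
    for alpha, lam in grid:                             # S203, S207
        fold_aucs = []                                  # "prediction model pool"
        for dev in range(k):                            # S205
            d_train = [i for f in range(k) if f != dev for i in folds[f]]
            w, b = fit_logistic_elastic_net(
                [X[i] for i in d_train], [y[i] for i in d_train], alpha, lam)
            scores = [b + sum(wj * xj for wj, xj in zip(w, X[i]))
                      for i in folds[dev]]
            fold_aucs.append(auc(scores, [y[i] for i in folds[dev]]))  # S204
        mean_auc[(alpha, lam)] = sum(fold_aucs) / k     # S206
    best = max(mean_auc, key=mean_auc.get)              # S208, S209
    return best, mean_auc


# Synthetic two-feature data and a hypothetical 3 x 3 grid, for illustration.
rng = random.Random(0)
y = [1] * 20 + [0] * 20
X = [[rng.gauss(1.0 if t else 0.0, 1.0), rng.gauss(0.0, 1.0)] for t in y]
grid = [(a, l) for a in (0.10, 0.55, 1.00) for l in (0.005, 0.0551, 0.5)]
best, mean_auc = grid_search(X, y, grid, k=5)
print(best, round(mean_auc[best], 3))
```

Averaging the per-fold AUC before comparing combinations, as in S206, keeps a single lucky fold from deciding the choice of hyper-parameters.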

Colorectal cancer diagnostic model (10 MP) parameter optimization

By performing the model construction steps described above, models were obtained under 9 different hyper-parameter combinations of the glmnet algorithm (FIG. 3), and model performance was assessed by AUC value. As shown in Table 7 and FIG. 3, the AUC reached its maximum of 0.897 when the glmnet hyper-parameters were alpha=0.55 and lambda=0.0551 (AUC was calculated by 10-fold cross-validation during modeling).

Table 7: AUC of model constructed under different hyper-parameter combinations of glmnet algorithm

Therefore, the equation of the model constructed in this embodiment from the 10-protein biomarker combination under the optimal hyper-parameter combination is:

where Y is a predicted value, i denotes the i-th biomarker, m denotes the number of biomarkers (m=10), xi denotes the detection value of the i-th biomarker (μg/mL), ki denotes the coefficient of the i-th biomarker (table 8), and b is a constant 2.28584755043089.

Table 8: coefficients of 10 biomarkers in model

Determination of the diagnostic threshold of the colorectal cancer diagnostic model (10 MP)

The ROC curve was plotted from the predicted values in the model set, and the optimal diagnostic cut-off was set to 0.472 based on the Youden index. That is, when the predicted value of the diagnostic model is ≤ 0.472, the subject is not considered a colorectal cancer patient; when the predicted value is > 0.472, the subject is considered a colorectal cancer patient. The results are shown in FIG. 4: in the model group, the AUC of the model was 0.886, the sensitivity was 90.6%, and the specificity was 83.3%.
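Applying the model and the cut-off to one subject can be sketched as follows. Several points are assumptions: the model equation and the Table 8 coefficients are not reproduced in this text, so the coefficients below are hypothetical placeholders, and whether Y is the linear sum b + sum(ki * xi) itself or a logistic transform of it cannot be read off this text. The sketch therefore exposes both; only the constant b and the cut-off 0.472 are taken from this example.

```python
import math

INTERCEPT = 2.28584755043089  # constant b stated in this example
CUTOFF = 0.472                # model-group cut-off stated in this example

# Hypothetical placeholder coefficients k_i (NOT the Table 8 values).
PLACEHOLDER_COEFFICIENTS = {
    "ORM1": -0.01, "ORM2": -0.02, "CD74": 0.03, "FBLN5": -0.04,
    "RNASE1": 0.05, "ITIH3": -0.06, "SPINK5": 0.07, "B3GNT2": -0.08,
    "CEA": 0.09, "CA199": 0.10,
}


def linear_score(concentrations, coefficients, intercept=INTERCEPT):
    """b + sum_i k_i * x_i over the markers named in `coefficients`."""
    return intercept + sum(
        coefficients[name] * concentrations[name] for name in coefficients)


def logistic(score):
    """Map a linear score to (0, 1); an assumed form of the predicted value."""
    return 1.0 / (1.0 + math.exp(-score))


def classify(predicted_value, cutoff=CUTOFF):
    """True means the subject is considered a colorectal cancer patient."""
    return predicted_value > cutoff


# Hypothetical concentrations (ug/mL) for one subject.
subject = {name: 1.0 for name in PLACEHOLDER_COEFFICIENTS}
y_hat = logistic(linear_score(subject, PLACEHOLDER_COEFFICIENTS))
print(round(y_hat, 3), classify(y_hat))
```

The strict inequality in `classify` follows the rule stated above: a predicted value equal to the cut-off is treated as not colorectal cancer.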

Colorectal cancer diagnostic model (10 MP) validation

The ROC curve was plotted from the predicted values in the test set, as shown in FIG. 5, with an AUC of 0.827, and the optimal diagnostic cut-off was set to 0.465 based on the Youden index. That is, when the predicted value of the diagnostic model is ≤ 0.465, the subject is not considered a colorectal cancer patient; when the predicted value is > 0.465, the subject is considered a colorectal cancer patient. The results are shown in FIG. 6: in the test group, the accuracy of the model was 76.1%, the kappa value was 0.457, the sensitivity was 59.4%, the specificity was 85.0%, the positive predictive value was 67.9%, and the negative predictive value was 79.7%.
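The quantities reported for the test group all derive from a 2 x 2 confusion table, which can be sketched as follows. The counts used here are hypothetical and are not the counts behind FIG. 6; the kappa value is computed as Cohen's kappa, which is assumed to be the statistic meant.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Accuracy, Cohen's kappa, sensitivity, specificity, PPV and NPV
    from the four cells of a 2 x 2 confusion table."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    # Agreement expected by chance from the row and column totals.
    p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    return {
        "accuracy": accuracy,
        "kappa": (accuracy - p_chance) / (1.0 - p_chance),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Hypothetical counts, for illustration only.
metrics = confusion_metrics(tp=40, fp=10, fn=10, tn=40)
for name, value in metrics.items():
    print(name, round(value, 3))
```

Kappa discounts the agreement that the class totals alone would produce, so it is lower than accuracy whenever some agreement is expected by chance.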

Example 4: comparison of colorectal cancer diagnostic models constructed based on biomarkers of different protein combinations

To further analyze the diagnostic value of colorectal cancer diagnostic models constructed based on biomarkers of different protein combinations, diagnostic models constructed based on biomarkers of different protein combinations were compared in the test set in this example. The results are shown in fig. 7 and table 9, with table 10 showing the coefficients of the Max AUC Panel biomarkers in table 9.

Table 9: area under ROC curve comparison of diagnostic model constructed based on different protein combination biomarkers

Table 10: coefficients of biomarkers of Max AUC Panel of diagnostic model constructed from 2MP-10MP biomarkers

In theory, more markers provide more information for disease diagnosis, and modeling is the process of interpreting the role of each marker in the diagnosis. However, the model may misinterpret some markers, which can instead reduce model performance in the test set. It is therefore desirable both to optimize the model parameters so that the markers are interpreted better and to exclude markers that tend to interfere with the model. Finding the optimal combination requires enumerating the possible marker combinations.

As can be verified from Tables 9 and 10 and FIG. 7, as the number of proteins contained in the biomarker increases, the average AUC value of the constructed models increases, but the diagnostic value of any particular model is less predictable. For example, in the Max. column of Table 9, the AUC value of the constructed model first increases and then decreases as the number of proteins in the biomarker increases, whereas the Min., 1st Qu., Median, Mean and 3rd Qu. values also vary with the number of proteins in the biomarker. In addition, Table 9 also shows that, for a given number of proteins in the biomarker, different combinations of proteins give colorectal cancer diagnostic models of different diagnostic value.
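The enumeration of marker combinations can be sketched as follows. It assumes that panels of every size from 2 to 10 are drawn from the 10 markers without further constraint, and that the summary columns (Min., 1st Qu., Median, Mean, 3rd Qu., Max.) are ordinary summary statistics over the panels of one size; the function names are hypothetical and no AUC values from the tables are reproduced.

```python
from itertools import combinations
from math import comb
from statistics import mean, quantiles

MARKERS = ["ORM1", "ORM2", "CD74", "FBLN5", "RNASE1",
           "ITIH3", "SPINK5", "B3GNT2", "CEA", "CA199"]


def panels_of_size(size):
    """All marker panels containing exactly `size` of the 10 markers."""
    return list(combinations(MARKERS, size))


def summarize(aucs):
    """Min / quartiles / mean / max over the AUC values of one panel size."""
    q1, q2, q3 = quantiles(aucs, n=4, method="inclusive")
    return {"min": min(aucs), "q1": q1, "median": q2,
            "mean": mean(aucs), "q3": q3, "max": max(aucs)}


for size in range(2, 11):
    print(f"{size}MP: {len(panels_of_size(size))} candidate panels")
```

In use, each panel would be passed through the model construction steps of Example 3 and `summarize` applied to the resulting AUC values per panel size. The number of panels peaks at 252 for 5 markers, which is why the search is exhaustive but still tractable.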

Furthermore, the performance of the model constructed based on the 10MP biomarker was compared in this example with the traditional markers (CEA and CA 199) and their combinations (2 MP, including CEA and CA 199) in the test group. The results are shown in fig. 8 and table 11:

Table 11: comparison of the area under the ROC curve of the colorectal cancer diagnostic model (10 MP) with those of the traditional markers and their combination

As can be confirmed from FIG. 8 and Table 11, a significance test of the AUC difference shows that the diagnostic value of the colorectal cancer diagnostic model (10 MP) is significantly higher (p < 0.05) than that of either traditional marker or of the traditional marker combination model.

Example 5: system for predicting whether individual is colorectal cancer

This example shows a system for predicting whether an individual is colorectal cancer, as shown in fig. 9, comprising:

a data acquisition module for acquiring the concentration of a biomarker in the serum of a model group sample, wherein the detected biomarker is selected from one of ORM1, ORM2, CD74, FBLN5, RNASE1, ITIH3, SPINK5, B3GNT2, or at least two of ORM1, ORM2, CD74, FBLN5, RNASE1, ITIH3, SPINK5, B3GNT2, CEA, CA 199; wherein the model group comprises colorectal cancer group samples and healthy control samples;

Note that the functions of the modules provided in this embodiment correspond to the method and implementation described in Example 3 and, for brevity, are not described again here.

Those of ordinary skill in the art will appreciate that the division of the system of this embodiment into modules is merely a division by logical function; in an actual implementation the modules may be fully or partially integrated into one physical entity or may be physically separate. The modules may all be implemented as software invoked by a processing element, all as hardware, or partly as software invoked by a processing element and partly as hardware. The processing element here may be an integrated circuit with signal processing capability.

In implementing this embodiment, each step of the above method, or each of the above modules, may be carried out by an integrated logic circuit of hardware in a processor element or by instructions in software form. For example, the above modules may be one or more integrated circuits configured to implement the model construction method of the present invention, such as one or more application-specific integrated circuits, one or more microprocessors, or one or more field programmable gate arrays. For another example, when the above modules are implemented as program code invoked by a processing element, the processing element may be a general-purpose processor, such as a central processing unit or another processor that can invoke the program code. For another example, the modules may be integrated together and implemented in the form of a system-on-chip.

Although the present invention is disclosed above, the present invention is not limited thereto. Various changes and modifications may be made by one skilled in the art without departing from the spirit and scope of the invention, and the scope of the invention should be assessed accordingly to that of the appended claims.

Claims

1. A method of constructing a model for predicting whether an individual is colorectal cancer, the method comprising:

(1) Data acquisition, setting a model group, and acquiring the concentration of a biomarker in serum of a sample of the model group; wherein the model group comprises colorectal cancer group samples and healthy control samples, and the detected biomarkers are the combination of SPINK5 and ORM1, ORM2, CD74, FBLN5, RNASE1, ITIH3, B3GNT2, CEA and CA 199;

(2) The model construction comprises the following steps:

S202, selecting a generalized linear model algorithm for constructing a prediction model and a grid search range in a hyper-parameter optimization process of the algorithm, and determining parameters constructed by the prediction model;

2. The method according to claim 1, further comprising S204 of calculating AUC values of the model as final performance evaluation values of the model by ROC method in the validation set Ddev according to the prediction model obtained in S203.

3. The method of constructing a model for predicting whether an individual is colorectal cancer as claimed in claim 2, wherein the equation for constructing the model based on the hyper-parametric combination is:

wherein Y is a predicted value, i represents an i-th biomarker, m represents the number of proteins combined in the biomarker, xi represents a detected value of the i-th protein contained in the biomarker, ki represents a coefficient of the i-th biomarker, and b is a constant.

4. A system for predicting whether an individual is colorectal cancer, the system comprising:

a data acquisition module: obtaining the concentration of a biomarker in serum of a sample of a model group, wherein the biomarker to be detected is a combination of SPINK5 and ORM1, ORM2, CD74, FBLN5, RNASE1, ITIH3, B3GNT2, CEA and CA 199; wherein the model group comprises colorectal cancer group samples and healthy control samples;

S002, selecting a generalized linear model algorithm for constructing a prediction model and a grid search range in a hyper-parameter optimization process of the algorithm, and determining parameters constructed by the prediction model;

S003, constructing a prediction model by adopting the algorithm and the hyper-parameters selected in the S002 based on the training data pool Dtrain obtained in the S001;

5. The system for predicting whether an individual is colorectal cancer of claim 4, further comprising S004, calculating AUC values at the validation set Ddev using ROC method as final performance assessment value of the model according to the prediction model obtained in S003.

6. The system for predicting whether an individual is colorectal cancer of claim 5, wherein the equation for constructing the model based on the hyper-parametric combination is:

7. A computer readable storage medium having a computer program stored thereon; the computer program, when executed by a processor, implements a method of constructing a model of any one of claims 1-3 for predicting whether an individual is colorectal cancer.

8. An information data processing terminal, characterized by implementing a method of constructing a model for predicting whether an individual is colorectal cancer according to any one of claims 1 to 3.

9. Use of a biomarker in the preparation of a reagent for predicting whether an individual is colorectal cancer, characterized in that the biomarker is a combination of SPINK5 with ORM1, ORM2, CD74, FBLN5, RNASE1, ITIH3, B3GNT2, CEA, CA 199.

10. The use according to claim 9, wherein the reagent is for detecting a biomarker in a body fluid sample.

11. The use of claim 10, wherein the detection of a marker in a body fluid sample is detection of the presence or relative abundance or concentration of a biomarker in a body fluid sample of an individual.

12. A product for predicting whether an individual is colorectal cancer comprising a kit or chip comprising reagents for detecting a biomarker, wherein the biomarker is a combination of SPINK5 and ORM1, ORM2, CD74, FBLN5, RNASE1, ITIH3, B3GNT2, CEA, CA 199.