CN116735889A - Protein marker for early colorectal cancer screening, kit and application

Info

Publication number: CN116735889A
Application number: CN202310049892.3A
Authority: CN
Inventors: 廖鲁剑
Original assignee: Hangzhou Durbrain Medical Inspection Laboratory Co ltd
Current assignee: Hangzhou Durbrain Medical Inspection Laboratory Co ltd
Priority date: 2023-02-01
Filing date: 2023-02-01
Publication date: 2023-09-12
Anticipated expiration: 2043-02-01

Abstract

The application discloses a protein marker combination for the prediction, diagnosis or prognosis of colorectal cancer, and belongs to the technical field of cancer proteomics detection. The protein marker combination includes at least one selected from LRG1, SERPINA1, ITIH3, CP, ORM1, C9, IGFBP2, and CNDP1. The application also provides uses of, and a system based on, the protein marker combination. The protein marker combination provides a non-invasive, plasma-based screening means for predicting early colorectal cancer and even precancerous lesions. The method and the system of the application, when used for predicting, diagnosing or prognosing colorectal cancer, cause no trauma to patients, allow convenient sample collection, require only a small plasma volume, and offer high sensitivity and specificity; most importantly, they fill the gap left by the absence of an effective protein marker for early colorectal cancer.

Description

Protein marker for early colorectal cancer screening, kit and application

Technical Field

The application belongs to the technical field of cancer proteomics detection, and particularly relates to a protein marker for early colorectal cancer screening, a kit and application.

Background

Colorectal cancer is one of the five leading causes of cancer death worldwide. In the United States, colorectal cancer ranks third in incidence and second in mortality. Similarly, in China colorectal cancer is a highly malignant tumor that severely affects public health, and its morbidity and mortality both rank among the top three of all malignant tumors. The main reason for the low survival rate of colorectal cancer patients is the lack of effective means of diagnosing the disease at an early stage. Clinical practice has shown that patients who undergo surgery in the early stages of tumorigenesis (stage I or IIa) have a five-year survival rate of 90%, whereas patients who undergo surgery in the late stages (stage III and IV) have a five-year survival rate of less than 10%. Colorectal cancer often takes 10-15 years to evolve from a precancerous lesion to a disseminated metastatic malignancy, so diagnosing the cancer before it spreads and metastasizes is of great importance for improving patient survival.

The main means of colorectal cancer screening currently used in the clinic include colonoscopy, imaging examination, fecal occult blood testing, DNA detection, and detection of protein markers such as CEA. These conventional techniques are invasive or cause radiation damage and, more importantly, have low sensitivity, so they are difficult to use for early screening of large at-risk populations; in addition, the general population has low tolerance and acceptance of colonoscopy. The only non-invasive detection means applied in the clinic is chemical and immunological detection of fecal occult blood, but its sensitivity for colorectal cancer is only 61-79% at a specificity of 86-95%, and, although it is widely used clinically, its detection rate for early colorectal cancer does not meet clinical requirements.

In recent years, liquid biopsy technology has developed rapidly and has, to a certain extent, addressed the low sensitivity of traditional detection techniques. Examples include detection of methylated Septin9 in plasma (Epi proColon) and Cologuard, an early colorectal cancer screening product that combines detection of BMP3/NDRG4 methylation and KRAS gene mutation in feces with FIT. These novel non-invasive screening technologies have opened a new era in the early diagnosis of colorectal cancer. However, there is still considerable room for improvement in their sensitivity and specificity. For example, the Epi proColon assay has 97.5% specificity but only 79% sensitivity, which can lead to a large proportion of missed diagnoses. Cologuard can reach a sensitivity of 95.55%, but its specificity drops to 87.1%. Improving sensitivity and specificity at the same time would improve detection accuracy and reduce the probability of missed diagnosis and misdiagnosis as far as possible. In addition, protein markers such as CEA have even more limited sensitivity and specificity.

In recent years, proteomics based on high-resolution mass spectrometers has greatly improved detection accuracy and speed, and has gradually become suitable for analyzing proteome expression levels in large-scale clinical samples. After years of practice, it is widely recognized in the field that early cancer screening strategies with high sensitivity and high specificity require a shift from single protein markers to combined markers. At present, no early screening or diagnostic kit for colorectal cancer based on protein markers is available in the clinic.

Disclosure of Invention

In order to solve at least one of the technical problems, the application adopts the following technical scheme:

The first aspect of the present application provides a protein marker combination for colorectal cancer prediction, diagnosis or prognosis, comprising at least one selected from LRG1, SERPINA1, ITIH3, CP, ORM1, C9, IGFBP2, CNDP1.

ITIH3: heavy chain H3 of the inter-alpha-trypsin inhibitor; the complex can stabilize the extracellular matrix through its ability to bind hyaluronic acid. Polymorphisms of this gene may be associated with increased risk of schizophrenia and major depression.

LRG1: belongs to the family of leucine-rich repeats, and plays an important role in protein-protein interactions, signal transduction, intercellular adhesion and development processes.

C9: this protein is the last component of the complement system and is involved in the formation of the Membrane Attack Complex (MAC). Membrane attack complexes play a key role in innate and adaptive immune responses.

IGFBP2: the protein binds insulin-like growth factors I and II (IGF-I and IGF-II). It can be secreted into the blood, where it binds IGF-I and IGF-II, or it can remain in the cell and interact with various ligands. High expression of IGFBP2 may promote the growth of a variety of tumors and may be indicative of patient prognosis.

CNDP1: the protein is a member of the M20 metalloprotease family and is specifically expressed in the brain; the coding region of the gene contains a trinucleotide (CTG) repeat sequence.

SERPINA1: the protein is a serine protease inhibitor belonging to the serpin superfamily, and its targets include elastase, plasmin, thrombin, trypsin, chymotrypsin and plasminogen activator. The protein is produced in the liver and bone marrow, by lymphocytes and monocytes in lymphoid tissues, and by Paneth cells of the gut. Defects in this gene are known to be associated with chronic obstructive pulmonary disease, emphysema and chronic liver disease.

CP: the protein is a metalloprotein that binds most of the copper in plasma and is involved in the peroxidation of Fe(II) transferrin to Fe(III) transferrin. Mutations in this gene cause aceruloplasminemia, which leads to iron accumulation and tissue damage and is associated with diabetes and neurological abnormalities.

ORM1: the protein is an acute-phase plasma protein whose expression level increases during the acute inflammatory response. The specific function of the protein is unknown; it may be involved in immunosuppression.

In some embodiments of the application, the protein marker combination comprises LRG1, further comprising at least one of SERPINA1, ITIH3, CP, ORM1, C9, IGFBP2, and CNDP1.

In other embodiments of the application, the protein marker combination comprises C9 and further comprises at least one of LRG1, SERPINA1, ITIH3, CP, ORM1, IGFBP2, and CNDP1.

In some embodiments of the application, the protein marker combination comprises ITIH3, LRG1, C9, IGFBP2, and CNDP1.

In some embodiments of the application, the protein marker combination comprises CP, LRG1, C9, IGFBP2, and CNDP1.

In some embodiments of the application, the protein marker combination comprises ITIH3, CP, LRG1, C9 and CNDP1.

In some embodiments of the application, the protein marker combination comprises SERPINA1, LRG1, C9, IGFBP2, and CNDP1.

In some embodiments of the application, the protein marker combination comprises SERPINA1, CP, LRG1, C9, and CNDP1.

In some embodiments of the application, the protein marker combination comprises LRG1, ORM1, C9, IGFBP2, and CNDP1.

In some embodiments of the application, the protein marker combination comprises LRG1, SERPINA1, CP, ORM1, C9, and CNDP1.

In some embodiments of the application, the protein marker combination comprises LRG1, SERPINA1, ITIH3, CP, C9, and CNDP1.

In some embodiments of the application, the protein marker combination comprises LRG1, SERPINA1, ITIH3, C9, IGFBP2, and CNDP1.

In some embodiments of the application, the protein marker combination comprises SERPINA1, ITIH3, LRG1, C9, IGFBP2, and CNDP1.

In some embodiments of the application, the protein marker combination comprises SERPINA1, ITIH3, LRG1, ORM1, C9, and CNDP1.

In the present application, by detecting the expression level of each protein in the combination of protein markers, it is possible to predict whether a subject is at risk of having colorectal cancer, i.e., can be used for colorectal cancer early screening; it is also possible to diagnose whether the subject has colorectal cancer, which may be an auxiliary diagnosis, by the clinician in combination with other clinical indicators; a prognosis of a subject with colorectal cancer after receiving treatment can also be assessed.

In a second aspect the application provides a polypeptide combination for use in the prediction, diagnosis or prognosis of colorectal cancer, said polypeptide combination comprising at least one polypeptide from each protein in any of the protein marker combinations according to the first aspect of the application.

Optionally, the polypeptide from C9 comprises the amino acid sequence shown as SEQ ID No.1 or SEQ ID No. 2.

Optionally, the polypeptide from SERPINA1 comprises the amino acid sequence shown in SEQ ID No. 3.

Optionally, the polypeptide from ITIH3 comprises the amino acid sequence shown in SEQ ID No. 4.

Optionally, the polypeptide from CP comprises the amino acid sequence shown in SEQ ID No. 5.

Optionally, the polypeptide from LRG1 comprises the amino acid sequence shown as SEQ ID No.6 or SEQ ID No. 7.

Optionally, the polypeptide from IGFBP2 comprises the amino acid sequence set forth in SEQ ID No. 8.

Optionally, the polypeptide from KNG1 comprises the amino acid sequence shown in SEQ ID No. 9.

Optionally, the polypeptide from ORM1 comprises the amino acid sequence shown in SEQ ID No. 10.

Optionally, the polypeptide from PRDX2 comprises the amino acid sequence shown in SEQ ID No. 11.

Optionally, the polypeptide from CNDP1 comprises the amino acid sequence shown in SEQ ID No. 12.

In a third aspect, the application provides the use of a reagent for detecting the expression level of a combination of protein markers according to any one of the first aspects of the application for the preparation of a kit for the prediction, diagnosis or prognosis of colorectal cancer.

In some embodiments of the application, the detection reagent detects the expression level of each protein in the protein marker combination based on mass spectrometry.

In some embodiments of the application, the level of expression of each protein in the protein marker combination is detected by detecting the level of one or more polypeptides of each protein in the protein marker combination.

In a fourth aspect the application provides a kit for the prediction, diagnosis or prognosis of colorectal cancer comprising an expression level detection reagent for any one of the protein marker combinations of the first aspect of the application.

In a fifth aspect the present application provides a method for the prediction, diagnosis or prognosis of colorectal cancer comprising the steps of:

S1, obtaining expression level data of each protein in the protein marker combination according to any one of the first aspect of the application;

S2, constructing a machine learning model using the expression level data of each protein in the protein marker combination in population samples and the information on whether each sample is derived from a colorectal cancer patient, and judging, based on the machine learning model, whether a subject has colorectal cancer, is at risk of colorectal cancer, or has a good colorectal cancer prognosis.

In some embodiments of the application, the machine learning model is trained using any one of the following algorithms:

random forest algorithms, support vector machine algorithms, linear regression algorithms, logistic regression algorithms, bayesian classifiers, and neural network algorithms.

In some preferred embodiments of the application, the machine learning model is trained using a logistic regression algorithm.
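By way of illustration only, the logistic regression training of step S2 might be sketched as follows. The sketch uses plain batch gradient descent on the log-loss; the application does not state which solver, learning rate or preprocessing was used, so the function names, hyperparameters and the six two-marker samples below are all hypothetical.

```python
import math

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss.

    lr and epochs are illustrative defaults, not values from the application.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(w, b, x):
    # Model measurement result: estimated probability of the positive class.
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Hypothetical standardized levels of two markers; 1 = colorectal cancer, 0 = healthy.
X = [[1.2, 0.8], [0.9, 1.1], [1.5, 0.4], [-1.0, -0.7], [-0.8, -1.2], [-1.3, -0.2]]
y = [1, 1, 1, 0, 0, 0]
w, b = fit_logistic(X, y)
p_case = predict_proba(w, b, [1.0, 1.0])
p_control = predict_proba(w, b, [-1.0, -1.0])
```

The value returned by `predict_proba` plays the role of the model measurement result that is compared with a preset threshold.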

Further, a preset threshold is obtained from the machine learning model using the population samples. If the model measurement result of a subject sample is higher than the preset threshold, the subject is judged to have colorectal cancer, to be at risk of colorectal cancer, or to have a poor colorectal cancer prognosis. If the result is not higher than the preset threshold, the subject is judged not to have colorectal cancer, not to be at risk of colorectal cancer, or to have a good colorectal cancer prognosis.
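As a minimal sketch of this decision rule: a score strictly higher than the threshold is called positive, anything else negative. The application does not say how the preset threshold is derived from the population samples; maximizing Youden's J (sensitivity + specificity - 1) is used below purely as one common choice and is an assumption, as are the function names, scores and labels.

```python
def classify(score, threshold):
    """Positive only if the score is higher than the threshold."""
    return "positive" if score > threshold else "negative"

def youden_threshold(scores, labels):
    """Assumed selection rule: the candidate threshold that maximizes
    sensitivity + specificity - 1 on the population samples."""
    pos = sum(labels)
    neg = len(labels) - pos
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, l in zip(scores, labels) if l == 1 and s > t)
        tn = sum(1 for s, l in zip(scores, labels) if l == 0 and s <= t)
        j = tp / pos + tn / neg - 1.0
        if j > best_j:
            best_t, best_j = t, j
    return best_t

# Hypothetical model measurement results; 1 = colorectal cancer, 0 = healthy.
scores = [0.05, 0.10, 0.20, 0.30, 0.45, 0.60, 0.80, 0.95]
labels = [0, 0, 0, 0, 1, 0, 1, 1]
threshold = youden_threshold(scores, labels)
calls = [classify(s, threshold) for s in scores]
```

Because the comparison is strict, a sample whose score equals the threshold is judged negative, matching the "not higher than" wording above.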

In some embodiments of the application, in step S1, the blood sample of the subject is anticoagulated with EDTA to obtain plasma, the plasma protein is denatured, reduced, alkylated, digested with trypsin to obtain polypeptide fragments, desalted and evaporated to dryness, and subjected to liquid phase separation and mass spectrometry to determine the level of the protein marker combination based on the level of the polypeptide.

In some embodiments of the application, the mass spectrometry detection is performed using a triple quadrupole mass spectrometry method.

In a sixth aspect the application provides a system for colorectal cancer prediction, diagnosis or prognosis comprising the following modules:

a data input module for inputting expression level data of each protein of a subject in any of the protein marker combinations of the first aspect of the present application;

a data storage module for storing the expression level data of each protein in the protein marker combination in the population samples and the information on whether each sample is derived from a colorectal cancer patient;

a colorectal cancer analysis module, connected to the data input module and the data storage module respectively, which constructs a machine learning model using the population-sample expression level data of each protein in the protein marker combination stored in the data storage module and the information on whether each sample is derived from a colorectal cancer patient, and which judges, based on the machine learning model, whether the subject has colorectal cancer, is at risk of colorectal cancer, or has a good colorectal cancer prognosis.

In some embodiments of the application, the colorectal cancer analysis module further inputs the expression level data and the determination of each protein in the subject protein marker combination to the data storage module.

Beneficial effects of the application

Compared with the prior art, the application has the following beneficial effects:

Multiple protein markers in plasma are detected simultaneously by targeted mass spectrometry with absolute quantification, so the results are accurate and detection time is saved.

The protein marker combination of the application provides a non-invasive, plasma-based screening means for early colorectal cancer.

The method and the system of the application, when used for predicting, diagnosing or prognosing colorectal cancer, cause no trauma to patients, allow convenient sample collection, require only a small plasma volume, and offer high sensitivity and specificity; most importantly, they fill the gap left by the absence of an effective protein marker for early colorectal cancer.

The protein marker combination predicts early colorectal cancer with high accuracy and can prompt patients with a positive result to seek further diagnosis, so that colorectal cancer mortality in the population can be effectively reduced over the long term.

Applying machine learning to measurements of marker proteins in plasma makes it possible to dynamically monitor the disease state of a patient.

Drawings

FIG. 1 shows the receiver operating characteristic (ROC) curves of the single protein marker LRG1; the areas under the curve (AUC) for the training set, the test set and the independent validation set are 0.904, 0.85 and 0.8, respectively. In the figure, train denotes the training set, test the test set and valid the independent validation set; the axes are labeled True positive rate (sensitivity) and False positive rate (1-specificity).

FIG. 2 shows the ROC curves of the single protein marker SERPINA1; the AUCs for the training set, the test set and the independent validation set are 0.837, 0.779 and 0.771, respectively. Legend and axis labels are as in FIG. 1.

FIG. 3 shows the ROC curves of the single protein marker ITIH3; the AUCs for the training set, the test set and the independent validation set are 0.835, 0.921 and 0.79, respectively. Legend and axis labels are as in FIG. 1.

FIG. 4 shows the ROC curves of the single protein marker CP; the AUCs for the training set, the test set and the independent validation set are 0.823, 0.842 and 0.624, respectively. Legend and axis labels are as in FIG. 1.

FIG. 5 shows the ROC curves of the single protein marker ORM1; the AUCs for the training set, the test set and the independent validation set are 0.818, 0.783 and 0.697, respectively. Legend and axis labels are as in FIG. 1.

FIG. 6 shows the ROC curves of the single protein marker C9; the AUCs for the training set, the test set and the independent validation set are 0.875, 0.91 and 0.81, respectively. Legend and axis labels are as in FIG. 1.

FIG. 7 shows the ROC curves of the single protein marker IGFBP2; the AUCs for the training set, the test set and the independent validation set are 0.728, 0.738 and 0.737, respectively. Legend and axis labels are as in FIG. 1.

FIG. 8 shows the ROC curves of the combination of 5 protein markers; the AUCs for the training set, the test set and the independent validation set are 0.956, 0.954 and 0.893, respectively. Legend and axis labels are as in FIG. 1.

FIG. 9 shows the confusion matrices of the combination of 5 protein markers for a total of 121 colorectal cancer patients and 186 healthy individuals; 1 indicates positive and 0 indicates negative. In the figure, train denotes the training set, test the test set and valid the independent validation set; Truth denotes the actual status and Prediction denotes the predicted status.

Detailed Description

Unless otherwise indicated, implied by the context, or customary in the art, all parts and percentages in the present application are by weight, and the test and characterization methods used are those current as of the filing date of the present application. Where applicable, the disclosure of any patent, patent application, or publication referred to in this application is incorporated by reference in its entirety, as are the equivalent patents of those cited, particularly with respect to the definitions of relevant terms in the art disclosed in those documents. If the definition of a particular term disclosed in the prior art is inconsistent with any definition provided in the present application, the definition provided in the present application controls.

The numerical ranges in the present application are approximations and may therefore include values outside the stated range unless otherwise indicated. A numerical range includes all values from the lower value to the upper value in increments of 1 unit, provided that there is a separation of at least 2 units between any lower value and any higher value. For ranges containing values less than 1 or containing fractions greater than 1 (e.g., 1.1, 1.5, etc.), 1 unit is considered to be 0.0001, 0.001, 0.01, or 0.1, as appropriate. For ranges containing single-digit numbers less than 10 (e.g., 1 to 5), 1 unit is generally considered to be 0.1. These are merely specific examples of what is intended, and all possible combinations of numerical values between the lowest value and the highest value enumerated are to be considered expressly stated in this disclosure.

The terms "comprises," "comprising," "including," and their derivatives do not exclude the presence of any other component, step or process, whether or not such other component, step or process is disclosed in the present application. For the avoidance of doubt, all uses of the terms "comprising", "including" or "having" herein may, unless expressly stated otherwise, include any additional additive, adjuvant or compound. In contrast, the term "consisting essentially of" excludes from the scope of any subsequently recited term any other component, step or process, except those that are not essential to operability. The term "consisting of" excludes any component, step or process not specifically described or listed. The term "or" refers to the listed members individually or in any combination unless explicitly stated otherwise.

In order to make the technical problems, technical schemes and beneficial effects solved by the application more clear, the application is further described in detail below with reference to the embodiments.

Examples

The following examples are presented herein to demonstrate preferred embodiments of the present application. It will be appreciated by those skilled in the art that the techniques disclosed in the examples which follow represent techniques discovered by the inventor to function in the practice of the application, and thus can be considered to constitute preferred modes for its practice. Those of skill in the art should, in light of the present disclosure, appreciate that many changes can be made in the specific embodiments which are disclosed and still obtain a like or similar result without departing from the spirit or scope of the application.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.

Those skilled in the art will recognize, or be able to ascertain using no more than routine experimentation, many equivalents to the specific embodiments of the application described herein. Such equivalents are intended to be encompassed by the claims.

The experimental methods in the following examples are conventional methods unless otherwise specified. The instruments used in the following examples are laboratory conventional instruments unless otherwise specified; the test materials used in the examples described below, unless otherwise specified, were purchased from conventional biochemical reagent stores.

Example 1 discovery of protein markers

The inventors collected fresh blood samples from 101 colorectal cancer patients and 89 sex- and age-matched healthy controls for the discovery of protein markers.

1. Blood sample processing

After anticoagulation treatment, fresh blood samples were centrifuged at 1000 g for 5 min to obtain plasma samples, which were stored long-term in a freezer at -70 °C.

Plasma samples were diluted 50-fold and the protein concentration was determined by BCA assay: BSA standards were serially diluted to 2, 1, 0.5, 0.25, 0.125 and 0.0625 mg/mL to serve as the standard curve for calibrating plasma concentrations. The diluted samples and standards were added to a 96-well plate, pre-prepared BCA working solution was added, the reaction was carried out at 37 °C for 30 min, and the plasma protein concentration was determined from the absorbance at 562 nm.
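The standard-curve calculation can be illustrated with the sketch below, which fits a straight line to the six BSA standards and back-calculates the undiluted plasma concentration. It assumes a linear absorbance response over the whole range, which real BCA curves only approximate; the absorbance values are hypothetical and the function names are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# BSA standard concentrations from the text, in mg/mL.
standards = [2.0, 1.0, 0.5, 0.25, 0.125, 0.0625]
# Hypothetical A562 readings, generated as perfectly linear for clarity.
absorbance = [0.1 + 0.5 * c for c in standards]
slope, intercept = fit_line(standards, absorbance)

def plasma_concentration(a562, dilution=50):
    """Concentration of the undiluted plasma in mg/mL for a diluted-sample reading."""
    return (a562 - intercept) / slope * dilution

conc = plasma_concentration(0.8)  # diluted sample reads A562 = 0.8 (hypothetical)
```

The dilution factor of 50 is the one given in the text; multiplying by it converts the concentration read from the curve back to that of the original plasma.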

50 μg of protein was taken and ammonium bicarbonate solution was added to a final concentration of 50 mM. DTT was added to a final concentration of 10 mM and the sample was heated at 95 °C for 10 min. After the sample returned to room temperature, IAA was added to a final concentration of 15 mM and the reaction was carried out in the dark for 30 min. 1 μg of trypsin was added to each sample, and digestion was carried out overnight in a metal bath at 37 °C for 12-14 h. The next day, formic acid was added to a final concentration of 1% to acidify the sample and terminate the digestion.

2. Differential proteins and polypeptides

Target selection began with finding differentially expressed proteins. The inventors acquired mass spectra in data-independent acquisition (DIA) mode for 190 sex- and age-matched plasma samples (89 healthy people and 101 colorectal cancer patients), analyzed them with DIA-NN software to obtain expression data for proteins and polypeptides, and normalized the data by total protein intensity, giving a total of 714 proteins and 7988 polypeptides. For proteins and polypeptides whose expression followed a normal distribution, differential expression was identified with the t-test; for those whose expression did not follow a normal distribution, the Wilcoxon non-parametric test was used. In total, 96 differentially expressed proteins and 832 differentially expressed polypeptides were obtained, and these were integrated to obtain the differentially expressed polypeptides.
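For the non-normally distributed case, a two-sample Wilcoxon rank-sum test can be sketched as below. This is an illustrative implementation using the normal approximation without tie or continuity correction; the application does not specify the software, the exact test variant or the significance cutoff, and the intensity values are hypothetical.

```python
import math

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def rank_sum_test(a, b):
    """Two-sided rank-sum test; returns (U statistic of a, z score, p value)."""
    n1, n2 = len(a), len(b)
    ranks = average_ranks(list(a) + list(b))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2.0
    mu = n1 * n2 / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u1 - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail of the standard normal
    return u1, z, p

# Hypothetical normalized intensities of one polypeptide.
cases = [5.1, 6.3, 5.8, 7.0, 6.6, 5.9, 6.1, 6.8]
controls = [4.2, 4.9, 5.0, 4.4, 5.3, 4.7, 4.1, 4.8]
u1, z, p = rank_sum_test(cases, controls)
```

With 7988 polypeptides tested, some form of multiple-testing control would normally be applied to the resulting p values; the source does not say whether or how this was done.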

3. Marker protein screening

Candidate polypeptides capable of distinguishing colorectal cancer patients from healthy people were selected by the random forest method: the mean Gini index of each target was calculated by the random forest, the targets were ranked by importance, and the biological functions of the proteins were also taken into account. Finally, 10 top-ranked proteins were obtained, namely LRG1, SERPINA1, ITIH3, CP, ORM1, C9, IGFBP2, CNDP1, KNG1 and PRDX2; the corresponding polypeptide sequences are shown in Table 1:
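The quantity behind a Gini-based importance ranking can be illustrated for a single split. In a random forest, the decrease in Gini impurity produced by each split on a feature is typically averaged over all splits and trees to give that feature's importance; the sketch below computes only one such decrease, with hypothetical values, labels and cutoff, and is not the inventors' actual pipeline.

```python
def gini(labels):
    """Gini impurity of binary labels: 2p(1-p), which equals 1 - p^2 - (1-p)^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)

def gini_decrease(values, labels, cutoff):
    """Impurity of the parent node minus the size-weighted impurity of the
    two child nodes obtained by splitting at `cutoff`."""
    left = [l for v, l in zip(values, labels) if v <= cutoff]
    right = [l for v, l in zip(values, labels) if v > cutoff]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted

labels = [0, 0, 0, 1, 1, 1]            # 1 = colorectal cancer, 0 = healthy
informative = [1, 2, 3, 4, 5, 6]       # hypothetical marker that separates the groups
uninformative = [1, 4, 2, 5, 3, 6]     # hypothetical marker that barely separates them

d_good = gini_decrease(informative, labels, cutoff=3)
d_poor = gini_decrease(uninformative, labels, cutoff=3)
```

A marker that separates the two groups cleanly yields a large decrease, so it ranks higher than one that leaves both child nodes mixed.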

TABLE 1 polypeptide sequences of candidate proteins

Example 2 machine learning model establishment

For each polypeptide, a ¹³C- and ¹⁵N-labeled heavy-isotope polypeptide at an appropriate concentration was added to the digested plasma sample and mixed thoroughly, followed by desalting and evaporation to dryness using a 96-well SOLA solid-phase extraction device.

For each polypeptide, a standard curve covering an appropriate concentration range (9 standard curve points) was prepared, and an equal amount of internal standard was added to each standard curve point. Mass spectrometry was performed on an AB Sciex 5500 QTRAP mass spectrometer; the polypeptides were separated on a C18 column (Phenomenex) at a column temperature of 45 °C, and 15 μL of each standard was injected. 150 μL of 0.1% formic acid was added to each dried sample, the sample was mixed thoroughly, and 15 μL was injected for mass spectrometry detection. The liquid phase separation conditions are shown in Table 2:
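The role of the heavy-isotope internal standard can be illustrated as follows: quantification is based on the ratio of the endogenous (light) peak area to the heavy peak area, so run-to-run variation in absolute signal cancels out. The sketch assumes an unweighted linear calibration over the 9 curve points; the concentrations, units, peak areas and function names are all hypothetical.

```python
def fit_calibration(concs, ratios):
    """Least-squares line: area ratio = slope * concentration + intercept."""
    n = len(concs)
    mx, my = sum(concs) / n, sum(ratios) / n
    sxx = sum((x - mx) ** 2 for x in concs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(concs, ratios))
    slope = sxy / sxx
    return slope, my - slope * mx

def quantify(light_area, heavy_area, slope, intercept):
    """Concentration from the light/heavy peak-area ratio of one injection."""
    return (light_area / heavy_area - intercept) / slope

# Nine hypothetical calibration points (arbitrary concentration units).
concs = [1, 2, 5, 10, 20, 50, 100, 200, 500]
ratios = [0.02 * c for c in concs]  # idealized response for illustration
slope, intercept = fit_calibration(concs, ratios)

# Two hypothetical injections of the same sample with different absolute signal.
q1 = quantify(40000.0, 100000.0, slope, intercept)
q2 = quantify(20000.0, 50000.0, slope, intercept)
```

Both injections give the same concentration because only the light/heavy ratio enters the calculation.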

Table 2 Liquid phase separation conditions

Time (min)	Event	Parameter	Flow rate (mL/min)
0.01	Pump B Conc.	6	0.25
2.0	Pump B Conc.	6	0.25
18.0	Pump B Conc.	28	0.25
18.5	Pump B Conc.	28	0.25
21.5	Pump B Conc.	98	0.25
22	Pump B Conc.	98	0.25
25	Pump B Conc.	6	0.25
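To read the gradient program in Table 2 at an arbitrary time, the tabulated points can be interpolated as sketched below. Two assumptions are made that the source does not state: the parameter column is taken to be the percentage of mobile phase B, and the composition is taken to change linearly between consecutive time points.

```python
# (time in min, pump B concentration) pairs transcribed from Table 2.
GRADIENT = [(0.01, 6), (2.0, 6), (18.0, 28), (18.5, 28),
            (21.5, 98), (22.0, 98), (25.0, 6)]

def pump_b_at(t):
    """Pump B concentration at time t (min), by linear interpolation.

    Times before the first or after the last table entry are clamped
    to the nearest tabulated value.
    """
    if t <= GRADIENT[0][0]:
        return float(GRADIENT[0][1])
    for (t0, b0), (t1, b1) in zip(GRADIENT, GRADIENT[1:]):
        if t0 <= t <= t1:
            return b0 + (b1 - b0) * (t - t0) / (t1 - t0)
    return float(GRADIENT[-1][1])

mid_gradient = pump_b_at(10.0)   # on the ramp between 2.0 and 18.0 min
wash_ramp = pump_b_at(20.0)      # on the ramp between 18.5 and 21.5 min
after_run = pump_b_at(30.0)      # clamped to the final table value
```

Under these assumptions the program is a hold, a shallow separation ramp, a steep ramp to a high-organic wash, and a return to starting conditions.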

Triple quadrupole targeted mass spectrometry was then performed; the ion pair information for multiple reaction monitoring (MRM) is shown in Table 3.

Table 3 MRM monitoring information

After mass spectrometry, the polypeptide concentrations corresponding to the respective protein markers were quantified and used for model building. Of the 190 samples, 80% (152) were randomly selected as the training set and the remaining 20% (38) served as the test set, and logistic regression models were then built for the 10 candidate protein markers. The inventors found that seven single protein markers, namely LRG1, SERPINA1, ITIH3, CP, ORM1, C9 and IGFBP2, had very good predictive power in both the training and test sets; their ROC curves are shown in FIGS. 1 to 7, respectively.
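The random 80%/20% partition of the 190 samples can be sketched as follows. The fixed seed, the function name and the use of integer sample identifiers are assumptions for reproducibility of the illustration; the application does not state whether the split was stratified by group, and this sketch does not stratify.

```python
import random

def split_samples(sample_ids, train_fraction=0.8, seed=0):
    """Shuffle the sample IDs and split them into training and test sets."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)  # local generator, global state untouched
    n_train = round(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]

# 190 hypothetical sample identifiers (0..189).
train_ids, test_ids = split_samples(range(190))
```

For 190 samples this yields 152 training and 38 test samples, matching the counts given above.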

Example 3 model verification

The inventors selected 121 colorectal cancer patients and 186 matched healthy people as a validation set to verify the model. To quantify the polypeptides more accurately and reduce errors introduced by complicated experimental handling, the inventors did not perform high-abundance protein depletion, which also greatly reduces the cost of sample pretreatment. After protein extraction and concentration determination, liquid phase separation and mass spectrometry detection were carried out.

Example 4 modeling and validation of multiple marker combinations

The inventors further used the concentrations of the optimal combination of the aforementioned proteins, namely 5 protein markers (ITIH3, LRG1, SERPINA1, IGFBP2, and CNDP1), to build a logistic regression model that better discriminates colorectal cancer patients from healthy people. Specifically, the logistic regression model was trained on 77 colorectal cancer patients and 79 healthy people to learn the discriminating effect of the 5 protein markers. A threshold of 0.34 was set for the logistic regression model, based on the model results for all 307 plasma samples, and the model was independently validated using 44 colorectal cancer patients and 107 healthy people. A sample was judged positive if its model result was above the threshold and negative if its model result was below the threshold.
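The decision rule can be illustrated with a short sketch. The source discloses only the threshold (0.34) and the five markers. The intercept, the weights, and the assumption that the inputs are standardized concentrations are hypothetical placeholders, not the fitted model. The source also does not say how a result exactly equal to the threshold is handled, so it is treated as negative here.

```python
import math

# HYPOTHETICAL coefficients for illustration only; the source does not
# disclose the fitted intercept or weights of the logistic regression model.
INTERCEPT = -1.0
WEIGHTS = {
    "ITIH3": 0.5,
    "LRG1": 0.8,
    "SERPINA1": 0.3,
    "IGFBP2": 0.6,
    "CNDP1": -0.7,
}
THRESHOLD = 0.34  # decision threshold stated in the source


def predict_probability(concentrations):
    """Logistic model output for one sample.

    `concentrations` maps each marker name to its (assumed standardized)
    concentration.
    """
    z = INTERCEPT + sum(w * concentrations[m] for m, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def classify(concentrations, threshold=THRESHOLD):
    """Return 'positive' if the model output is above the threshold."""
    p = predict_probability(concentrations)
    return "positive" if p > threshold else "negative"


if __name__ == "__main__":
    sample = {m: 1.0 for m in WEIGHTS}  # made-up input values
    print(round(predict_probability(sample), 3), classify(sample))
```

A threshold below 0.5, such as 0.34, trades some specificity for higher sensitivity, which is a common choice for a screening test.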

The ROC curves are shown in FIG. 8. The areas under the curve (AUC) for the training set, the test set, and the independent validation set are 0.956, 0.954, and 0.893, respectively. The final result was 92% sensitivity, 81% specificity, 94% negative predictive value, and 76% positive predictive value, as shown in FIG. 9.
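The four reported metrics follow from a 2×2 confusion matrix. The sketch below shows the arithmetic. The source does not give the confusion matrix, so the counts are illustrative values chosen to be consistent with the reported percentages, under the assumption that the final result refers to all 307 samples (121 patients and 186 healthy people). The function name `screening_metrics` is hypothetical.

```python
def screening_metrics(tp, fn, tn, fp):
    """Compute screening metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # patients correctly called positive
        "specificity": tn / (tn + fp),  # healthy correctly called negative
        "ppv": tp / (tp + fp),          # positive calls that are patients
        "npv": tn / (tn + fn),          # negative calls that are healthy
    }


if __name__ == "__main__":
    # ILLUSTRATIVE counts, not reported in the source:
    # 111 + 10 = 121 patients, 151 + 35 = 186 healthy people.
    metrics = screening_metrics(tp=111, fn=10, tn=151, fp=35)
    for name, value in metrics.items():
        print(f"{name}: {value:.0%}")
```

With these counts the sketch prints 92%, 81%, 76%, and 94% for sensitivity, specificity, positive predictive value, and negative predictive value. Positive and negative predictive values depend on the ratio of patients to healthy people in the cohort and would differ in a screening population with lower prevalence.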

In addition, the inventors identified 10 other protein marker combinations that performed well during machine learning. The results are shown in Table 4.

Table 4 Protein marker combinations

Model	Training set AUC	Test set AUC	Independent validation set AUC
CP+LRG1+C9+IGFBP2+CNDP1	0.955	0.945	0.870
ITIH3+CP+LRG1+C9+CNDP1	0.953	0.945	0.872
SERPINA1+LRG1+C9+IGFBP2+CNDP1	0.952	0.939	0.884
SERPINA1+CP+LRG1+C9+CNDP1	0.952	0.942	0.870
LRG1+ORM1+C9+IGFBP2+CNDP1	0.947	0.935	0.891
LRG1+SERPINA1+CP+ORM1+C9+CNDP1	0.950	0.939	0.861
LRG1+SERPINA1+ITIH3+CP+C9+CNDP1	0.951	0.941	0.866
LRG1+SERPINA1+ITIH3+C9+IGFBP2+CNDP1	0.949	0.936	0.892
SERPINA1+ITIH3+LRG1+C9+IGFBP2+CNDP1	0.952	0.941	0.887
SERPINA1+ITIH3+LRG1+ORM1+C9+CNDP1	0.951	0.941	0.890

All documents mentioned in this disclosure are incorporated by reference in this disclosure as if each were individually incorporated by reference. Further, it will be appreciated that various changes and modifications may be made by those skilled in the art after reading the above teachings, and such equivalents are intended to fall within the scope of the application as defined in the appended claims.

Claims

1. A protein marker combination for colorectal cancer prediction, diagnosis or prognosis, characterized in that the protein marker combination comprises at least one selected from LRG1, SERPINA1, ITIH3, CP, ORM1, C9, IGFBP2 and CNDP1.

2. The protein marker combination according to claim 1, wherein the protein marker combination comprises LRG1 and further comprises at least one of SERPINA1, ITIH3, CP, ORM1, C9, IGFBP2, and CNDP1.

3. The protein marker combination of claim 1, wherein the protein marker combination comprises C9 and further comprises at least one of LRG1, SERPINA1, ITIH3, CP, ORM1, IGFBP2, and CNDP1.

4. A combination of polypeptides for use in the prediction, diagnosis or prognosis of colorectal cancer, characterized in that the combination of polypeptides comprises at least one polypeptide from each protein in the combination of protein markers according to any one of claims 1-3.

5. Use of a reagent for detecting the expression level of a combination of protein markers according to any one of claims 1 to 3 for the preparation of a kit for the prediction, diagnosis or prognosis of colorectal cancer.

6. The use according to claim 5, wherein the detection reagent detects the expression level of each protein in the protein marker combination based on mass spectrometry.

7. A kit for colorectal cancer prediction, diagnosis or prognosis, comprising a reagent for detecting the expression level of a protein marker combination, wherein the protein marker combination comprises ITIH3, LRG1 and C9.

8. A system for colorectal cancer prediction, diagnosis or prognosis comprising the following modules:

a data input module for inputting expression level data for each protein in a protein marker combination of a subject, the protein marker combination comprising ITIH3, LRG1 and C9;

9. The system of claim 8, wherein the machine learning model is trained using any one of the following algorithms:

10. The system of claim 8 or 9, wherein the colorectal cancer analysis module further inputs into the data storage module expression level data and determinations of each protein in the subject protein marker combination.