CN116087516A - Protein biomarker and logistic regression prediction model for gastric cancer screening

Publication number
CN116087516A
CN116087516A CN202111316030.XA CN202111316030A CN116087516A CN 116087516 A CN116087516 A CN 116087516A CN 202111316030 A CN202111316030 A CN 202111316030A CN 116087516 A CN116087516 A CN 116087516A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111316030.XA
Other languages
Chinese (zh)
Inventor
陈丽萌
张曦
彭海翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Puen Haihui Medical Laboratory Co ltd
Original Assignee
Shanghai Puen Haihui Medical Laboratory Co ltd

Classifications

    • G01N33/57446 Specifically defined cancers of stomach or intestine
    • G01N33/573 Immunoassay; Biospecific binding assay; Materials therefor for enzymes or isoenzymes
    • G01N33/57484 Immunoassay; Biospecific binding assay; Materials therefor for cancer involving compounds serving as markers for tumor, cancer, neoplasia, e.g. cellular determinants, receptors, heat shock/stress proteins, A-protein, oligosaccharides, metabolites
    • G01N33/6854 Immunoglobulins
    • G01N33/6887 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing involving proteins, peptides or amino acids from muscle, cartilage or connective tissue
    • G01N33/721 Haemoglobin
    • G16B35/20 Screening of libraries
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B5/00 ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G01N2333/4712 Muscle proteins, e.g. myosin, actin, protein
    • G01N2333/705 Assays involving receptors, cell surface antigens or cell surface determinants
    • G01N2333/8121 Serpins
    • G01N2333/908 Oxidoreductases (1.) acting on hydrogen peroxide as acceptor (1.11)
    • G01N2333/988 Lyases (4.), e.g. aldolases, heparinase, enolases, fumarase
    • Y02A50/30 Against vector-borne diseases, e.g. mosquito-borne, fly-borne, tick-borne or waterborne diseases whose impact is exacerbated by climate change

Abstract

The invention relates to protein biomarkers for gastric cancer screening and to a logistic regression prediction model based on them. By detecting whether a sample contains a defined group of gastric cancer biomarkers and using those biomarkers in the logistic regression prediction model, the invention yields a probability prediction of gastric cancer with high specificity and sensitivity.

Description

Protein biomarker and logistic regression prediction model for gastric cancer screening
Technical Field
The present invention relates to protein biomarkers for gastric cancer screening and logistic regression prediction models using the same.
Background
Gastric cancer (gastric carcinoma, GC) is a malignancy originating from the gastric mucosal epithelium. It is the fifth most common cancer worldwide and the third leading cause of cancer death, and it ranks among the leading cancers in China. Because early gastric cancer often has no specific digestive tract symptoms, most patients are diagnosed at middle or late stages, when prognosis is poor, treatment options are limited, and the five-year survival rate is low. Existing biomarkers for the diagnosis and prognosis of gastric cancer have low sensitivity and specificity, so diagnosis still relies mainly on X-ray barium meal examination, fiberoptic gastroscopy, abdominal ultrasound, spiral CT, positron emission tomography and the like. Each of these methods has drawbacks: imaging techniques have difficulty finding small tumors and have a high miss rate in early screening, while gastroscopy is invasive and inconvenient, and most people are reluctant to undergo it. This is one of the important reasons for the high mortality of digestive tract tumors in China.
Liquid biopsy is a novel detection technique that is receiving a great deal of attention. Body fluids such as peripheral blood, saliva, urine, or gastric lavage fluid/gastric juice can serve as sources of specific biomarkers and provide important data for the screening and diagnosis of gastric cancer.
Biomarkers are a class of biochemical indicators that mark changes or potential changes in cellular and subcellular structures, tissues, organs, systems or functions. They can be used for disease diagnosis, for determining disease stage, or for evaluating the safety and efficacy of new drugs or therapies. Protein biomarkers have particular advantages in accurately and sensitively detecting early, low-level damage; they can provide early warning of tumor occurrence and give clinicians a basis for auxiliary diagnosis.
However, there is currently much room for improvement in the screening and research of protein biomarkers.
First, most current proteomics-based studies give too little consideration to whether a protein biomarker will remain effective in subsequent practical application, which is one reason protein biomarkers fail in practice.
Second, most current proteomics-based studies do not distinguish whether missing protein quantitative values are missing at random or not at random, and they fill both kinds in the same way, so protein biomarkers that discriminate cancer very well may be overlooked. For example, a protein that is expressed in most healthy or benign-disease subjects but not expressed in most cancer patients is an excellent discriminator of cancer. If its missing values are uniformly filled with the mean or median, the protein is very likely to be filtered out at the subsequent biomarker screening stage, and an important protein biomarker is lost. Only a few studies have distinguished the sources of missing protein quantitative values and filled them by different methods, and the methods used in those studies are difficult to operate.
Third, current studies show that the way data are processed has a great impact on the screening of proteomic data. For example, the choice of normalization method is critical for subsequent analysis, yet the optimal normalization methods recommended by different publications differ greatly. In addition, proteomic data have two prominent features, "sparsity" and "high dimensionality", which are also the two most troublesome aspects of data processing.
Disclosure of Invention
Building on prior work on protein biomarker screening, the inventors found that an extremely good screening effect is obtained when the following measures are combined: (i) a step is added to the screening that filters proteins according to the proportion of quantitative values present for each protein in the tumor group or the control group; (ii) a correlation-coefficient-based method is used to identify non-random missing values in the protein quantitative matrix, and non-random and random missing values are filled in different ways; (iii) several commonly used normalization methods are each applied to the proteomic data, and each normalized data set is used separately for machine-learning-based protein biomarker screening; and (iv) because machine-learning-based feature selection handles the sparsity and high dimensionality of proteomic data very well, a genetic-algorithm-based random forest model is used to screen the protein biomarkers.
The present invention is based on the findings described above, and therefore, one aspect of the present invention relates to a serum protein biomarker for gastric cancer screening, wherein the serum protein biomarker is one or more selected from the group consisting of CA-II, HBG1, HBA1, NKEF-A, HBD, HBB, C1Inh, IGLV8-61, CA-I, PAMP, CAT and TN-X.
Another aspect of the invention relates to a detection method using a serum protein biomarker for gastric cancer screening as described above, characterized by comprising the steps of:
a) Separately determining the protein content of each serum protein biomarker in a sample;
b) Filling missing values: if the content of one or more serum protein biomarkers is missing from the measurement result, a missing CA-II or CAT content is filled with the constant 1.0, and a missing LRG or HBG1 content is filled by the K-nearest-neighbor method;
c) Normalizing the protein content data obtained in step b) by one of the eight normalization methods provided by the invention, and calculating the prediction probability P with the probability prediction formula corresponding to that normalization method. When P ≥ 0.5, the sample is judged to be a gastric cancer sample; when P < 0.5, it is judged to be a gastric benign disease sample.
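As a minimal sketch of the constant-fill rule in step b) and the decision rule in step c), the following Python fragment may help. The dictionary layout, the use of `None`/NaN to mark a missing value, and the function names `fill_constant` and `classify` are illustrative assumptions, not part of the disclosure; the K-nearest-neighbor filling of LRG and HBG1 is not handled here.

```python
import math

# Markers filled with a constant when missing (step b); names follow the text.
CONSTANT_FILL_MARKERS = ("CA-II", "CAT")
CONSTANT_FILL_VALUE = 1.0


def fill_constant(sample):
    """Return a copy of `sample` with a missing CA-II or CAT set to 1.0.

    Assumption: a missing value is represented as None, NaN, or an absent key.
    LRG and HBG1 are left untouched (they are filled by a neighbor method).
    """
    filled = dict(sample)
    for marker in CONSTANT_FILL_MARKERS:
        value = filled.get(marker)
        if value is None or (isinstance(value, float) and math.isnan(value)):
            filled[marker] = CONSTANT_FILL_VALUE
    return filled


def classify(p):
    """Decision rule of step c): P >= 0.5 is judged as gastric cancer."""
    return "gastric cancer" if p >= 0.5 else "gastric benign disease"
```

For example, `classify(0.73)` returns `"gastric cancer"`, while `classify(0.49)` returns `"gastric benign disease"`.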
Drawings
FIG. 1 is a flow chart for determining whether a sample is derived from a gastric cancer patient.
FIG. 2 is a flow chart of protein biomarker screening.
FIG. 3 is a volcano plot of differentially expressed proteins based on the t-test.
FIG. 4 is a cluster map of differentially expressed proteins based on the t-test.
FIG. 5 shows cross-validation ROC curves of the logistic regression models of the four serum protein biomarkers CA-II, CAT, LRG and HBG1 under the eight normalization methods.
FIG. 6 shows cross-validation ROC curves of the logistic regression models of the three serum protein biomarkers CA-II, CAT and LRG under the eight normalization methods.
FIG. 7 shows cross-validation ROC curves of the logistic regression models of the two serum protein biomarkers CAT and LRG under the eight normalization methods.
FIG. 8 shows cross-validation ROC curves of the logistic regression models of the two serum protein biomarkers CA-II and CAT under the eight normalization methods.
FIG. 9 shows cross-validation ROC curves of the logistic regression models of the two serum protein biomarkers CA-II and LRG under the eight normalization methods.
FIG. 10 shows ROC curves on the test set of the logistic regression models of the four serum protein biomarkers CA-II, CAT, LRG and HBG1 under the eight normalization methods.
FIG. 11 shows ROC curves on the test set of the logistic regression models of the three serum protein biomarkers CA-II, CAT and LRG under the eight normalization methods.
FIG. 12 shows ROC curves on the test set of the logistic regression models of the two serum protein biomarkers CAT and LRG under the eight normalization methods.
FIG. 13 shows ROC curves on the test set of the logistic regression models of the two serum protein biomarkers CA-II and CAT under the eight normalization methods.
FIG. 14 shows ROC curves on the test set of the logistic regression models of the two serum protein biomarkers CA-II and LRG under the eight normalization methods.
Detailed Description
Hereinafter, specific embodiments of the present invention will be described.
The term "cancer" as used herein refers to the presence of cells that have characteristics typical of oncogenic cells, such as uncontrolled proliferation, immortality, metastatic potential, rapid growth and proliferation rate, as well as certain characteristic morphological features known in the art.
In one example, the "cancer" may be gastric cancer (also called stomach cancer). In one embodiment, "cancer" may include premalignant and malignant cancer. Thus, the term "gastric cancer" encompasses all stages of gastric cancer as described in the 2020 CSCO guidelines for the diagnosis and treatment of gastric cancer.
In one example, as will be appreciated by those skilled in the art, the method described herein does not involve steps performed by a physician. The results obtained by the method therefore need to be combined with clinical data and other clinical manifestations before a physician can give the subject a final diagnosis. The final diagnosis of whether a subject has gastric cancer falls within the physician's purview and is not considered part of the present disclosure.
Thus, the terms "determining," "detecting," and "diagnosing" as used herein refer to identifying a subject as having a probability or likelihood of having a disease at any stage of development (e.g., gastric cancer) or determining a subject's susceptibility to developing the disease. In one example, "diagnosis," "determination," "detection" is performed prior to manifestation of symptoms. In one example, "diagnosing," "determining," "detecting" allows a clinician/physician (in combination with other clinical manifestations) to confirm gastric cancer in a subject suspected of having gastric cancer.
As used herein, the term "sample" means a sample collected from a subject for detecting the type and amount of protein therein. The subject sample may or may not come from the circulatory system, i.e., from blood. It may be any sample containing protein suitable for detection, with sources including whole blood, bone marrow, pleural fluid, peritoneal fluid, cerebrospinal fluid, milk, urine, tears, sweat, saliva, organ secretions, and lavage fluid of the bronchi, nasal cavity or throat.
In one example, the subject sample is blood, including, for example, whole blood or any portion or component thereof. Blood samples suitable for use in the present invention may be drawn from any known source of blood cells or their components, such as veins, arteries, peripheral vessels, tissue, spinal cord and the like. The sample may be obtained and processed using well-known, conventional clinical methods (e.g., procedures for drawing and processing whole blood).
In one example, the subject sample is serum. Methods for obtaining serum from blood are well known to those skilled in the art.
The inventors have found that gastric cancer can be diagnosed with high specificity and sensitivity by detecting whether a sample contains a group of gastric cancer biomarkers. The biomarkers of the invention have extremely high specificity and sensitivity, especially for early gastric cancer, which has been difficult to diagnose in the past.
As used herein, a "biomarker" or "marker" is a biological molecule that is objectively measured as a characteristic indicator of the physiological state of a biological system. For the purposes of this disclosure, biological molecules include ions, small molecules, peptide fragments, peptide chains, proteins, peptides and proteins with post-translational modifications, nucleosides, nucleotides and polynucleotides, including RNA and DNA, glycoproteins, lipoproteins, and a variety of covalent or non-covalent modifications of these types of molecules. Biological molecules include any of these entities that are inherent, characteristic, and/or critical to the functioning of a biological system. Most of the biomarkers are polypeptides, although they may also be pre-translational forms of mRNA or modified mRNA representing the gene product expressed as a polypeptide, or they may include post-translational modifications of the polypeptide.
As used herein, "protein biomarker" refers to a biomarker that carries protein information. In one example, it refers to a biomarker comprising a protein sequence. Further, in one example, it refers to a full-length protein, a single peptide chain, a peptide fragment characteristic of a peptide chain, or a stable-isotope-labeled protein or stable-isotope-labeled characteristic peptide fragment thereof.
According to the invention, a group of serum protein biomarkers for gastric cancer screening was found through a series of steps: abnormal feature processing, feature filtering based on the effectiveness requirements of subsequent practical application, identification of non-random and random missing values, missing value filling, identification and processing of abnormal samples, data normalization, protein biomarker screening, logistic regression model training, evaluation of the logistic regression models, and evaluation of their prediction performance. Further screening yielded 4 core serum protein biomarkers, and 40 logistic regression prediction models were constructed from these 4 protein biomarkers. The models effectively distinguish gastric cancer samples from gastric benign disease samples and have predictive ability.
Some terms used in the embodiments of the present invention are enumerated below. Within the scope of the present description and claims, the relevant terms are defined as follows. Other terms not listed are defined as commonly used in the art, the meaning of which is well known to those skilled in the art.
Non-random missing values (non-random deletions): protein quantitative data that are not detected in some samples because of properties of the samples themselves.
Random missing values (random deletions): protein quantitative data that are not detected in some samples because of random instrument disturbance during the actual protein quantification process.
K-nearest-neighbor filling: a well-known missing value filling method in which a sample's missing value is filled using the combined information of several of its neighboring samples.
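To make the neighbor-based filling concrete, here is a minimal pure-Python sketch. The text does not specify the distance measure, the number of neighbors, or how neighbor values are combined, so the choices below (root-mean-square distance over features observed in both samples, `k=3`, unweighted mean of the neighbors' values) and the function name `knn_impute` are assumptions made only for illustration.

```python
import math


def knn_impute(samples, target_index, feature, k=3):
    """Estimate a missing `feature` for samples[target_index].

    samples: list of dicts mapping feature name -> float, or None if missing.
    Distance (assumption): root-mean-square difference over the features
    that are observed in both the target and the candidate neighbor.
    Fill value (assumption): unweighted mean over the k nearest neighbors.
    """
    target = samples[target_index]
    candidates = []
    for i, other in enumerate(samples):
        if i == target_index or other.get(feature) is None:
            continue  # a neighbor must have the feature observed
        shared = [f for f in target
                  if f != feature
                  and target[f] is not None
                  and other.get(f) is not None]
        if not shared:
            continue  # nothing in common to measure distance on
        dist = math.sqrt(
            sum((target[f] - other[f]) ** 2 for f in shared) / len(shared))
        candidates.append((dist, other[feature]))
    if not candidates:
        raise ValueError("no neighbor has an observed value for " + feature)
    candidates.sort(key=lambda pair: pair[0])
    nearest = candidates[:k]
    return sum(value for _, value in nearest) / len(nearest)
```

A production pipeline would more likely rely on an established imputation library; the sketch only shows how "several neighbors" can determine the fill value.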
Log2 normalization: i.e., log2 transformation; the expression values in the expression matrix are log-transformed to base 2.
Log2+ Median normalization: the expression values in the expression matrix are first log2-transformed; each protein expression value of each sample in the transformed matrix is then divided by the median of all protein expression values of that sample and multiplied by the median of all protein expression values of all samples.
Log2+ CycLoess normalization: the expression values in the expression matrix are first log2-transformed, and the transformed matrix is then normalized by locally weighted regression.
Log2+ Mean normalization: the expression values in the expression matrix are first log2-transformed; each protein expression value of each sample in the transformed matrix is then divided by the mean of all protein expression values of that sample and multiplied by the mean of all protein expression values of all samples.
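The Log2+ Median and Log2+ Mean definitions share one pattern: divide by a per-sample center, then multiply by a global center. The sketch below illustrates that pattern under some assumptions: the matrix is a list of samples with no missing values, the global center is taken over all values of all samples (one reading of the wording above), and the function names are hypothetical.

```python
import math
import statistics


def log2_transform(matrix):
    """Log2-transform every expression value; matrix[i] is one sample."""
    return [[math.log2(v) for v in sample] for sample in matrix]


def scale_by_center(matrix, center):
    """Divide each value by its sample's center, multiply by the global center.

    `center` is statistics.median (Log2+ Median) or statistics.mean
    (Log2+ Mean). Taking the global center over all values of all samples
    is an interpretation of the definitions, not a confirmed implementation.
    A sample whose center is 0 would cause a division by zero.
    """
    all_values = [v for sample in matrix for v in sample]
    global_center = center(all_values)
    return [[v / center(sample) * global_center for v in sample]
            for sample in matrix]


# Illustrative raw intensities for two samples (made-up numbers).
raw = [[2.0, 4.0, 8.0], [4.0, 16.0, 64.0]]
logged = log2_transform(raw)
median_normalized = scale_by_center(logged, statistics.median)
mean_normalized = scale_by_center(logged, statistics.mean)
```

After scaling, every sample has the same median (or mean), which is the purpose of this kind of normalization.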
VSN normalization: because VSN itself performs a transformation similar to the log2 transformation, no prior log2 transformation is needed; variance stabilizing normalization is applied directly to the expression values in the expression matrix.
Log2+ RLR normalization: the expression values in the expression matrix are first log2-transformed, and the transformed matrix is then normalized by robust linear regression.
Log2+ GI normalization: the expression values in the expression matrix are first log2-transformed; each protein expression value of each sample in the transformed matrix is then divided by the sum of all protein expression values of that sample and multiplied by the median, across all samples, of those per-sample sums.
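The Log2+ GI definition can be sketched as follows, following the order of operations stated above (log2 first, then scaling by per-sample sums). The list-of-samples layout, the absence of missing values, and the function name `log2_gi_normalize` are assumptions for illustration.

```python
import math
import statistics


def log2_gi_normalize(matrix):
    """matrix[i] holds the raw expression values of sample i (none missing)."""
    logged = [[math.log2(v) for v in sample] for sample in matrix]
    totals = [sum(sample) for sample in logged]   # per-sample sum
    median_total = statistics.median(totals)      # median of the sums
    return [[v / total * median_total for v in sample]
            for sample, total in zip(logged, totals)]


# Illustrative raw intensities for three samples (made-up numbers).
gi_normalized = log2_gi_normalize(
    [[2.0, 4.0, 8.0], [4.0, 16.0, 64.0], [2.0, 2.0, 4.0]])
```

After this scaling, every sample's values add up to the same total, namely the median of the original per-sample sums.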
Log2+ Quantile normalization: the expression values in the expression matrix are first log2-transformed; each column of the expression matrix is then sorted independently, the sorted matrix is averaged row by row to give a vector of means, and each value of the original matrix is replaced by the mean corresponding to its rank within its column.
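The quantile step described above can be sketched in a few lines of Python. The input is assumed to be already log2-transformed, complete (no missing values), and organized as a list of columns; ties are broken by original position, which is a simplification. The function name is hypothetical.

```python
def quantile_normalize(columns):
    """columns[j] is one column of the (log2-transformed) expression matrix.

    All columns must have equal length and contain no missing values.
    """
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Mean of the values that share the same rank across all columns.
    rank_means = [sum(col[r] for col in sorted_cols) / len(columns)
                  for r in range(n)]
    result = []
    for col in columns:
        # Indices of this column ordered by ascending value.
        order = sorted(range(n), key=lambda i: col[i])
        normalized = [0.0] * n
        for rank, idx in enumerate(order):
            normalized[idx] = rank_means[rank]
        result.append(normalized)
    return result


quantile_demo = quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]])
```

Every column ends up with exactly the same set of values, so the distributions of all columns become identical while each value keeps its rank.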
GA-RF: a random forest model based on a genetic algorithm, i.e., a genetic algorithm (GA) is used to optimize and tune the parameters of a random forest (RF) model.
True positive: the number of positive samples predicted correctly, i.e., samples that are actually positive and are predicted positive.
True negative: the number of negative samples predicted correctly, i.e., samples that are actually negative and are predicted negative.
False positive: the number of samples mispredicted as positive, i.e., samples that are actually negative but are predicted positive.
False negative: the number of samples mispredicted as negative, i.e., samples that are actually positive but are predicted negative.
Sensitivity: also called recall or true positive rate, i.e., the number of correctly predicted positive samples / the total number of actual positive samples, i.e., (true positive)/(true positive + false negative).
Specificity: also called true negative rate, i.e., the number of correctly predicted negative samples / the total number of actual negative samples, i.e., (true negative)/(true negative + false positive).
Precision: also called positive predictive value, i.e., the number of correctly predicted positive samples / the total number of samples predicted positive, i.e., (true positive)/(true positive + false positive).
Accuracy: the number of correctly predicted positive and negative samples / the total number of samples, i.e., (true positive + true negative)/(true positive + true negative + false positive + false negative).
F1 value: F1 = 2 × (precision × recall)/(precision + recall).
False positive rate: the number of samples mispredicted as positive / the total number of actual negative samples, equal to (1 - specificity), i.e., (false positive)/(true negative + false positive).
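The evaluation measures defined above all derive from the four confusion-matrix counts. A minimal sketch follows; the function name and the counts in the example are invented for illustration and are not results from the disclosure.

```python
def screening_metrics(tp, tn, fp, fn):
    """Compute the evaluation measures from confusion-matrix counts.

    tp/tn: correctly predicted positive/negative samples.
    fp/fn: samples wrongly predicted positive/negative.
    """
    sensitivity = tp / (tp + fn)            # recall, true positive rate
    specificity = tn / (tn + fp)            # true negative rate
    precision = tp / (tp + fp)              # positive predictive value
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    false_positive_rate = fp / (tn + fp)    # equals 1 - specificity
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
        "false_positive_rate": false_positive_rate,
    }


# Made-up counts: 50 actual positives and 50 actual negatives.
metrics = screening_metrics(tp=40, tn=45, fp=5, fn=10)
```

With these counts the sensitivity is 40/50 and the specificity is 45/50, and the false positive rate is the complement of the specificity.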
ROC curve and AUC value: criteria for measuring the quality of a classifier. The ROC (receiver operating characteristic) curve plots the false positive rate on the horizontal axis against the true positive rate on the vertical axis; when the plotted curve lies above the line y=x, the area under the ROC curve is the AUC value. The larger the AUC, the better the classification performance of the classifier (e.g., a logistic regression model).
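As an illustration of how an ROC curve and its AUC can be obtained from predicted probabilities, the sketch below sweeps the decision threshold over the observed scores and integrates with the trapezoidal rule. The function names, labels and scores are invented for the example; labels use 1 for positive and 0 for negative, and both classes must be present.

```python
def roc_points(labels, scores):
    """Return ROC points (false positive rate, true positive rate).

    The threshold is swept from the highest score to the lowest; a sample
    is predicted positive when its score is >= the threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up labels and predicted probabilities for six samples.
demo_labels = [1, 1, 0, 1, 0, 0]
demo_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
demo_auc = auc(roc_points(demo_labels, demo_scores))
```

In this example 8 of the 9 positive/negative pairs are ranked correctly, so the AUC equals 8/9.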
In one aspect, the present invention provides a set of serum protein biomarkers for gastric cancer screening that are particularly suitable for screening for early gastric cancer, and also provides logistic regression prediction models constructed from these serum protein biomarkers, which can be used to identify whether a sample is derived from a gastric cancer patient.
The serum protein biomarkers for gastric cancer screening include the 4 proteins LRG, HP2, PDE4DIP and SAA, which are up-regulated in gastric cancer patient samples, and the 12 proteins CA-II, HBG1, HBA1, NKEF-A, HBD, HBB, C1Inh, IGLV8-61, CA-I, PAMP, CAT and TN-X, which are down-regulated in gastric cancer patient samples.
Preferably, the serum protein biomarkers for gastric cancer screening comprise at least one or more serum protein biomarkers selected from CA-II, CAT, LRG and HBG1.
In one aspect, the invention also provides a serum protein biomarker-based detection method using a combination of CA-II, CAT, LRG and HBG1 serum protein biomarkers for gastric cancer screening, comprising the steps of:
(1) Determining the protein content of the CA-II, CAT, LRG and HBG1 serum protein biomarkers, respectively, for the sample;
(2) Filling in missing values: if the content of one or more serum protein biomarkers is missing from the measurement result, missing protein contents of the two serum protein biomarkers CA-II and CAT are filled with the constant 1.0, and missing protein contents of the two serum protein biomarkers LRG and HBG1 are filled using the K-nearest-neighbor method;
(3) Normalizing the protein content data obtained in step (2) using one of the eight normalization methods provided by the invention, and calculating the predicted probability P using the probability prediction formula corresponding to that normalization method. When P is greater than or equal to 0.5, the sample is judged to be a gastric cancer sample; when P is less than 0.5, the sample is judged to be a gastric benign disease sample. The probability prediction formulas for this set of protein biomarkers are as follows, where the name of each protein biomarker denotes its normalized content value:
a) For Log2+CycLoess normalization, the probability prediction formula of the set of markers is P=1/(1+Exp(-(-52.012-2.063*CA-II-0.191*CAT+10.228*LRG-5.335*HBG1)));
b) For Log2+GI normalization, the probability prediction formula of the set of markers is P=1/(1+Exp(-(-125.783-1.230*CA-II-0.155*CAT+14.863*LRG-6.767*HBG1)));
c) For Log2 normalization, the probability prediction formula of the set of markers is P=1/(1+Exp(-(-167.929-1.723*CA-II-0.250*CAT+17.763*LRG-7.100*HBG1)));
d) For Log2+Mean normalization, the probability prediction formula of the set of markers is P=1/(1+Exp(-(-125.397-1.231*CA-II-0.157*CAT+14.830*LRG-6.757*HBG1)));
e) For Log2+Median normalization, the probability prediction formula of the set of markers is as follows
P=1/(1+Exp(-(-146.727-0.874*CA-II-0.155*CAT+18.809*LRG-9.650*HBG1)));
f) For Log2+Quantile normalization, the probability prediction formula of the set of markers is as follows
P=1/(1+Exp(-(-22.228-2.111*CA-II-0.427*CAT+11.865*LRG-7.980*HBG1)));
g) For Log2+RLR normalization, the probability prediction formula of the set of markers is P=1/(1+Exp(-(-100.729-2.142*CA-II-0.181*CAT+13.524*LRG-6.047*HBG1)));
h) For VSN normalization, the probability prediction formula of the set of markers is P=1/(1+Exp(-(-140.903-4.207*CA-II+0.408*CAT+23.456*LRG-12.042*HBG1))).
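The eight formulas above share one logistic form and differ only in their coefficients, so they can be evaluated from a coefficient table. The following is a minimal sketch, not the applicant's implementation: the function and dictionary names are hypothetical, the coefficients are transcribed from formulas a) to h) with all intercepts read as negative (as in the fully legible formulas e) and f)), and the inputs are assumed to be already normalized by the named method.

```python
import math

# (intercept, CA-II, CAT, LRG, HBG1), transcribed from formulas a) to h).
FOUR_MARKER_COEFFICIENTS = {
    "Log2+CycLoess": (-52.012, -2.063, -0.191, 10.228, -5.335),
    "Log2+GI": (-125.783, -1.230, -0.155, 14.863, -6.767),
    "Log2": (-167.929, -1.723, -0.250, 17.763, -7.100),
    "Log2+Mean": (-125.397, -1.231, -0.157, 14.830, -6.757),
    "Log2+Median": (-146.727, -0.874, -0.155, 18.809, -9.650),
    "Log2+Quantile": (-22.228, -2.111, -0.427, 11.865, -7.980),
    "Log2+RLR": (-100.729, -2.142, -0.181, 13.524, -6.047),
    "VSN": (-140.903, -4.207, 0.408, 23.456, -12.042),
}


def predict_probability(method, ca2, cat, lrg, hbg1):
    """Return P = 1/(1+exp(-z)) for normalized marker values."""
    b0, b_ca2, b_cat, b_lrg, b_hbg1 = FOUR_MARKER_COEFFICIENTS[method]
    z = b0 + b_ca2 * ca2 + b_cat * cat + b_lrg * lrg + b_hbg1 * hbg1
    # Numerically stable logistic function: avoids overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    return math.exp(z) / (1.0 + math.exp(z))


def classify(p):
    """Apply the 0.5 threshold stated in step (3)."""
    return "gastric cancer" if p >= 0.5 else "gastric benign disease"
```

A call such as `classify(predict_probability("Log2+Median", ca2, cat, lrg, hbg1))` then reproduces the decision rule of step (3) for one sample.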
A method of detecting a serum protein biomarker for gastric cancer screening using a combination of three serum protein biomarkers of CA-II, CAT, LRG, comprising the steps of:
(1) Separately determining the protein content of the CA-II, CAT, LRG serum protein biomarker in the sample;
(2) Filling in missing values: if the content of one or more serum protein biomarkers is missing from the measurement result, missing protein contents of the two proteins CA-II and CAT are filled with the constant 1.0, and a missing protein content of LRG is filled using the K-nearest-neighbor method;
(3) Normalizing the protein content data obtained in step (2) using one of the eight normalization methods provided by the invention, and calculating the predicted probability P using the probability prediction formula corresponding to that normalization method. When P is greater than or equal to 0.5, the sample is judged to be a gastric cancer sample; when P is less than 0.5, the sample is judged to be a gastric benign disease sample. The probability prediction formulas for this set of protein biomarkers are as follows, where the name of each protein biomarker denotes its normalized content value:
a) For the Log2+CycLoess normalization, the probability prediction formula of the group of markers is as follows
P=1/(1+Exp(-(-56.039-1.591*CA-II-0.420*CAT+4.280*LRG)));
b) For Log2+GI normalization, the probability prediction formula of the set of markers is P=1/(1+Exp(-(-84.164-1.791*CA-II-0.466*CAT+5.880*LRG)));
c) For Log2 normalization, the probability prediction formula of the set of markers is P=1/(1+Exp(-(-92.539-1.489*CA-II-0.489*CAT+6.075*LRG)));
d) For Log2+Mean normalization, the probability prediction formula of the set of markers is P=1/(1+Exp(-(-84.108-1.793*CA-II-0.466*CAT+5.877*LRG)));
e) For Log2+Median normalization, the probability prediction formula of the set of markers is as follows
P=1/(1+Exp(-(-71.069-1.935*CA-II-0.481*CAT+5.345*LRG)));
f) For Log2+Quantile normalization, the probability prediction formula of the set of markers is as follows
P=1/(1+Exp(-(-62.786-1.142*CA-II-0.494*CAT+4.311*LRG)));
g) For Log2+RLR normalization, the probability prediction formula of the set of markers is P=1/(1+Exp(-(-71.148-2.369*CA-II-0.483*CAT+5.687*LRG)));
h) For VSN normalization, the probability prediction formula of the set of markers is P=1/(1+Exp(-(-38.994-2.505*CA-II-0.545*CAT+4.318*LRG))).
In one example, the invention also provides a serum protein biomarker-based detection method using a combination of CA-II and CAT serum protein biomarkers for gastric cancer screening, comprising the steps of:
(1) Determining the protein content of the CA-II and CAT serum protein biomarkers, respectively, for the sample;
(2) Filling up the missing value: if the content of one or more serum protein biomarkers is absent in the measurement result, filling with a constant 1.0 for the absence of protein content of both CA-II and CAT proteins;
(3) Normalizing the protein content data obtained in step (2) using one of the eight normalization methods provided by the invention, and calculating the predicted probability P using the probability prediction formula corresponding to that normalization method. When P is greater than or equal to 0.5, the sample is judged to be a gastric cancer sample; when P is less than 0.5, the sample is judged to be a gastric benign disease sample. The probability prediction formulas for this set of protein biomarkers are as follows, where the name of each protein biomarker denotes its normalized content value:
a) For Log2+CycLoess normalization, the probability prediction formula of the set of markers is P=1/(1+Exp(-(10.807-0.510*CA-II-0.314*CAT)));
b) For Log2+GI normalization, the probability prediction formula of the set of markers is P=1/(1+Exp(-(9.112-0.402*CA-II-0.311*CAT)));
c) For Log2 normalization, the probability prediction formula of the set of markers is P=1/(1+Exp(-(9.845-0.449*CA-II-0.309*CAT)));
d) For Log2+Mean normalization, the probability prediction formula of the set of markers is P=1/(1+Exp(-(9.116-0.402*CA-II-0.311*CAT)));
e) For Log2+Median normalization, the probability prediction formula of the set of markers is P=1/(1+Exp(-(9.309-0.413*CA-II-0.312*CAT)));
f) For Log2+Quantile normalization, the probability prediction formula of the set of markers is P=1/(1+Exp(-(10.104-0.443*CA-II-0.336*CAT)));
g) For Log2+RLR normalization, the probability prediction formula of the set of markers is P=1/(1+Exp(-(11.356-0.538*CA-II-0.318*CAT)));
h) For VSN normalization, the probability prediction formula of the set of markers is P=1/(1+Exp(-(44.877-1.990*CA-II-0.941*CAT))).
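Because the logistic function is monotonic, the threshold P ≥ 0.5 used in step (3) is equivalent to the linear predictor being non-negative. As a worked illustration for the Log2-normalized CA-II and CAT model (formula c), and assuming, which the text does not state explicitly, that Log2 normalization maps a constant-filled value of 1.0 to log2(1.0) = 0:

```latex
P = \frac{1}{1+e^{-z}} \ge 0.5
\;\iff\;
z = 9.845 - 0.449\,\mathrm{CA\text{-}II} - 0.309\,\mathrm{CAT} \ge 0

\mathrm{CA\text{-}II} = \mathrm{CAT} = \log_2(1.0) = 0
\;\Rightarrow\;
z = 9.845, \qquad
P = \frac{1}{1+e^{-9.845}} \approx 0.9999
```

Under that assumption, a sample in which both markers are missing and constant-filled is scored as gastric cancer by this model.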
In one example, the invention also provides a serum protein biomarker-based detection method using CAT and LRG serum protein biomarkers in combination for gastric cancer screening, comprising the steps of:
(1) Determining the protein content of the CAT and LRG serum protein biomarkers, respectively, for the sample;
(2) Filling up the missing value: if the content of one or more serum protein biomarkers is missing in the measurement result, filling the protein content of CAT with a constant of 1.0, and filling the protein content of LRG with a K nearest neighbor method;
(3) Normalizing the protein content data obtained in step (2) using one of the eight normalization methods provided by the invention, and calculating the predicted probability P using the probability prediction formula corresponding to that normalization method. When P is greater than or equal to 0.5, the sample is judged to be a gastric cancer sample; when P is less than 0.5, the sample is judged to be a gastric benign disease sample. The probability prediction formulas for this set of protein biomarkers are as follows, where the name of each protein biomarker denotes its normalized content value:
a) For Log2+CycLoess normalization, the probability prediction formula of the set of markers is P=1/(1+Exp(-(-56.039-0.420*CAT+4.280*LRG)));
b) For Log2+GI normalization, the probability prediction formula of the set of markers is P=1/(1+Exp(-(-84.164-0.466*CAT+5.880*LRG)));
c) For Log2 normalization, the probability prediction formula of the set of markers is P=1/(1+Exp(-(-92.539-0.489*CAT+6.075*LRG)));
d) For Log2+Mean normalization, the probability prediction formula of the set of markers is P=1/(1+Exp(-(-84.108-0.466*CAT+5.877*LRG)));
e) For Log2+Median normalization, the probability prediction formula of the set of markers is P=1/(1+Exp(-(-71.069-0.481*CAT+5.345*LRG)));
f) For Log2+Quantile normalization, the probability prediction formula of the set of markers is P=1/(1+Exp(-(-62.786-0.494*CAT+4.311*LRG)));
g) For Log2+RLR normalization, the probability prediction formula of the set of markers is P=1/(1+Exp(-(-71.148-0.483*CAT+5.687*LRG)));
h) For VSN normalization, the probability prediction formula of the set of markers is P=1/(1+Exp(-(-38.994-0.545*CAT+4.318*LRG))).
In one example, the invention also provides a serum protein biomarker-based detection method using a combination of CA-II and LRG serum protein biomarkers for gastric cancer screening, comprising the steps of:
(1) Determining the protein content of the CA-II and LRG serum protein biomarkers, respectively, for the sample;
(2) Filling up the missing value: if the content of one or more serum protein biomarkers is absent in the measurement result, filling with a constant of 1.0 for the absence of the protein content of CA-II and filling with a K-nearest neighbor method for the absence of the protein content of LRG;
(3) Normalizing the protein content data obtained in step (2) using one of the eight normalization methods provided by the invention, and calculating the predicted probability P using the probability prediction formula corresponding to that normalization method. When P is greater than or equal to 0.5, the sample is judged to be a gastric cancer sample; when P is less than 0.5, the sample is judged to be a gastric benign disease sample. The probability prediction formulas for this set of protein biomarkers are as follows, where the name of each protein biomarker denotes its normalized content value:
a) For Log2+CycLoess normalization, the probability prediction formula of the set of markers is P=1/(1+Exp(-(-64.187-1.124*CA-II+4.063*LRG)));
b) For Log2+GI normalization, the probability prediction formula of the set of markers is P=1/(1+Exp(-(-73.508-1.365*CA-II+4.720*LRG)));
c) For Log2 normalization, the probability prediction formula of the set of markers is P=1/(1+Exp(-(-69.978-1.574*CA-II+4.702*LRG)));
d) For Log2+Mean normalization, the probability prediction formula of the set of markers is P=1/(1+Exp(-(-73.034-1.379*CA-II+4.704*LRG)));
e) For Log2+Median normalization, the probability prediction formula of the set of markers is P=1/(1+Exp(-(-65.803-1.325*CA-II+4.297*LRG)));
f) For Log2+Quantile normalization, the probability prediction formula of the set of markers is P=1/(1+Exp(-(-59.740-0.628*CA-II+3.446*LRG)));
g) For Log2+RLR normalization, the probability prediction formula of the set of markers is P=1/(1+Exp(-(-70.260-1.473*CA-II+4.637*LRG)));
h) For VSN normalization, the probability prediction formula of the set of markers is P=1/(1+Exp(-(-48.060-2.711*CA-II+4.534*LRG))).
In one aspect, the invention provides a method of determining whether a sample is derived from a gastric cancer patient.
For convenience of description, a schematic diagram of a method of determining whether a sample is derived from a gastric cancer patient according to the present invention is shown in fig. 1. As shown in fig. 1, the method includes:
(1) Determination of serum protein biomarker content: measuring the content of the related protein in the sample so as to obtain the expression quantity information of the related protein biomarker of the sample;
(2) Filling in missing values: if the content of one or more serum protein biomarkers is missing from the measurement result, missing protein contents of the two proteins CA-II and CAT are filled with the constant 1.0, and missing contents of all other proteins are filled using the K-nearest-neighbor method;
(3) Data normalization: normalizing the protein expression data using one of eight normalization methods (Log2, Log2+Median, Log2+Mean, VSN, Log2+RLR, Log2+GI, Log2+Quantile and Log2+CycLoess);
(4) Calculating the probability that the sample is derived from a gastric cancer patient: the probability that the sample to be tested is derived from a gastric cancer patient is determined from the normalized data obtained in step (3) using the probability prediction formula corresponding to the normalization method. With the method provided by the invention, whether a sample to be tested is derived from a gastric cancer patient can be judged with high specificity and high sensitivity using only the prediction model provided by the invention and the content of the relevant protein biomarkers in the sample, without relying on other sample information of the patient, thereby indicating a gastric cancer screening result.
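Steps (2) and (3) above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the applicant's procedure: the text does not specify the number of neighbors, the distance measure, or the reference samples for the K-nearest-neighbor filling, so the sketch assumes a root-mean-square distance over the proteins quantified in both samples and a hypothetical default of k=3; only the plain Log2 transform is shown for normalization, and all function names and intensity values are invented for the example.

```python
import math

CONSTANT_FILLED = {"CA-II", "CAT"}  # filled with the constant 1.0 per step (2)


def knn_fill(samples, target, protein, k=3):
    """Mean of `protein` over the k samples most similar to `target` (assumed scheme)."""
    candidates = []
    for other in samples:
        if other is target or other.get(protein) is None:
            continue
        shared = [p for p in target
                  if p != protein and target[p] is not None and other.get(p) is not None]
        if not shared:
            continue
        dist = math.sqrt(sum((target[p] - other[p]) ** 2 for p in shared) / len(shared))
        candidates.append((dist, other[protein]))
    candidates.sort(key=lambda item: item[0])
    nearest = [value for _, value in candidates[:k]]
    return sum(nearest) / len(nearest) if nearest else None


def fill_missing(samples, k=3):
    """Constant 1.0 for CA-II and CAT, K-nearest-neighbor filling for other proteins."""
    filled = []
    for sample in samples:
        row = dict(sample)
        for protein, value in sample.items():
            if value is None:
                if protein in CONSTANT_FILLED:
                    row[protein] = 1.0
                else:
                    row[protein] = knn_fill(samples, sample, protein, k)
        filled.append(row)
    return filled


def log2_normalize(row):
    """Plain Log2 normalization of one filled sample."""
    return {protein: math.log2(value) for protein, value in row.items()}


# Hypothetical intensities; None marks a missing quantitative value.
samples = [
    {"CA-II": 1024.0, "CAT": 512.0, "LRG": 4096.0},
    {"CA-II": None, "CAT": None, "LRG": 8192.0},
    {"CA-II": 2048.0, "CAT": 256.0, "LRG": None},
    {"CA-II": 1024.0, "CAT": 256.0, "LRG": 2048.0},
]
filled = fill_missing(samples, k=1)
normalized = [log2_normalize(row) for row in filled]
```

After filling, the normalized values of a sample would be passed to the probability prediction formula that matches the chosen normalization method.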
The invention is further illustrated below with reference to examples.
Examples
The described embodiments of the present invention should not be construed as limiting the present invention, and all other embodiments obtained by a person of ordinary skill in the art without inventive effort fall within the scope of the present invention. The referenced drawings are exemplary only, serve to illustrate the invention, and are not to be construed as limiting the invention.
Example 1 screening of protein biomarkers
For convenience of description, a schematic diagram of the protein biomarker screening procedure according to the present invention is shown in fig. 2:
(I) Mass spectrometry and Swiss-Prot database retrieval:
By performing mass spectrometry followed by a Swiss-Prot database search, we obtained protein expression profiles for samples of known origin. The samples of known origin consist of a known number of gastric benign disease samples and a known number of gastric cancer samples. The procedure is as follows:
sample source
40 serum samples from a non-gastric-cancer population and 40 serum samples from patients with stage I-IV gastric cancer were obtained from Shanghai Zhongshan Hospital as the samples of the present invention. Sample collection followed the ethical standards set by the ethics committee of Shanghai Zhongshan Hospital, and informed consent forms were signed.
Sample processing
Taking the above 80 samples as high-abundance and high-abundance mixed samples and original samples respectively, taking appropriate amount of protein for SDS-PAGE electrophoresis test, and evaluating the consistency among the samples. The purpose of this is to exclude samples that may be contaminated. After evaluation, none of the samples was considered contaminated.
Protein reductive alkylation and enzymatic digestion: 35 μL of UA buffer (8 M urea, 150 mM Tris-HCl, pH 8.0) was added and mixed; DTT was added to a final concentration of 20 mM and the mixture was reacted at 37 °C for 2 h; after returning to room temperature, IAA was added to a final concentration of 25 mM (50 mM IAA in UA), and the mixture was shaken at 600 rpm for 1 min and kept in the dark at room temperature for 30 min; 150 μL of NH4HCO3 buffer (50 mM) was added, followed by 2 μg of Lys-C for 4 h of digestion, and finally 4 μg of trypsin was added and incubated at 37 °C for 16 h.
The peptides were desalted on a C18 column, and the peptide concentration was determined at OD280. Then, 2 μg of each sample was taken and mixed with an appropriate amount of iRT standard peptide for detection by the LC-MS/MS DDA method and the LC-MS/MS DIA method.
Mass spectrometry analysis
The proteins contained in the samples were analyzed by nanoflow liquid chromatography coupled to a Q Exactive HF mass spectrometer.
First, chromatographic separation was performed using the nanoflow HPLC system Easy nLC-1200. Buffers: solution A is 0.1% formic acid in water, and solution B is 0.1% formic acid in aqueous acetonitrile. The column was equilibrated with 95% solution A. After loading onto the trap column, the sample was separated by gradient elution on a 50 cm tip column at a flow rate of 250 nL/min. The liquid-phase separation gradient was as follows: 0-80 minutes, linear gradient of solution B from 8% to 30%; 70-100 minutes, linear gradient of solution B from 30% to 100%; 80-120 minutes, solution B raised to 100% and maintained.
The chromatographically separated samples were then analyzed by DDA scanning on a Q Exactive HF mass spectrometer (Thermo Scientific). Ionization mode: ESI positive ion. MS1 scan range: 300-1800 m/z; mass resolution: 60,000 (@ m/z 200); AGC target: 3e6; maximum IT: 50 ms. After each MS1 scan (full MS scan), 20 ddMS2 scans (MS2 scans) were acquired according to the inclusion list; isolation window: 1.6 Th; mass resolution: 30,000 (@ m/z 200); AGC target: 3e6; maximum IT: 120 ms; MS2 activation type: HCD; normalized collision energy: 27.
The chromatographically separated samples were also analyzed by DIA scanning. Ionization mode: positive ion. MS1 scan range: 350-1650 m/z; mass resolution: 120,000 (@ m/z 200); AGC target: 3e6; maximum IT: 50 ms. MS2 used the DIA data acquisition mode with 30 DIA acquisition windows; mass resolution: 30,000 (@ m/z 200); AGC target: 3e6; maximum IT: auto; MS2 activation type: HCD; normalized collision energy: 30; spectral data type: profile.
Swiss-Prot database retrieval
The acquired mass spectrometry data were searched against the Swiss-Prot database using MaxQuant software (MaxQuant 1.5.3.17); proteins were identified with the other parameters left at the software defaults, yielding a protein expression matrix.
(II) Treatment of proteins with abnormal expression levels: proteins with abnormal expression levels, such as proteins whose expression level is missing in all samples or present in only one sample, were removed; in this example, two abnormal proteins were removed.
(III) Protein filtering based on application effectiveness: owing to technical limitations, quantitative values of protein biomarkers may be missing at random; to improve application effectiveness, proteins with quantitative values in more than 90% of the gastric benign disease samples or in more than 90% of the gastric cancer samples were selected for subsequent protein biomarker screening;
(IV) Identification of missing values: identifying whether each missing value in the protein expression matrix is missing at random or missing not at random;
(V) Filling of missing values: protein quantitative values that are missing not at random were filled with the constant 1.0, and protein quantitative values that are missing at random were calculated and filled using the K-nearest-neighbor method;
(VI) Differential expression protein analysis: after processing through step (V), t-test-based differential expression analysis was performed between the expression levels of the candidate protein biomarkers in the gastric benign disease samples and those in the gastric cancer samples (the results are shown in Table 1 and fig. 3), yielding the first, preliminarily screened protein biomarker set (1). The set contains 16 proteins in total, of which the 4 proteins LRG, HP2, PDE4DIP and SAA were up-regulated in the gastric cancer patient samples, while the 12 proteins CA-II, HBG1, HBA1, NKEF-A, HBD, HBB, C1Inh, IGLV8-61, CA-I, PAMP, CAT and TN-X were down-regulated in the gastric cancer patient samples. First, k-means clustering with k=2 was performed on the differential proteins; the protein data were then combined and the samples were clustered. As shown in the clustering heat map of the t-test-based differentially expressed proteins in fig. 4, most gastric cancer samples and gastric benign disease samples are correctly clustered by the differential proteins.
TABLE 1 analysis of t-test based differentially expressed proteins
Protein name Protein ID P-value LogFC Change
CA-II P00918 1.23E-07 -1.32 DOWN
HBG1 P69891 1.46E-06 -1.16 DOWN
HBA1 P69905 6.21E-06 -0.99 DOWN
NKEF-A Q06830 3.40E-05 -0.75 DOWN
HBD P02042 3.57E-05 -1.03 DOWN
LRG P02750 9.79E-05 0.64 UP
HBB P68871 1.02E-04 -1.14 DOWN
C1Inh P05155 1.05E-04 -0.88 DOWN
IGLV8-61 A0A075B6I0 7.91E-04 -0.73 DOWN
CA-I P00915 1.36E-03 -0.80 DOWN
HP2 P59665 1.75E-03 0.65 UP
PAMP Q96TA2 2.19E-03 -0.89 DOWN
PDE4DIP Q5VU43 4.41E-03 0.63 UP
CAT P04040 9.94E-03 -2.43 DOWN
SAA P0DJI8 1.57E-02 2.09 UP
TN-X P22105 2.23E-02 -0.67 DOWN
(VII) Data normalization: preliminary analysis by the inventors showed that the protein biomarker combinations obtained by screening may differ considerably between normalization methods. In view of practical applicability, the inventors used eight normalization methods (Log2, Log2+Median, Log2+Mean, VSN, Log2+RLR, Log2+GI, Log2+Quantile and Log2+CycLoess) to normalize the protein expression matrix processed through steps (I) to (V), and performed protein biomarker screening with a genetic-algorithm-based random forest model separately on each of the eight normalized protein expression matrices;
(VIII) Scoring and screening of candidate protein biomarkers with a genetic-algorithm-based random forest model (GA-RF): as a heuristic search algorithm, the genetic algorithm gives protein biomarkers with a stronger classification effect more opportunities to be evaluated, which enhances the stability of the random forest evaluation results; at the same time, the mutation step of the genetic algorithm also gives protein biomarkers with a weaker classification effect the opportunity to be evaluated. To ensure the stability of protein biomarker selection, screening was carried out over 150 random resamplings that extracted different protein subsets, and each resampling was validated using 10-fold cross-validation repeated 5 times. The 5 proteins with the highest frequency of occurrence across all resamplings were finally selected. GA-RF screening after each normalization of the protein expression matrix gave the following sets of 5 proteins:
Log2+CycLoess normalization: CAT, LRG, CA-II, STXBP5 and C9, set (2);
Log2+GI normalization: CAT, B2GPI, HBG1, IGLC3 and CA-II, set (3);
Log2 normalization: CAT, CA-II, C9, LRG and IGLC3, set (4);
Log2+Mean normalization: CAT, CA-II, LRG, HBG1 and CA-I, set (5);
Log2+Median normalization: CAT, ITI-HC4, CA-II, LRG and HBG1, set (6);
Log2+Quantile normalization: CAT, IGLC3, CA-II, CLH-17 and B2GPI, set (7);
Log2+RLR normalization: CAT, CA-II, LRG, B2GPI and HBD, set (8);
VSN normalization: LRG, CA-II, IGLC3, HBG1 and APOC-III, set (9).
(IX) Counting the frequency of protein occurrence across all sets: to screen out the most discriminative, central and important protein biomarkers for gastric cancer, the inventors counted how often each protein in the union of the above 9 sets, (1) to (9), occurs across those 9 sets, and ranked the proteins by frequency from high to low. The top 5 proteins and their frequencies are: CA-II (9 times), CAT (8 times), LRG (7 times), HBG1 (5 times), IGLC3 (4 times);
(X) Screening proteins according to frequency: preferably, the proteins occurring 5 or more times in step (IX), namely CA-II, CAT, LRG and HBG1, are used for subsequent logistic regression modeling.
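The frequency count and the selection threshold of the last two steps can be checked directly from the sets listed above. The sketch below transcribes set (1) from Table 1 and sets (2) to (9) from the GA-RF results; the variable names are hypothetical.

```python
from collections import Counter

# Set (1) is the t-test set from Table 1; sets (2)-(9) are the GA-RF sets.
protein_sets = {
    1: ["CA-II", "HBG1", "HBA1", "NKEF-A", "HBD", "LRG", "HBB", "C1Inh",
        "IGLV8-61", "CA-I", "HP2", "PAMP", "PDE4DIP", "CAT", "SAA", "TN-X"],
    2: ["CAT", "LRG", "CA-II", "STXBP5", "C9"],         # Log2+CycLoess
    3: ["CAT", "B2GPI", "HBG1", "IGLC3", "CA-II"],      # Log2+GI
    4: ["CAT", "CA-II", "C9", "LRG", "IGLC3"],          # Log2
    5: ["CAT", "CA-II", "LRG", "HBG1", "CA-I"],         # Log2+Mean
    6: ["CAT", "ITI-HC4", "CA-II", "LRG", "HBG1"],      # Log2+Median
    7: ["CAT", "IGLC3", "CA-II", "CLH-17", "B2GPI"],    # Log2+Quantile
    8: ["CAT", "CA-II", "LRG", "B2GPI", "HBD"],         # Log2+RLR
    9: ["LRG", "CA-II", "IGLC3", "HBG1", "APOC-III"],   # VSN
}

# Count in how many sets each protein occurs.
counts = Counter(p for members in protein_sets.values() for p in members)
top5 = counts.most_common(5)

# Keep proteins that occur in 5 or more sets for logistic regression modeling.
selected = sorted(p for p, n in counts.items() if n >= 5)
```

With these sets, the five most frequent proteins and the four proteins that pass the threshold of 5 agree with the frequencies reported in the text.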
Further, the identification of missing values in step (IV) and the filling of missing values in step (V) specifically comprise:
a) Constructing a missing value matrix: positions in the protein expression matrix at which qualitative and quantitative protein information is present are filled with 1, and positions at which it is missing are filled with 0.
b) Constructing a phenotype data sequence: all gastric benign disease samples are defined as 0 and all gastric cancer samples are defined as 1.
c) Calculating, in the missing value matrix, the Pearson correlation coefficient between the data sequence formed by the values of each protein across all samples and the phenotype data sequence of all samples. For proteins with a calculated correlation coefficient ≥ 0.4, whether the quantitative value is missing depends on the phenotype of the sample. Two proteins meet this condition, CA-II and CAT, whose Pearson correlation coefficients with the phenotype data sequence are 0.55 and 0.52, respectively.
d) For protein CA-II, the quantitative signal intensity value is missing in 22 of the 40 gastric cancer samples but in only 2 of the 40 gastric benign disease samples, so missing quantitative signal intensity of CA-II occurs mainly in gastric cancer samples. For protein CAT, the quantitative signal intensity value is missing in 24 of the 40 gastric cancer samples but in only 4 of the 40 gastric benign disease samples, so missing quantitative signal intensity of CAT also occurs mainly in the gastric cancer group. A t-test on the missingness of CA-II in the gastric cancer group and the gastric benign disease group showed a significant difference between the two groups, with a p-value of 2.37E-07 (p-value < 0.01). For CAT, a t-test on the missingness in the gastric cancer group and the gastric benign disease group likewise showed a significant difference, with a p-value of 7.64E-06 (p-value < 0.01). Since whether the quantitative signal intensity values of these two proteins are missing depends on the phenotype of the sample, and the missing values occur mainly in the gastric cancer group, missing quantitative signal intensity of these two proteins in the gastric cancer group is regarded as missing not at random, and missing quantitative signal intensity in the gastric benign disease group is regarded as missing at random.
e) For proteins other than CA-II and CAT, missingness did not differ between the gastric benign disease group and the gastric cancer group, and any missing quantitative signal intensity in any sample was regarded as missing at random.
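The correlation test in items a) to c) can be reproduced from the missing-value counts given in item d). The sketch below builds the presence indicator (1 = quantified, 0 = missing) and the phenotype sequence (1 = gastric cancer, 0 = gastric benign disease) for 40 + 40 samples; the helper names are hypothetical. With this coding the coefficient comes out negative, because values are missing more often in the cancer group, so the sketch assumes that the reported values and the 0.4 threshold refer to the magnitude of the coefficient.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def presence_sequence(missing_in_cancer, missing_in_benign, group_size=40):
    """Presence indicator: cancer samples first, then benign samples."""
    cancer = [0] * missing_in_cancer + [1] * (group_size - missing_in_cancer)
    benign = [0] * missing_in_benign + [1] * (group_size - missing_in_benign)
    return cancer + benign


phenotype = [1] * 40 + [0] * 40

# Missing-value counts taken from item d).
r_ca2 = pearson(presence_sequence(22, 2), phenotype)
r_cat = pearson(presence_sequence(24, 4), phenotype)

# Assumed reading: the threshold applies to the magnitude of the correlation.
not_at_random = {name: abs(r) >= 0.4 for name, r in (("CA-II", r_ca2), ("CAT", r_cat))}
```

Rounded to two decimals, the magnitudes are 0.55 for CA-II and 0.52 for CAT, matching the coefficients reported in item c).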
Example 2 Establishment and validation of the logistic regression prediction model
(1) Performing mass spectrometry-based proteome content determination and Swiss-Prot search on samples of known sources consisting of a known number of samples of gastric benign disease and a known number of samples of gastric cancer to obtain a protein expression amount matrix of the samples of known sources;
(2) The protein expression matrix is normalized using eight normalization methods (Log2, Log2+Median, Log2+Mean, VSN, Log2+RLR, Log2+GI, Log2+Quantile and Log2+CycLoess);
(3) Dividing all samples into a training set and a testing set according to the ratio of 8:2;
(4) Using the training set sample data, logistic regression model training, 5-fold cross-validation and bootstrap internal validation are carried out separately on the protein expression matrices obtained with the eight normalization methods, as follows:
[1] Models were constructed on the training set using the four protein biomarkers CA-II, CAT, LRG and HBG1, yielding 8 four-factor logistic regression prediction models. The bootstrap method tested the stability and performance of the trained models by drawing 1000 random samples with replacement from the training set. The mean and standard deviation of the AUC for the eight normalization methods were: Log2+CycLoess, 0.995±0.005; Log2+GI, 0.998±0.003; Log2, 0.998±0.003; Log2+Mean, 0.998±0.003; Log2+Median, 0.999±0.002; Log2+Quantile, 0.995±0.005; Log2+RLR, 0.997±0.004; VSN, 0.998±0.003. The cross-validation results of the models are shown in fig. 5. The probability prediction formulas obtained by logistic regression modeling of the protein biomarkers CA-II, CAT, LRG and HBG1 after processing by the different normalization methods are shown in Table 2.
TABLE 2 probability prediction formulas obtained by logistic regression modeling of protein biomarkers CA-II, CAT, LRG and HBG1 after treatment by different normalization methods
[2] Models were constructed on the training set using the three protein biomarkers CA-II, CAT and LRG, yielding 8 three-factor logistic regression prediction models. The bootstrap method tested the stability and performance of the trained models by drawing 1000 random samples with replacement from the training set. The mean and standard deviation of the AUC for the eight normalization methods were: Log2+CycLoess, 0.97±0.02; Log2+GI, 0.98±0.01; Log2, 0.98±0.01; Log2+Mean, 0.98±0.01; Log2+Median, 0.98±0.02; Log2+Quantile, 0.98±0.02; Log2+RLR, 0.98±0.01; VSN, 0.95±0.03. The cross-validation results of the models are shown in fig. 6. The probability prediction formulas obtained by logistic regression modeling of the protein biomarkers CA-II, CAT and LRG after processing by the different normalization methods are shown in Table 3.
TABLE 3 probability prediction formulas obtained by logistic regression modeling of protein biomarkers CA-II, CAT and LRG after treatment by different normalization methods
Figure BDA0003343682160000221
[3] Models were constructed on the training set using the two protein biomarkers CAT and LRG, yielding eight two-factor logistic regression prediction models, one under each normalization method. The stability and performance of the training models were tested by the bootstrap method, with 1000 rounds of random sampling with replacement from the training set. The mean ± standard deviation of the AUC under the eight normalization methods was: Log2+CycLoess, 0.96±0.02; Log2+GI, 0.97±0.02; Log2, 0.97±0.02; Log2+Mean, 0.97±0.02; Log2+Median, 0.97±0.02; Log2+Quantile, 0.96±0.02; Log2+RLR, 0.96±0.02; VSN, 0.90±0.04. The model cross-validation results are shown in Fig. 7. The probability prediction formulas obtained by logistic regression modeling of CAT and LRG under the different normalization methods are shown in Table 4.
Table 4. Probability prediction formulas obtained by logistic regression modeling of the protein biomarkers CAT and LRG under the different normalization methods
[Table 4: image BDA0003343682160000231 in the original publication]
[4] Models were constructed on the training set using the two protein biomarkers CA-II and CAT, yielding eight two-factor logistic regression prediction models, one under each normalization method. The stability and performance of the training models were tested by the bootstrap method, with 1000 rounds of random sampling with replacement from the training set. The mean ± standard deviation of the AUC under the eight normalization methods was: Log2+CycLoess, 0.87±0.05; Log2+GI, 0.88±0.05; Log2, 0.88±0.05; Log2+Mean, 0.88±0.05; Log2+Median, 0.88±0.05; Log2+Quantile, 0.88±0.05; Log2+RLR, 0.88±0.05; VSN, 0.86±0.05. The model cross-validation results are shown in Fig. 8. The probability prediction formulas obtained by logistic regression modeling of CA-II and CAT under the different normalization methods are shown in Table 5.
Table 5. Probability prediction formulas obtained by logistic regression modeling of the protein biomarkers CA-II and CAT under the different normalization methods
[Table 5: image BDA0003343682160000232 in the original publication]
[5] Models were constructed on the training set using the two protein biomarkers CA-II and LRG, yielding eight two-factor logistic regression prediction models, one under each normalization method. The stability and performance of the training models were tested by the bootstrap method, with 1000 rounds of random sampling with replacement from the training set. The mean ± standard deviation of the AUC under the eight normalization methods was: Log2+CycLoess, 0.95±0.03; Log2+GI, 0.95±0.03; Log2, 0.95±0.03; Log2+Mean, 0.95±0.03; Log2+Median, 0.94±0.03; Log2+Quantile, 0.93±0.04; Log2+RLR, 0.95±0.03; VSN, 0.95±0.03. The model cross-validation results are shown in Fig. 9. The probability prediction formulas obtained by logistic regression modeling of CA-II and LRG under the different normalization methods are shown in Table 6.
Table 6. Probability prediction formulas obtained by logistic regression modeling of the protein biomarkers CA-II and LRG under the different normalization methods
[Table 6: image BDA0003343682160000241 in the original publication]
(5) The prediction performance of the trained models was evaluated on the test set samples, as follows:
[1] The test-set prediction results of the eight logistic regression prediction models built from the four protein biomarkers CA-II, CAT, LRG and HBG1 under the eight normalization methods are shown in Table 7, and the ROC curves are shown in Fig. 10.
Table 7. Test-set prediction performance of the logistic regression models built from the protein biomarkers CA-II, CAT, LRG and HBG1 under the different normalization methods
[Table 7: image BDA0003343682160000251 in the original publication]
[2] The test-set prediction results of the eight logistic regression prediction models built from the three protein biomarkers CA-II, CAT and LRG under the eight normalization methods are shown in Table 8, and the ROC curves are shown in Fig. 11.
Table 8. Test-set prediction performance of the logistic regression models built from the protein biomarkers CA-II, CAT and LRG under the different normalization methods
[Table 8: image BDA0003343682160000252 in the original publication]
[3] The test-set prediction results of the eight logistic regression prediction models built from the two protein biomarkers CAT and LRG under the eight normalization methods are shown in Table 9, and the ROC curves are shown in Fig. 12.
Table 9. Test-set prediction performance of the logistic regression models built from the protein biomarkers CAT and LRG under the different normalization methods
[Table 9: image BDA0003343682160000253 in the original publication]
[4] The test-set prediction results of the eight logistic regression prediction models built from the two protein biomarkers CA-II and CAT under the eight normalization methods are shown in Table 10, and the ROC curves are shown in Fig. 13.
Table 10. Test-set prediction performance of the logistic regression models built from the protein biomarkers CA-II and CAT under the different normalization methods
[Table 10: image BDA0003343682160000261 in the original publication]
[5] The test-set prediction results of the eight logistic regression prediction models built from the two protein biomarkers CA-II and LRG under the eight normalization methods are shown in Table 11, and the ROC curves are shown in Fig. 14.
Table 11. Test-set prediction performance of the logistic regression models built from the protein biomarkers CA-II and LRG under the different normalization methods
[Table 11: image BDA0003343682160000262 in the original publication]
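As an illustration of how a trained model could be scored on a test set, the sketch below applies the 0.5 decision threshold stated in the claims and tallies a confusion matrix. Tables 7 to 11 are images in the original, so the metrics chosen here (sensitivity, specificity, accuracy) are an assumption about what such an evaluation typically reports, not a reading of those tables; `confusion_metrics` is a hypothetical helper name.

```python
def classify(p, threshold=0.5):
    """Return 1 (gastric cancer) when P >= threshold, else 0 (benign)."""
    return 1 if p >= threshold else 0


def confusion_metrics(labels, probs, threshold=0.5):
    """Sensitivity, specificity and accuracy at a fixed threshold.

    labels: true classes, 1 = gastric cancer, 0 = benign.
    probs:  predicted probabilities P from a logistic regression model.
    Assumes both classes are present in `labels`.
    """
    preds = [classify(p, threshold) for p in probs]
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, yh in pairs if y == 1 and yh == 1)
    tn = sum(1 for y, yh in pairs if y == 0 and yh == 0)
    fp = sum(1 for y, yh in pairs if y == 0 and yh == 1)
    fn = sum(1 for y, yh in pairs if y == 1 and yh == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(labels),
    }
```

Note that a probability of exactly 0.5 is classified as gastric cancer, matching the "P ≥ 0.5" rule in the claims.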

Claims (12)

1. A serum protein biomarker for gastric cancer screening, characterized in that the serum protein biomarker is one or more selected from the group consisting of CA-II, HBG1, HBA1, NKEF-A, HBD, HBB, C1Inh, IGLV8-61, CA-I, PAMP, CAT and TN-X.
2. The serum protein biomarker for gastric cancer screening according to claim 1, characterized in that the serum protein biomarker is one or more selected from CA-II, CAT, LRG and HBG1.
3. The serum protein biomarker for gastric cancer screening according to claim 2, characterized in that the serum protein biomarkers are CA-II, CAT, LRG and HBG1.
4. The serum protein biomarker for gastric cancer screening according to claim 2, characterized in that the serum protein biomarkers are CA-II, CAT and LRG.
5. The serum protein biomarker for gastric cancer screening according to claim 2, characterized in that the serum protein biomarkers are CA-II and CAT.
6. The serum protein biomarker for gastric cancer screening according to claim 2, characterized in that the serum protein biomarkers are CAT and LRG.
7. The serum protein biomarker for gastric cancer screening according to claim 2, characterized in that the serum protein biomarkers are CA-II and LRG.
8. A detection method using the serum protein biomarker for gastric cancer screening according to claim 3, comprising the steps of:
a) determining, for the sample, the protein content of each of the serum protein biomarkers CA-II, CAT, LRG and HBG1;
b) imputing missing values: if the content of one or more serum protein biomarkers is missing from the measurement result, a missing CA-II or CAT protein content is filled with the constant 1.0, and a missing LRG or HBG1 protein content is filled by the K-nearest-neighbor method;
c) normalizing the protein content data obtained in step b) by one of the eight normalization methods provided by the invention, and calculating the predicted probability P with the probability prediction formula corresponding to that normalization method. When P ≥ 0.5, the sample is judged to be a gastric cancer sample; when P < 0.5, it is judged to be a gastric benign sample. The probability prediction formulas for this set of protein biomarkers are as follows, where each protein biomarker name denotes the normalized content value of that biomarker:
i. for Log2+CycLoess normalization: P = 1/(1 + exp(-(-52.012 - 2.063×CA-II - 0.191×CAT + 10.228×LRG - 5.335×HBG1)));
ii. for Log2+GI normalization: P = 1/(1 + exp(-(-125.783 - 1.230×CA-II - 0.155×CAT + 14.863×LRG - 6.767×HBG1)));
iii. for Log2 normalization: P = 1/(1 + exp(-(-167.929 - 1.723×CA-II - 0.250×CAT + 17.763×LRG - 7.100×HBG1)));
iv. for Log2+Mean normalization: P = 1/(1 + exp(-(-125.397 - 1.231×CA-II - 0.157×CAT + 14.830×LRG - 6.757×HBG1)));
v. for Log2+Median normalization: P = 1/(1 + exp(-(-146.727 - 0.874×CA-II - 0.155×CAT + 18.809×LRG - 9.650×HBG1)));
vi. for Log2+Quantile normalization: P = 1/(1 + exp(-(-22.228 - 2.111×CA-II - 0.427×CAT + 11.865×LRG - 7.980×HBG1)));
vii. for Log2+RLR normalization: P = 1/(1 + exp(-(-100.729 - 2.142×CA-II - 0.181×CAT + 13.524×LRG - 6.047×HBG1)));
viii. for VSN normalization: P = 1/(1 + exp(-(-140.903 - 4.207×CA-II + 0.408×CAT + 23.456×LRG - 12.042×HBG1))).
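The eight four-factor formulas of claim 8 share one structure, P = 1/(1 + exp(-z)) with z a linear combination of the normalized values, so they can be evaluated from a small coefficient table. The sketch below is illustrative only: the coefficients are transcribed from the claim read in that form, while the dictionary layout, the function names and the example inputs are assumptions of this sketch.

```python
import math

# (intercept, CA-II, CAT, LRG, HBG1) for each normalization method,
# transcribed from the eight formulas of claim 8.
COEFFS = {
    "Log2+CycLoess": (-52.012, -2.063, -0.191, 10.228, -5.335),
    "Log2+GI": (-125.783, -1.230, -0.155, 14.863, -6.767),
    "Log2": (-167.929, -1.723, -0.250, 17.763, -7.100),
    "Log2+Mean": (-125.397, -1.231, -0.157, 14.830, -6.757),
    "Log2+Median": (-146.727, -0.874, -0.155, 18.809, -9.650),
    "Log2+Quantile": (-22.228, -2.111, -0.427, 11.865, -7.980),
    "Log2+RLR": (-100.729, -2.142, -0.181, 13.524, -6.047),
    "VSN": (-140.903, -4.207, 0.408, 23.456, -12.042),
}


def linear_score(method, ca2, cat, lrg, hbg1):
    """Linear predictor z; inputs are already-normalized content values."""
    b0, b1, b2, b3, b4 = COEFFS[method]
    return b0 + b1 * ca2 + b2 * cat + b3 * lrg + b4 * hbg1


def predict_probability(method, ca2, cat, lrg, hbg1):
    """P = 1 / (1 + exp(-z)), computed so that large |z| cannot overflow."""
    z = linear_score(method, ca2, cat, lrg, hbg1)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def classify(p):
    """Decision rule of step c): P >= 0.5 means gastric cancer."""
    return "gastric cancer" if p >= 0.5 else "gastric benign"
```

For example, `classify(predict_probability("Log2", 10.0, 10.0, 12.0, 2.0))` evaluates formula iii on made-up normalized values; these inputs are placeholders chosen for illustration, not measurements from the patent.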
9. A detection method using the serum protein biomarker for gastric cancer screening according to claim 4, comprising the steps of:
a) determining, for the sample, the protein content of each of the serum protein biomarkers CA-II, CAT and LRG;
b) imputing missing values: if the content of one or more serum protein biomarkers is missing from the measurement result, a missing CA-II or CAT protein content is filled with the constant 1.0, and a missing LRG protein content is filled by the K-nearest-neighbor method;
c) normalizing the protein content data obtained in step b) by one of the eight normalization methods provided by the invention, and calculating the predicted probability P with the probability prediction formula corresponding to that normalization method. When P ≥ 0.5, the sample is judged to be a gastric cancer sample; when P < 0.5, it is judged to be a gastric benign sample. The probability prediction formulas for this set of protein biomarkers are as follows, where each protein biomarker name denotes the normalized content value of that biomarker:
i. for Log2+CycLoess normalization: P = 1/(1 + exp(-(-56.039 - 1.591×CA-II - 0.420×CAT + 4.280×LRG)));
ii. for Log2+GI normalization: P = 1/(1 + exp(-(-84.164 - 1.791×CA-II - 0.466×CAT + 5.880×LRG)));
iii. for Log2 normalization: P = 1/(1 + exp(-(-92.539 - 1.489×CA-II - 0.489×CAT + 6.075×LRG)));
iv. for Log2+Mean normalization: P = 1/(1 + exp(-(-84.108 - 1.793×CA-II - 0.466×CAT + 5.877×LRG)));
v. for Log2+Median normalization: P = 1/(1 + exp(-(-71.069 - 1.935×CA-II - 0.481×CAT + 5.345×LRG)));
vi. for Log2+Quantile normalization: P = 1/(1 + exp(-(-62.786 - 1.142×CA-II - 0.494×CAT + 4.311×LRG)));
vii. for Log2+RLR normalization: P = 1/(1 + exp(-(-71.148 - 2.369×CA-II - 0.483×CAT + 5.687×LRG)));
viii. for VSN normalization: P = 1/(1 + exp(-(-38.994 - 2.505×CA-II - 0.545×CAT + 4.318×LRG))).
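Step b) of this claim mixes two imputation rules: a constant for CA-II and CAT, and K-nearest-neighbor for LRG. The sketch below shows one way this could look. The claim does not specify K, the distance metric, or which features define the neighbors; here K defaults to 3, the distance is Euclidean over the constant-filled CA-II and CAT values, and the imputed value is the mean of the neighbors' LRG values. All of these are assumptions of this sketch, as is the function name `impute`.

```python
import math


def impute(samples, constant_fill=("CA-II", "CAT"), knn_fill=("LRG",),
           k=3, fill_value=1.0):
    """Return copies of `samples` with missing values (None) filled.

    samples: list of dicts mapping biomarker name to content or None.
    Biomarkers in `constant_fill` get `fill_value`; biomarkers in
    `knn_fill` get the mean over the k nearest samples that have the
    value, with distance measured on the `constant_fill` biomarkers.
    """
    out = [dict(s) for s in samples]  # do not modify the caller's data
    for s in out:
        for name in constant_fill:
            if s[name] is None:
                s[name] = fill_value
    for name in knn_fill:
        donors = [s for s in out if s[name] is not None]
        if not donors:
            raise ValueError(f"no observed values available for {name}")
        for s in out:
            if s[name] is None:
                here = [s[f] for f in constant_fill]
                nearest = sorted(
                    donors,
                    key=lambda d: math.dist(here, [d[f] for f in constant_fill]),
                )[:k]
                s[name] = sum(d[name] for d in nearest) / len(nearest)
    return out
```

The constant fill runs first so that every sample has a complete CA-II and CAT pair before distances are computed; a different ordering or neighbor definition would be an equally plausible reading of the claim.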
10. A detection method using the serum protein biomarker for gastric cancer screening according to claim 5, comprising the steps of:
a) determining, for the sample, the protein content of each of the serum protein biomarkers CA-II and CAT;
b) imputing missing values: if the content of one or more serum protein biomarkers is missing from the measurement result, a missing CA-II or CAT protein content is filled with the constant 1.0;
c) normalizing the protein content data obtained in step b) by one of the eight normalization methods provided by the invention, and calculating the predicted probability P with the probability prediction formula corresponding to that normalization method. When P ≥ 0.5, the sample is judged to be a gastric cancer sample; when P < 0.5, it is judged to be a gastric benign sample. The probability prediction formulas for this set of protein biomarkers are as follows, where each protein biomarker name denotes the normalized content value of that biomarker:
i. for Log2+CycLoess normalization: P = 1/(1 + exp(-(10.807 - 0.510×CA-II - 0.314×CAT)));
ii. for Log2+GI normalization: P = 1/(1 + exp(-(9.112 - 0.402×CA-II - 0.311×CAT)));
iii. for Log2 normalization: P = 1/(1 + exp(-(9.845 - 0.449×CA-II - 0.309×CAT)));
iv. for Log2+Mean normalization: P = 1/(1 + exp(-(9.116 - 0.402×CA-II - 0.311×CAT)));
v. for Log2+Median normalization: P = 1/(1 + exp(-(9.309 - 0.413×CA-II - 0.312×CAT)));
vi. for Log2+Quantile normalization: P = 1/(1 + exp(-(10.104 - 0.443×CA-II - 0.336×CAT)));
vii. for Log2+RLR normalization: P = 1/(1 + exp(-(11.356 - 0.538×CA-II - 0.318×CAT)));
viii. for VSN normalization: P = 1/(1 + exp(-(44.877 - 1.990×CA-II - 0.941×CAT))).
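Because the logistic function is monotonic, the rule "P ≥ 0.5" is equivalent to a linear inequality on the normalized values. The short derivation below works this out for the Log2 formula of this claim; the symbols z, x_CA-II and x_CAT are notation introduced here for illustration and do not appear in the original.

```latex
\begin{aligned}
P = \frac{1}{1 + e^{-z}} \ge 0.5
  &\iff 1 + e^{-z} \le 2
   \iff e^{-z} \le 1
   \iff z \ge 0, \\
z = 9.845 - 0.449\,x_{\text{CA-II}} - 0.309\,x_{\text{CAT}} \ge 0
  &\iff 0.449\,x_{\text{CA-II}} + 0.309\,x_{\text{CAT}} \le 9.845 .
\end{aligned}
```

Under this formula a sample is therefore judged to be a gastric cancer sample exactly when the weighted sum of its Log2-normalized CA-II and CAT values does not exceed 9.845; the same rearrangement applies to the other formulas with their own coefficients.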
11. A detection method using the serum protein biomarker for gastric cancer screening according to claim 6, comprising the steps of:
a) determining, for the sample, the protein content of each of the serum protein biomarkers CAT and LRG;
b) imputing missing values: if the content of one or more serum protein biomarkers is missing from the measurement result, a missing CAT protein content is filled with the constant 1.0, and a missing LRG protein content is filled by the K-nearest-neighbor method;
c) normalizing the protein content data obtained in step b) by one of the eight normalization methods provided by the invention, and calculating the predicted probability P with the probability prediction formula corresponding to that normalization method. When P ≥ 0.5, the sample is judged to be a gastric cancer sample; when P < 0.5, it is judged to be a gastric benign sample. The probability prediction formulas for this set of protein biomarkers are as follows, where each protein biomarker name denotes the normalized content value of that biomarker:
i. for Log2+CycLoess normalization: P = 1/(1 + exp(-(-56.039 - 0.420×CAT + 4.280×LRG)));
ii. for Log2+GI normalization: P = 1/(1 + exp(-(-84.164 - 0.466×CAT + 5.880×LRG)));
iii. for Log2 normalization: P = 1/(1 + exp(-(-92.539 - 0.489×CAT + 6.075×LRG)));
iv. for Log2+Mean normalization: P = 1/(1 + exp(-(-84.108 - 0.466×CAT + 5.877×LRG)));
v. for Log2+Median normalization: P = 1/(1 + exp(-(-71.069 - 0.481×CAT + 5.345×LRG)));
vi. for Log2+Quantile normalization: P = 1/(1 + exp(-(-62.786 - 0.494×CAT + 4.311×LRG)));
vii. for Log2+RLR normalization: P = 1/(1 + exp(-(-71.148 - 0.483×CAT + 5.687×LRG)));
viii. for VSN normalization: P = 1/(1 + exp(-(-38.994 - 0.545×CAT + 4.318×LRG))).
12. A detection method using the serum protein biomarker for gastric cancer screening according to claim 7, comprising the steps of:
a) determining, for the sample, the protein content of each of the serum protein biomarkers CA-II and LRG;
b) imputing missing values: if the content of one or more serum protein biomarkers is missing from the measurement result, a missing CA-II protein content is filled with the constant 1.0, and a missing LRG protein content is filled by the K-nearest-neighbor method;
c) normalizing the protein content data obtained in step b) by one of the eight normalization methods provided by the invention, and calculating the predicted probability P with the probability prediction formula corresponding to that normalization method. When P ≥ 0.5, the sample is judged to be a gastric cancer sample; when P < 0.5, it is judged to be a gastric benign sample. The probability prediction formulas for this set of protein biomarkers are as follows, where each protein biomarker name denotes the normalized content value of that biomarker:
i. for Log2+CycLoess normalization: P = 1/(1 + exp(-(-64.187 - 1.124×CA-II + 4.063×LRG)));
ii. for Log2+GI normalization: P = 1/(1 + exp(-(-73.508 - 1.365×CA-II + 4.720×LRG)));
iii. for Log2 normalization: P = 1/(1 + exp(-(-69.978 - 1.574×CA-II + 4.702×LRG)));
iv. for Log2+Mean normalization: P = 1/(1 + exp(-(-73.034 - 1.379×CA-II + 4.704×LRG)));
v. for Log2+Median normalization: P = 1/(1 + exp(-(-65.803 - 1.325×CA-II + 4.297×LRG)));
vi. for Log2+Quantile normalization: P = 1/(1 + exp(-(-59.740 - 0.628×CA-II + 3.446×LRG)));
vii. for Log2+RLR normalization: P = 1/(1 + exp(-(-70.260 - 1.473×CA-II + 4.637×LRG)));
viii. for VSN normalization: P = 1/(1 + exp(-(-48.060 - 2.711×CA-II + 4.534×LRG))).
CN202111316030.XA 2021-11-08 2021-11-08 Protein biomarker and logistic regression prediction model for gastric cancer screening Pending CN116087516A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111316030.XA CN116087516A (en) 2021-11-08 2021-11-08 Protein biomarker and logistic regression prediction model for gastric cancer screening

Publications (1)

Publication Number Publication Date
CN116087516A true CN116087516A (en) 2023-05-09

Family

ID=86205093

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111316030.XA Pending CN116087516A (en) 2021-11-08 2021-11-08 Protein biomarker and logistic regression prediction model for gastric cancer screening

Country Status (1)

Country Link
CN (1) CN116087516A (en)

Legal Events

Date Code Title Description
PB01 Publication