CN116106534B - Application of biomarker combination in preparation of lung cancer prediction product - Google Patents

Info

Publication number
CN116106534B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310376608.3A
Other languages
Chinese (zh)
Other versions
CN116106534A (en
Inventor
张磊
李美娟
李腾腾
成晓亮
张伟
周岳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Pinsheng Medical Technology Co ltd
Shanghai Ammonia Biotechnology Co ltd
Nanjing Pinsheng Medical Laboratory Co ltd
Original Assignee
Nanjing Pinsheng Medical Technology Co ltd
Shanghai Ammonia Biotechnology Co ltd
Nanjing Pinsheng Medical Laboratory Co ltd

Classifications

    • G16H10/20 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for electronic clinical trials or questionnaires
    • G01N27/62 Investigating or analysing materials by the use of electric, electrochemical, or magnetic means by investigating the ionisation of gases, e.g. aerosols; by investigating electric discharges, e.g. emission of cathode
    • G01N33/50 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G01N2800/12 Detection or diagnosis of diseases: Pulmonary diseases
    • G01N2800/52 Detection or diagnosis of diseases: Predicting or monitoring the response to treatment, e.g. for selection of therapy based on assay results in personalised medicine; Prognosis
    • G01N2800/7028 Detection or diagnosis of diseases: Mechanisms involved in disease identification; (Hyper)proliferation; Cancer
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention provides the use of a biomarker combination in preparing a lung cancer prediction product, wherein the lung cancer prediction product is used for predicting whether a subject has lung cancer or not; the biomarker combination comprises any one, or a combination of at least two, of: pterin, monapterin, 6-carboxypterin, 2,4-dioxotetrahydropteridine (lumazine), 7-hydroxy-2,4-dioxotetrahydropteridine (7-hydroxylumazine), neopterin, biopterin, sepiapterin, N-(4-aminobenzoyl)-L-glutamic acid, inosine, adenosine, 8-hydroxy-2-deoxyguanosine, 5-methyluridine, xanthosine, cytidine, guanosine, and pseudouridine. A lung cancer classification prediction model is constructed from the biomarker combination to improve the sensitivity and specificity of clinical detection, thereby enabling early screening and early diagnosis of lung cancer and good disease management, and in turn improving patient prognosis and survival.

Description

Application of biomarker combination in preparation of lung cancer prediction product
Technical Field
The invention belongs to the technical field of biological detection, and particularly relates to application of a biomarker combination in preparation of lung cancer prediction products.
Background
Lung cancer is a malignant tumor originating in the bronchial mucosa or glands of the lung. It is among the malignancies with the fastest-growing morbidity and mortality and poses the greatest threat to human health and life. Lung cancer is broadly divided into non-small cell lung cancer (NSCLC) and small cell lung cancer (SCLC), with NSCLC accounting for about 85% of all lung cancer diagnoses. NSCLC can be further classified into adenocarcinoma, squamous cell carcinoma, large cell carcinoma, and others; adenocarcinoma is the most common subtype, followed by squamous cell carcinoma.
Lung cancer has the highest incidence among malignant tumors and, in both men and women, the highest mortality, with a 5-year survival rate of only 19.7%. One of the main reasons is that lung cancer is usually not diagnosed until an advanced stage. Early screening and diagnosis of lung cancer are therefore important. Accurate early screening and early diagnosis are critical to the personalized treatment of lung cancer patients, and biomarkers with good sensitivity and specificity are urgently needed to assist in their early diagnosis and treatment.
Existing methods for diagnosing lung cancer include chest X-ray, chest CT, positron emission tomography, fiberoptic bronchoscopy, and thoracentesis cytology. However, because of limits on screening efficacy and implementation, these methods have had relatively little impact on reducing cancer mortality. In addition, when a screening method cannot distinguish malignant from non-malignant disease, overdetection can occur, exposing the patient to unnecessary treatment procedures and significant risks that may reduce quality of life. The development of new means of preventing and screening for lung cancer is therefore of great significance for early screening.
Modified nucleosides in urine are another well-known class of cancer biomarkers; they arise from chemical modification of, and damage to, free nucleosides and nucleosides bound in DNA and RNA. Urinary modified nucleosides have therefore been widely studied as markers of cancer, including epithelial cell cancers. Feng Lu et al. (Feasibility study of 5 modified nucleosides in urine as molecular markers for screening groups at high risk of lung cancer. Modern Oncology, 2010, 18(1): 35-38) found that five modified nucleosides were higher in the urine of lung cancer patients than in normal subjects, and that Pseu, m1A, m1I, m1G, and m2G in urine could serve as molecular markers for lung cancer diagnosis and could also contribute to pathological typing of lung cancer.
Thanks to advances in mass spectrometry, broad metabolic profiling of urine has revealed many changes in metabolic pathways, from which potential biomarkers can be screened. A method for detecting lung cancer based on a multi-biomarker combination of metabolites therefore has important application value.
Disclosure of Invention
Aiming at the defects existing in the prior art, the invention aims to provide application of biomarker combinations in preparation of lung cancer prediction products, wherein the biomarker combinations are pteridine metabolites and modified nucleoside metabolites, and a lung cancer classification prediction model is constructed through the biomarker combinations so as to improve sensitivity and specificity of clinical detection, thereby realizing early screening and early diagnosis of lung cancer, and performing disease management, further improving prognosis of patients and increasing survival rate of the patients.
In order to achieve the aim of the invention, the invention adopts the following technical scheme:
In a first aspect, the invention provides the use of a biomarker combination in the manufacture of a lung cancer prediction product for predicting whether a subject has lung cancer or not;
the biomarker combination comprises any one, or a combination of at least two, of: pterin, monapterin, 6-carboxypterin, 2,4-dioxotetrahydropteridine, 7-hydroxy-2,4-dioxotetrahydropteridine, neopterin, biopterin, sepiapterin, N-(4-aminobenzoyl)-L-glutamic acid, inosine, adenosine, 8-hydroxy-2-deoxyguanosine, 5-methyluridine, xanthosine, cytidine, guanosine, and pseudouridine.
Preferably, the biomarker combination comprises a combination of at least 7 of: pterin, monapterin, 6-carboxypterin, 2,4-dioxotetrahydropteridine, 7-hydroxy-2,4-dioxotetrahydropteridine, neopterin, biopterin, sepiapterin, N-(4-aminobenzoyl)-L-glutamic acid, inosine, adenosine, 8-hydroxy-2-deoxyguanosine, 5-methyluridine, xanthosine, cytidine, guanosine, and pseudouridine.
Preferably, the biomarker combination comprises:
neopterin, 8-hydroxy-2-deoxyguanosine, 7-hydroxy-2,4-dioxotetrahydropteridine, biopterin, adenosine, pseudouridine, and guanosine;
or 8-hydroxy-2-deoxyguanosine, monapterin, 2,4-dioxotetrahydropteridine, sepiapterin, biopterin, xanthosine, and pseudouridine;
or 8-hydroxy-2-deoxyguanosine, monapterin, 2,4-dioxotetrahydropteridine, biopterin, adenosine, pterin, and pseudouridine;
or 8-hydroxy-2-deoxyguanosine, monapterin, sepiapterin, biopterin, adenosine, pterin, and pseudouridine;
or neopterin, 8-hydroxy-2-deoxyguanosine, 7-hydroxy-2,4-dioxotetrahydropteridine, biopterin, adenosine, pterin, and pseudouridine.
In a second aspect, the present invention provides a method for constructing a lung cancer classification prediction model, the method comprising:
(1) Data acquisition: performing mass spectrometry to obtain mass spectrum data for the biomarker combination in urine samples of a control group and a lung cancer group;
(2) Prediction model construction: randomly dividing the mass spectrum data of the control group and lung cancer group samples into a training set and a test set at a ratio of 1:1; feeding the data to a machine learning model, optimizing its parameters, training on the training set, testing on the test set, and saving the model;
(3) Prediction result output: substituting the values of the molecular markers in a urine sample from a subject of unknown type into the prediction model, and outputting a classification result.
Preferably, in step (2), the lung cancer group samples are divided into Stage I samples, Stage II samples, Stage III samples, metastasis samples, and samples of unknown stage.
Preferably, in step (2), the machine learning model includes any one of the XGBoost algorithm, the CatBoost algorithm, or the LightGBM algorithm.
Preferably, in step (2), the machine learning model is the XGBoost algorithm.
Preferably, in step (2), after the prediction model is constructed, the method further includes verifying and evaluating the prediction model, the evaluation covering the area under the ROC curve (AUC), sensitivity, and specificity.
Preferably, in step (3), for a urine sample from a subject of unknown type, the p_i value is calculated based on the prediction model; the p_i value is the probability value P, that is, the probability that the sample belongs to lung cancer.
Preferably, the p_i value is calculated using the following formula:
Figure SMS_1
where K is the number of trees built, f is a tree in the function space F (the set of all trees), and p_i is the predicted value for sample i.
Preferably, when the prediction result is output, a cutoff value is set in the range 0.4-0.6; if P is greater than the set cutoff, the sample is predicted to be a lung cancer sample, otherwise it is predicted to be a non-lung cancer sample.
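The cutoff rule above can be illustrated with a short sketch. The function name `classify` and the default cutoff of 0.5 are illustrative assumptions; only the 0.4-0.6 range and the "greater than cutoff" rule come from the text.

```python
def classify(p: float, cutoff: float = 0.5) -> str:
    """Label a sample from its predicted lung cancer probability P.

    The cutoff must lie in the 0.4-0.6 range given in the text; the default
    of 0.5 is an assumption made for this sketch.
    """
    if not 0.4 <= cutoff <= 0.6:
        raise ValueError("cutoff outside the 0.4-0.6 range")
    # Strictly greater than the cutoff means lung cancer, as stated in the text.
    return "lung cancer" if p > cutoff else "non-lung cancer"


print(classify(0.55))        # default cutoff 0.5
print(classify(0.55, 0.6))   # stricter cutoff
```

Raising the cutoff trades sensitivity for specificity, which is presumably why a range rather than a single value is given.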
The numerical ranges recited herein include not only the recited point values but also any unrecited point values within those ranges; for reasons of length and brevity, the invention does not exhaustively list the specific point values that the ranges include.
Compared with the prior art, the invention has the following beneficial effects:
(1) The molecular concentration value can be detected rapidly and efficiently based on mass spectrometry technology, and the sensitivity and the specificity are high.
(2) The marker models based on 17 molecules can predict whether a subject suffers from lung cancer, and can be used as auxiliary diagnosis for the existing clinical detection of lung cancer, and the sensitivity and specificity of clinical detection are improved, so that early screening and early diagnosis of lung cancer are realized, disease management is realized, the prognosis of a patient is improved, and the survival rate of the patient is improved.
Drawings
FIG. 1 is a box plot of the distribution of the 17 molecular markers in the lung cancer group and the control group;
FIG. 2 is a principal component analysis (PCA) plot of the 17 molecular markers in the lung cancer group and the control group;
FIG. 3 shows the ROC curves of model 1 [neopterin, 8-hydroxy-2-deoxyguanosine, 7-hydroxy-2,4-dioxotetrahydropteridine, biopterin, adenosine, pseudouridine, guanosine] on the training set and the test set.
Detailed Description
The technical scheme of the invention is further described by the following specific embodiments. It will be apparent to those skilled in the art that the examples are merely to aid in understanding the invention and are not to be construed as a specific limitation thereof.
Where specific techniques or conditions are not indicated in the examples, they follow the literature in this field or the product specifications. Reagents or apparatus for which no manufacturer is noted are conventional products commercially available through regular channels.
Example 1
(1) Collecting a sample:
Urine samples were collected from lung cancer patients and from control subjects. Midstream morning urine was taken, stored in a refrigerator at -20 ℃, sent to a medical laboratory in Nanjing within three days, and stored in a refrigerator at -80 ℃ until the experiment.
(2) Sample pretreatment before mass spectrometry:
After the urine sample is filtered through a 0.22 μm filter membrane, 20 μL of the sample to be tested is placed in a labeled 1.5 mL EP tube, 180 μL of 50% methanol-0.01% formic acid in water is added, and after vortex mixing 80 μL of supernatant is transferred for injection.
(3) Mass spectrometry detection method:
Chromatographic conditions:
A. Chromatographic column: BEH C18 (2.1 × 10 mm, 1.7 μm);
B. mobile phase: mobile phase a: pure water (containing 0.1% formic acid); mobile phase B: methanol;
C. the chromatographic gradients are shown in table 1:
TABLE 1
Figure SMS_2
D. The ion source parameters are shown in table 2:
TABLE 2
Figure SMS_3
E. The mass spectral parameters are shown in table 3:
TABLE 3
Figure SMS_4
F. Instrument parameters: qlife Lab 9000plus triple quadrupole mass spectrometer (biomedical); qlife Lab 9000 high performance liquid chromatography system (G7167A autosampler, p.m.); the system operating software was MS quantitative analysis 10.0.10.0 (MS quantitative analysis 10.0).
(4) Data processing of mass spectrum (data quality control, PCA, OPLS-DA, ROC analysis, etc.)
1. The control group comprised 319 urine samples; the statistics of their clinical information are shown in Table 4. The clinical statistics of the urine samples from 278 lung cancer patients are shown in Table 5. The significance statistics for the 17 molecular markers in the 2 groups are shown in Table 6.
Table 4 shows the clinical information statistics of the control samples.
TABLE 4
Figure SMS_5
Table 4 notes: the format "62 (55-66)" indicates median (25% quantile-75% quantile); "-" indicates none; "Missing: 83" indicates that data for 83 samples are missing for a given feature.
Table 5 shows the clinical information statistics of the lung cancer samples.
TABLE 5
Figure SMS_6
Table 5 notes: the format "61 (53-66)" indicates a median of 61 (25% quantile 53, 75% quantile 66); "-" indicates none; "Missing: 51" indicates that data for 51 samples are missing for a given feature.
For each of the 17 molecules, the median (25% quantile-75% quantile) in the control group and in the lung cancer group, and the significance statistics between the two groups, were calculated.
TABLE 6
Figure SMS_7
The names of the molecular markers in the table and their English equivalents are as follows:
pterin: Pterin; monapterin: Monapterin; 6-carboxypterin: 6-Carboxypterin; 2,4-dioxotetrahydropteridine: Lumazine; 7-hydroxy-2,4-dioxotetrahydropteridine: 7-Hydroxylumazine; neopterin: Neopterin; biopterin: Biopterin; sepiapterin: Sepiapterin; N-(4-aminobenzoyl)-L-glutamic acid: N-(4-Aminobenzoyl)-L-glutamic acid; inosine: Inosine; adenosine: Adenosine free base; 8-hydroxy-2-deoxyguanosine: 8-Oxo-2-deoxyguanosine; 5-methyluridine: 5-Methyluridine; xanthosine: Xanthosine; cytidine: Cytidine; guanosine: Guanosine; pseudouridine: beta-Pseudouridine.
Significance was tested with the non-parametric Mann-Whitney U test; a P value below 0.05 indicates that a molecule differs significantly between the two groups. Several molecules were statistically significant: monapterin, 6-carboxypterin, 2,4-dioxotetrahydropteridine, neopterin, biopterin, sepiapterin, N-(4-aminobenzoyl)-L-glutamic acid, 8-hydroxy-2-deoxyguanosine, 5-methyluridine, and pseudouridine.
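As an illustration of the significance test named above, the following is a minimal sketch of a two-sided Mann-Whitney U test. It uses the normal approximation without tie or continuity correction, which is an assumption of this sketch; the text does not state which software or variant was used, and the input values below are made up.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test via the normal approximation.

    Tied values receive average ranks; the tie correction of the variance
    and the continuity correction are omitted to keep the sketch short.
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg
        i = j
    n1, n2 = len(x), len(y)
    r1 = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u1 = r1 - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return u, p


# Hypothetical concentrations of one marker in two groups.
control = [1.0, 2.0, 3.0, 4.0, 5.0]
cancer = [6.0, 7.0, 8.0, 9.0, 10.0]
u_stat, p_value = mann_whitney_u(control, cancer)
print(u_stat, p_value < 0.05)
```

For real data with many ties or small groups, an exact or tie-corrected implementation from a statistics library would be the safer choice.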
FIG. 1 is a box-plot display of 17 molecular markers distributed across lung cancer groups and control groups, where the data were normalized using the Z-Score method.
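Z-Score normalization rescales each marker to zero mean and unit standard deviation so that markers with very different concentration ranges can share one axis. A minimal sketch follows; the use of the sample standard deviation is an assumption, since the text does not say which estimator was used, and the input values are made up.

```python
import statistics


def z_score(values):
    """Return (v - mean) / sd for each value of one marker."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (assumed)
    return [(v - mean) / sd for v in values]


# Hypothetical concentrations of a single marker.
print(z_score([1.0, 2.0, 3.0]))
```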
FIG. 2 shows the principal component analysis (PCA) of the 17 molecular markers in the lung cancer group and the control group: the first principal component explains 36.4% of the variance, the second explains 9.4%, and the two together explain 45.8%.
(5) Mass spectrometry results
1. Introduction to the XGBoost algorithm
The XGBoost algorithm is used to build the prediction model on the training set, and the test set is used to verify the performance of the model.
XGBoost (Extreme Gradient Boosting) is an ensemble learning algorithm implemented within the gradient boosting framework. It combines multiple weak classifiers into a stronger model for training and prediction, is suitable for both regression and classification problems, and supports multi-threaded parallel computation. It is therefore particularly suitable for large and complex data sets and can even handle data sets whose features contain missing values.
In general, the XGBoost tree ensemble model can be expressed as
Figure SMS_8
where K is the number of trees built, f is a tree in the function space F (the set of all trees), and p_i is the predicted value for sample i. XGBoost first builds an initial prediction model as the first tree; the initial prediction is usually set directly to 0.5, meaning that the predicted value for all samples is 0.5. The p_i value and the residual (loss error) of each sample are then calculated. The second model builds a tree on the loss error of the initial model; likewise, the third model builds a tree on the loss error of the second model, and so on, until the specified number of trees has been built or an iteration termination condition is reached. The number of tree models and the iteration termination condition are set by the user.
Specifically, from the second tree model to the last, XGBoost must find the optimal parameters for building each tree by optimizing an objective function; that is, the tree structure is determined by minimizing the objective function. The objective function may be defined as
Figure SMS_11
where N is the number of samples. The first term of the formula,
Figure SMS_13
is the loss function, and the second term,
Figure SMS_15
is the regularization term. For a classification model, the loss function is
Figure SMS_10
A second-order Taylor expansion of the loss function is typically used. Assuming the t-th tree model is being built, the objective function of this model can be rewritten as
Figure SMS_12
where g_i is the first derivative and h_i the second derivative of the loss; the constant term is dropped in the later formulas. In the formula, f_t is defined as
Figure SMS_14
where w is the vector of leaf scores, T is the number of leaf nodes, and q assigns a sample to one of the T leaf nodes. The regularization term may be defined as
Figure SMS_16
Finally, the objective function can be rewritten as
Figure SMS_9
The γ value is used when calculating the gain of a tree split to decide whether to prune the tree.
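The formula images are not reproduced in this text. As a sketch only, assuming the figures follow the standard XGBoost formulation, the quantities described above would read as follows. The symbol λ (L2 penalty on leaf scores), the index sets I_j, and the sums G_j and H_j are standard XGBoost notation and are not named in the text.

```latex
\begin{align}
p_i &= \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in F
  && \text{(additive tree model)} \\
\mathrm{Obj} &= \sum_{i=1}^{N} l(y_i, p_i) + \sum_{k=1}^{K} \Omega(f_k)
  && \text{(loss + regularization)} \\
l(y_i, p_i) &= -\bigl[\, y_i \ln p_i + (1 - y_i) \ln (1 - p_i) \,\bigr]
  && \text{(binary log loss)} \\
\mathrm{Obj}^{(t)} &\approx \sum_{i=1}^{N}
  \Bigl[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^{2}(x_i) \Bigr]
  + \Omega(f_t) + \mathrm{constant}
  && \text{(second-order Taylor expansion)} \\
f_t(x) &= w_{q(x)}, \qquad w \in \mathbb{R}^{T}, \quad q(x) \in \{1, \dots, T\} \\
\Omega(f_t) &= \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^{2} \\
\mathrm{Obj}^{(t)} &= \sum_{j=1}^{T}
  \Bigl[ G_j w_j + \tfrac{1}{2} (H_j + \lambda) w_j^{2} \Bigr] + \gamma T,
  \qquad G_j = \sum_{i \in I_j} g_i, \quad H_j = \sum_{i \in I_j} h_i \\
\mathrm{Gain} &= \tfrac{1}{2} \left[
  \frac{G_L^{2}}{H_L + \lambda} + \frac{G_R^{2}}{H_R + \lambda}
  - \frac{(G_L + G_R)^{2}}{H_L + H_R + \lambda} \right] - \gamma
\end{align}
```

Here I_j is the set of samples assigned to leaf j, and the subscripts L and R denote the left and right children of a candidate split. A split whose gain is negative after subtracting γ is a candidate for pruning. With the binary logistic objective, the library passes the summed tree outputs through a sigmoid, so the reported p_i lies between 0 and 1.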
One important XGBoost parameter is the learning rate, which takes a value greater than 0 and less than 1. The prediction of every tree from the second tree model to the last is multiplied by the learning rate, which shrinks the contribution of each individual tree so that the model error moves toward its minimum in small steps and the model predicts new samples better.
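The interplay of the initial prediction of 0.5, the per-tree fitting step, and the learning rate can be sketched with one-split trees (stumps) on a single feature. This is a simplified illustration under stated assumptions, not the XGBoost implementation: the names `fit_stump`, `boost`, and `predict`, the penalty `lam`, and the toy data are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(x, g, h, lam=1.0):
    """Choose the threshold with the largest gain; return (thr, w_left, w_right)."""
    def score(G, H):
        return G * G / (H + lam)

    G_all, H_all = sum(g), sum(h)
    best = None
    for thr in sorted(set(x))[:-1]:
        GL = sum(gi for xi, gi in zip(x, g) if xi <= thr)
        HL = sum(hi for xi, hi in zip(x, h) if xi <= thr)
        GR, HR = G_all - GL, H_all - HL
        gain = score(GL, HL) + score(GR, HR) - score(G_all, H_all)
        if best is None or gain > best[0]:
            # Optimal leaf score is -G / (H + lam) for each side.
            best = (gain, thr, -GL / (HL + lam), -GR / (HR + lam))
    return best[1:]


def boost(x, y, n_trees=20, eta=0.3):
    """Fit stumps sequentially; every stump is shrunk by the learning rate eta."""
    z = [0.0] * len(x)  # raw score 0 corresponds to an initial probability of 0.5
    trees = []
    for _ in range(n_trees):
        p = [sigmoid(zi) for zi in z]
        g = [pi - yi for pi, yi in zip(p, y)]   # first derivative of the log loss
        h = [pi * (1.0 - pi) for pi in p]       # second derivative of the log loss
        thr, wl, wr = fit_stump(x, g, h)
        trees.append((thr, wl, wr))
        z = [zi + eta * (wl if xi <= thr else wr) for xi, zi in zip(x, z)]
    return trees


def predict(trees, xi, eta=0.3):
    z = sum(eta * (wl if xi <= thr else wr) for thr, wl, wr in trees)
    return sigmoid(z)


# Hypothetical one-feature data: low values are controls (0), high values cases (1).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [0, 0, 0, 1, 1, 1]
trees = boost(x, y)
print(predict(trees, 1.0), predict(trees, 6.0))
```

A smaller `eta` makes each stump contribute less, so more trees are needed but each step is less likely to overshoot.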
Other ensemble algorithms, such as CatBoost or LightGBM, may also be used and give similar results.
2. Sample grouping and model introduction
The total samples included 319 healthy control samples and 278 lung cancer samples. The samples were randomly divided into a training set and a test set at a ratio of 1:1. The training set comprised 160 control samples and 140 lung cancer samples (Stage I: 40, Stage II: 15, Stage III: 18, metastasis: 37, unknown stage: 30). The test set comprised 159 control samples and 138 lung cancer samples (Stage I: 40, Stage II: 15, Stage III: 18, metastasis: 36, unknown stage: 29). The training-to-test ratio of the lung cancer samples at each stage is close to 1:1.
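The reported counts are what a 1:1 split within each group would produce if the extra sample of an odd-sized group went to the training set. The sketch below reproduces them under that assumption; the text only says the split was random, so the per-group stratification, the function name `stratified_half_split`, and the seed are illustrative choices.

```python
import math
import random


def stratified_half_split(labels, seed=0):
    """Split sample indices 1:1 within each group.

    For a group of odd size, the extra sample goes to the training set
    (an assumption that matches the counts reported in the text).
    """
    rng = random.Random(seed)
    by_group = {}
    for idx, label in enumerate(labels):
        by_group.setdefault(label, []).append(idx)
    train, test = [], []
    for members in by_group.values():
        rng.shuffle(members)
        half = math.ceil(len(members) / 2)
        train.extend(members[:half])
        test.extend(members[half:])
    return train, test


# Group sizes from the text (training count + test count per group).
counts = {
    "control": 319,
    "Stage I": 80,
    "Stage II": 30,
    "Stage III": 36,
    "metastasis": 73,
    "unknown stage": 59,
}
labels = [name for name, n in counts.items() for _ in range(n)]
train, test = stratified_half_split(labels)
print(len(train), len(test))
```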
In a Python 3.9 environment, the xgboost package (version 1.6.2) is called to build the model. Grid search is used to screen and set the hyperparameters of the XGBoost algorithm: the learning rate (parameter name learning_rate) ranges over [0.05, 0.1, …, 0.3] in steps of 0.05; the learning objective (parameter name objective) is set to binary:logistic; the gamma parameter ranges over [0, 0.25, 0.5, …, 2] in steps of 0.25; the number of tree models is 1000; the tree depth ranges over [2, 3, 4, 5, 6]; the iteration termination condition is 20; the sample and feature sampling ratios during modeling range over [0.9, 1]; and the classification model evaluation index is AUCROC (area under the ROC curve). The values of these parameters are not limited to the above ranges and can be adjusted according to the modeling data.
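The search space described above can be enumerated with a short sketch. The keyword names follow the xgboost scikit-learn wrapper; mapping "sample and feature sampling ratios" to `subsample` and `colsample_bytree`, and "iteration termination condition" to `early_stopping_rounds`, are assumptions of this sketch. It only builds the candidate parameter dictionaries and does not train any model.

```python
from itertools import product

# Values searched over, as listed in the text.
grid = {
    "learning_rate": [round(0.05 * i, 2) for i in range(1, 7)],  # 0.05 ... 0.3
    "gamma": [0.25 * i for i in range(0, 9)],                    # 0 ... 2
    "max_depth": [2, 3, 4, 5, 6],
    "subsample": [0.9, 1.0],         # sample (row) sampling ratio, assumed mapping
    "colsample_bytree": [0.9, 1.0],  # feature sampling ratio, assumed mapping
}

# Settings held constant for every candidate.
fixed = {
    "objective": "binary:logistic",
    "n_estimators": 1000,
    "eval_metric": "auc",
    "early_stopping_rounds": 20,  # assumed meaning of "iteration termination condition"
}


def candidates():
    """Yield one parameter dictionary per point of the grid."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield {**fixed, **dict(zip(keys, values))}


n_candidates = sum(1 for _ in candidates())
print(n_candidates)
```

Each dictionary could then be passed as keyword arguments to a classifier, with the candidate of highest validation AUC retained.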
Among the marker combinations screened from the 17 molecules, the highest AUCROC predicted on the test set was 0.826 (95% confidence interval 0.776-0.871), with a corresponding sensitivity of 0.739 (95% confidence interval 0.664-0.810) and specificity of 0.805 (95% confidence interval 0.739-0.865). Table 7 lists 5 model combinations with test-set AUCROC values above 0.8. Well-performing models are not limited to those listed in the table; several other good models can be obtained from the 17 molecular markers.
Table 7 shows 5 model combinations for test set AUCROC values above 0.8
TABLE 7
Figure SMS_17
The indexes in the table are calculated with the formulas below from the number of true positive samples, the number of true negative samples, the total number of positive samples, the total number of samples, the number of samples predicted positive, and the number of samples predicted negative.
Figure SMS_18
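The formula image is not reproduced in this text. As a sketch, assuming the indexes are the usual confusion-matrix ratios, they can be computed as below. The counts passed in are hypothetical: they were chosen to be consistent with the reported test-set sensitivity (0.739 of 138 lung cancer samples) and specificity (0.805 of 159 control samples) and are not stated in the text.

```python
def classification_metrics(tp, fn, tn, fp):
    """Standard confusion-matrix ratios for a binary classifier."""
    return {
        "sensitivity": tp / (tp + fn),   # true positives / all positives
        "specificity": tn / (tn + fp),   # true negatives / all negatives
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "ppv": tp / (tp + fp),           # true positives / predicted positive
        "npv": tn / (tn + fn),           # true negatives / predicted negative
    }


# Hypothetical counts matching the reported rates (138 cases, 159 controls).
m = classification_metrics(tp=102, fn=36, tn=128, fp=31)
print({name: round(value, 3) for name, value in m.items()})
```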
Each marker combination has its own tree model structure. Because these structures are complex and difficult to display intuitively as formulas or tree diagrams, the structure and formula of each marker combination model are not shown; the model building process is described in the introduction to the XGBoost algorithm above.
FIG. 3 shows the ROC curves of model 1 [neopterin, 8-hydroxy-2-deoxyguanosine, 7-hydroxy-2,4-dioxotetrahydropteridine, biopterin, adenosine, pseudouridine, guanosine] on the training set and the test set.
As the ROC curves show, in the cross-validation experiment for the control group and lung cancer group the mean area under the curve on the training set was 0.912 (standard deviation 0.016) and on the validation set 0.826 (standard deviation 0.025). The model shows no over-fitting or under-fitting problem, and the validation-set area under the curve exceeds 0.7, indicating good model performance.
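The area under the ROC curve equals the probability that a randomly chosen lung cancer sample receives a higher score than a randomly chosen control sample. A minimal sketch of that pairwise definition follows; the function name `roc_auc` and the scores are illustrative, and this is not necessarily how the reported values were computed.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical predicted probabilities for two cases and two controls.
print(roc_auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a few hundred samples such as the test set described here.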
Lung cancer samples were further grouped by TNM stage; the lung cancer samples in the test set comprised Stage I (40), Stage II (15), Stage III (18), metastasis (36), and unknown stage (29). The classification accuracy for these 5 groups in the test set is shown in Table 8. Taking the predictions of model 1 on the test set as an example: 27 of 40 Stage I samples were correctly predicted (0.675); 12 of 15 Stage II samples (0.80); 12 of 18 Stage III samples (0.667); 30 of 36 metastatic samples (0.833); and 21 of 29 samples of unknown stage (0.724). Across the 5 models, the prediction accuracy is above 0.6 for every stage: between 0.65 and 0.75 for Stage I, between 0.73 and 0.94 for Stage II, between 0.61 and 0.78 for Stage III, above 0.8 for metastatic samples, and above 0.72 for samples of unknown TNM stage.
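The per-stage accuracies quoted for model 1 can be re-derived from the counts in the paragraph above. The sketch below uses only those counts; the variable names are illustrative.

```python
# (correctly predicted, total) per group for model 1 on the test set, from the text.
stage_results = {
    "Stage I": (27, 40),
    "Stage II": (12, 15),
    "Stage III": (12, 18),
    "metastasis": (30, 36),
    "unknown stage": (21, 29),
}

accuracy = {stage: round(c / n, 3) for stage, (c, n) in stage_results.items()}

# Pooling all lung cancer samples gives the overall fraction detected.
total_correct = sum(c for c, _ in stage_results.values())
total_samples = sum(n for _, n in stage_results.values())
overall = total_correct / total_samples

print(accuracy)
print(total_correct, total_samples, round(overall, 3))
```

The pooled value, 102 of 138, rounds to 0.739, which agrees with the test-set sensitivity reported for the best model.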
Table 8. Prediction accuracy for lung cancer samples of each stage in the test set
Figure SMS_19
For a new sample of unknown type, the p_i value can be calculated from the values of the 17 molecular markers and the model for a given marker combination, using the formula
Figure SMS_20
The p_i value is the probability value P, that is, the probability that the sample belongs to lung cancer. The Cutoff may be set to a value such as 0.4, 0.5, or 0.6; if P is greater than the set Cutoff, the sample type is predicted to be lung cancer, and otherwise non-lung cancer.
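The cutoff rule described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `classify` and the default cutoff of 0.5 are assumptions, and the probability P is taken as an already computed input.

```python
def classify(p, cutoff=0.5):
    """Apply the cutoff rule: lung cancer only if P is strictly greater than the cutoff.

    `p` is the model's probability that the sample belongs to lung cancer.
    The default cutoff of 0.5 is an illustrative choice among 0.4, 0.5, and 0.6.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    return "lung cancer" if p > cutoff else "non-lung cancer"


# The same probability can fall on either side depending on the cutoff chosen.
for cutoff in (0.4, 0.5, 0.6):
    print(cutoff, classify(0.55, cutoff))
```

A lower cutoff favors sensitivity and a higher cutoff favors specificity, which is why the choice of value matters in a screening setting.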
In conclusion, the present method uses a combination of pteridine and modified nucleoside metabolites as biomarkers to construct a lung cancer classification prediction model. The prediction model has relatively high sensitivity and specificity in clinical detection, and is of great significance for early screening and early diagnosis of lung cancer, for disease management, and for improving patient prognosis and survival.
The applicant declares that the above is only a specific embodiment of the present invention, and that the scope of protection of the present invention is not limited thereto. It should be apparent to those skilled in the art that any change or substitution readily conceivable within the technical scope disclosed by the present invention falls within the scope of protection and disclosure of the present invention.

Claims (9)

1. Use of a biomarker combination in the preparation of a lung cancer prediction product, wherein the lung cancer prediction product is used to predict whether a subject has lung cancer or does not have lung cancer;
the biomarker combination comprises:
neopterin, 8-hydroxy-2-deoxyguanosine, 7-hydroxy-2,4-dioxotetrahydropteridine, biopterin, adenosine, pseudouridine, and guanosine;
or 8-hydroxy-2-deoxyguanosine, monopterin, 2,4-dioxotetrahydropteridine, aminopterin, biopterin, xanthosine, and pseudouridine;
or 8-hydroxy-2-deoxyguanosine, monopterin, 2,4-dioxotetrahydropteridine, biopterin, adenosine, pterin, and pseudouridine;
or 8-hydroxy-2-deoxyguanosine, monopterin, aminopterin, biopterin, adenosine, pterin, and pseudouridine;
or neopterin, 8-hydroxy-2-deoxyguanosine, 7-hydroxy-2,4-dioxotetrahydropteridine, biopterin, adenosine, pterin, and pseudouridine.
2. A lung cancer classification prediction model, said model comprising:
(1) a data acquisition module: performing mass spectrometry detection to obtain mass spectrometry data of the biomarker combination of claim 1 in urine samples of a control group and a lung cancer group;
(2) a prediction model construction module: randomly dividing the mass spectrometry data of the control group and lung cancer group samples into a training set and a test set; inputting the data into a machine learning model, optimizing parameters, training with the training set, testing with the test set, and saving the model;
(3) a prediction result output module: substituting the values of the biomarkers in a urine sample of a subject of unknown type into the prediction model, and outputting a classification result.
3. The lung cancer classification prediction model according to claim 2, wherein, in the prediction model construction module, the lung cancer group samples are divided into Stage I samples, Stage II samples, Stage III samples, Metastasis samples, and unknown-stage samples.
4. The lung cancer classification prediction model of claim 2, wherein the machine learning model in the prediction model construction module comprises any one of an Xgboost algorithm, a Catboost algorithm, or a Lightgbm algorithm.
5. The lung cancer classification prediction model of claim 2, wherein in the prediction model construction module, the machine learning model is an Xgboost algorithm.
6. The lung cancer classification prediction model according to claim 2, wherein the prediction model construction module further comprises performing a verification evaluation on the prediction model after the prediction model is constructed, and the content of the verification evaluation comprises: area under ROC curve AUC, sensitivity and specificity.
7. The lung cancer classification prediction model according to claim 2, wherein, for a urine sample of a subject of unknown type, the prediction result output module calculates a p_i value based on the prediction model; the p_i value is the probability value P, and P is the probability that the sample belongs to lung cancer.
8. The lung cancer classification prediction model of claim 7, wherein the p_i value is calculated using the following formula:
Figure QLYQS_1
where k is the number of trees, f is a function in the space F, F is the set of all trees, and p_i is the predicted value of F for sample i.
9. The model of claim 8, wherein in the prediction result output module, a cutoff value is set, the set range of the cutoff value is 0.4-0.6, and if P is greater than the set cutoff value, the sample type is predicted as a lung cancer sample, otherwise the sample is a non-lung cancer sample.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310376608.3A CN116106534B (en) 2023-04-11 2023-04-11 Application of biomarker combination in preparation of lung cancer prediction product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310376608.3A CN116106534B (en) 2023-04-11 2023-04-11 Application of biomarker combination in preparation of lung cancer prediction product

Publications (2)

Publication Number Publication Date
CN116106534A CN116106534A (en) 2023-05-12
CN116106534B true CN116106534B (en) 2023-06-27

Family

ID=86256450

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310376608.3A Active CN116106534B (en) 2023-04-11 2023-04-11 Application of biomarker combination in preparation of lung cancer prediction product

Country Status (1)

Country Link
CN (1) CN116106534B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117388495B (en) * 2023-12-13 2024-02-09 哈尔滨脉图精准技术有限公司 Application of metabolic marker for diagnosing lung cancer stage and kit

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113611420A (en) * 2021-08-11 2021-11-05 季凯 Disease screening method and system based on blood examination indexes
CN114664440A (en) * 2021-09-27 2022-06-24 上海爱谱蒂康生物科技有限公司 Prediction method and application of breast cancer metastasis
CN113960215B (en) * 2021-11-09 2024-03-26 上海市第一人民医院 Marker for lung adenocarcinoma diagnosis and application thereof
CN114373510B (en) * 2021-11-09 2023-12-01 武汉迈特维尔医学科技有限公司 Metabolic marker for diagnosing or monitoring lung cancer and screening method and application thereof

Also Published As

Publication number Publication date
CN116106534A (en) 2023-05-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant