CN115966299A - Disease diagnosis model based on MALDI-ToF MS data - Google Patents


Publication number
CN115966299A
Authority
CN
China
Prior art keywords
ckd
peak
selection
machine learning
peaks
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210926734.7A
Other languages
Chinese (zh)
Inventor
周宏伟
李泽文
曾念宜
郑道文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southern Medical University Zhujiang Hospital
Original Assignee
Southern Medical University Zhujiang Hospital
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southern Medical University Zhujiang Hospital
Priority to CN202210926734.7A
Publication of CN115966299A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The application relates to a method for constructing a chronic kidney disease (CKD) diagnosis model based on MALDI-ToF MS data. The method comprises the following steps: i) performing characteristic peak screening on MALDI-ToF MS data of urine samples from a CKD population and a healthy population using three machine learning methods, namely the least absolute shrinkage and selection operator (Lasso), partial least squares discriminant analysis (PLS-DA) and recursive feature elimination with cross-validation (RFECV), and selecting a peak that satisfies any one of the three methods as a candidate characteristic peak; in the Lasso selection, 80% of the samples of the test data set are randomly selected, the selection is repeated 200 times, and peaks selected in more than 10% of the repetitions are taken as differential characteristic peaks of CKD; in the PLS-DA selection, peaks with a variable importance in projection (VIP) score ranked in the top 30 are taken as differential characteristic peaks of CKD; in the RFECV selection, peaks with an importance score ranked in the top 30 are taken as differential characteristic peaks of CKD; ii) screening, from the candidate characteristic peaks in i), those that simultaneously satisfy frequency >30% and AUC >60% in each group as characteristic peaks; iii) establishing a CKD diagnosis model from the identified differential characteristic peaks using a machine learning method.

Description

Disease diagnosis model based on MALDI-ToF MS data
Technical Field
The application relates to the field of biomedicine, in particular to a disease diagnosis model based on MALDI-ToF MS data.
Background
Matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS) is a mass spectrometry technique that emerged and developed rapidly in the late 1980s. Its mass analyzer is an ion drift tube. Ions generated by the ion source are first collected, so that all ions in the collector have zero velocity; they are then accelerated by a pulsed electric field, enter the field-free drift tube, and fly to the ion receiver at constant velocity. The larger the mass of an ion, the longer it takes to reach the receiver; the smaller the mass, the shorter the time. On this principle, ions of different masses can be separated according to their mass-to-charge ratio, and the molecular mass and purity of biological macromolecules such as polypeptides, proteins, nucleic acids and polysaccharides can be measured accurately. The method offers high accuracy, great flexibility, high throughput, a short detection cycle and good cost-effectiveness.
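The relation between ion mass and arrival time described above can be illustrated with the ideal energy balance z·e·U = ½·m·v², which gives a flight time proportional to the square root of the mass-to-charge ratio. The following is a minimal sketch only; the accelerating voltage (20 kV) and drift length (1 m) are assumed illustrative values and are not taken from the application.

```python
import math

E_CHARGE = 1.602176634e-19   # elementary charge, coulombs
DALTON = 1.66053906660e-27   # one dalton, kilograms

def flight_time(mz, accel_voltage, tube_length):
    """Ideal drift time in seconds for an ion of mass-to-charge ratio mz (Da per charge)."""
    # Energy balance z*e*U = 0.5*m*v^2  =>  v = sqrt(2*e*U / (m/z))
    velocity = math.sqrt(2.0 * E_CHARGE * accel_voltage / (mz * DALTON))
    return tube_length / velocity

# Assumed instrument settings, for illustration only.
t_light = flight_time(1049.48, accel_voltage=20_000.0, tube_length=1.0)
t_heavy = flight_time(6215.39, accel_voltage=20_000.0, tube_length=1.0)
print(f"m/z 1049.48: {t_light * 1e6:.2f} us, m/z 6215.39: {t_heavy * 1e6:.2f} us")
```

In this idealized picture, quadrupling the mass-to-charge ratio doubles the flight time, which is why heavier ions reach the receiver later.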
Chronic kidney disease is an increasingly serious public health problem affecting 8-16% of the world's population, with complications including anemia, cognitive decline, osteoporosis, cardiovascular disease, acute kidney injury and renal failure. Chronic kidney disease is defined as a decrease in glomerular filtration rate, with or without an accompanying increase in urinary albumin excretion, lasting more than 3 months. The prevalence of chronic kidney disease among adults in China is as high as 10.8% (10.2-11.3%), corresponding to about 119.5 million (112.9-125.0 million) patients, but the awareness rate is only 12.5%.
Immunoglobulin A (IgA) nephropathy, a renal disease caused by local autoimmune reactions following the deposition of IgA complexes in the kidney, is one of the most common primary glomerular diseases. More than 30% of patients progress to end-stage renal disease (ESRD) within 10-20 years of onset, making IgA nephropathy one of the most common causes of uremia. At present, the gold standard for diagnosing IgA nephropathy is renal biopsy with pathological examination. However, this invasive procedure has several drawbacks: (1) Renal biopsy does not allow early diagnosis; it can only detect patients in whom renal injury has already developed. (2) Renal biopsy carries risks: many patients have relative contraindications to it, or their hospitals lack the facilities for pathological diagnosis of biopsy samples, so these patients cannot obtain a definite diagnosis and targeted treatment. (3) Renal biopsy is costly, comparable to a surgical operation, and requires one week of hospitalization. There is therefore a great clinical need for noninvasive biomarkers that aid the diagnosis and differential diagnosis of IgA nephropathy.
Disclosure of Invention
Several attempts have been made worldwide to develop IgAN risk prediction models. These include a formula developed by a Japanese team, based on two Japanese cohorts, to predict the 5-year progression of IgAN patients to ESRD [PMID: 24178970], and the CLIN model (based on clinical data) and the CLIN-PATH model (based on clinical data and Oxford MEST scores), established by a Chinese team on the basis of 7 nephropathy centers nationwide to predict the 5-year progression of IgAN patients to ESRD [PMID: 295434]. These formulas are all built on clinical indices (urine protein, creatinine, glomerular filtration rate, etc.) and patient demographics (age, sex, etc.) but pay little attention to pathophysiological markers of disease, so the sensitivity and specificity of many models are insufficient and cannot fully meet the needs of clinical and research practice. Urine contains abundant natural endogenous polypeptide fragments, which are formed by normal enzymatic cleavage during the physiological activity of the kidney and can reflect its physiological state. When kidney function is impaired, the corresponding enzyme activities are affected and the types and abundances of urinary peptides change. Urinary peptides may therefore be the best indicators for assessing the severity and risk of IgAN.
The application uses matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS) so that the full spectrum of urinary polypeptides can be detected economically, quickly and conveniently. A total of 45 IgAN-related polypeptide biomarkers were selected, and a machine learning model was established from the screened IgAN characteristic peaks to identify IgA nephropathy and distinguish it from nephropathy of other tissue types.
The application provides, for the first time, a screening standard for CKD and IgAN characteristic peaks. The standard jointly considers the effect size and the relative peak intensity of characteristic peaks and yields a set of characteristic peaks closely related to the diseases. CKD and IgAN diagnosis models are constructed from the selected characteristic peaks using machine learning, providing clinicians with an effective means to predict patients' disease risk, assess the severity of renal impairment and the prognosis, deliver precise treatment, and offer tailored interventions to patients at each stage.
In one aspect, the present application provides a method for constructing a Chronic Kidney Disease (CKD) diagnosis model based on MALDI-ToF MS data, the method comprising:
i) performing characteristic peak screening on MALDI-ToF MS data of urine samples from a CKD population and a healthy population using three machine learning methods, namely the least absolute shrinkage and selection operator (Lasso), partial least squares discriminant analysis (PLS-DA) and recursive feature elimination with cross-validation (RFECV), and selecting a peak that satisfies any one of the three methods as a candidate characteristic peak;
in the Lasso selection, 80% of the samples of the test data set are randomly selected, the selection is repeated 200 times, and peaks selected in more than 10% of the repetitions are taken as differential characteristic peaks of CKD; in the PLS-DA selection, peaks with a VIP score ranked in the top 30 are taken as differential characteristic peaks of CKD; in the RFECV selection, peaks with an importance score ranked in the top 30 are taken as differential characteristic peaks of CKD;
ii) screening, from the candidate characteristic peaks in i), those that simultaneously satisfy frequency >30% and AUC >60% in each group as characteristic peaks;
iii) establishing a CKD diagnosis model from the identified differential characteristic peaks using a machine learning method.
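The repeated subsampling used in the Lasso selection of step i) can be sketched as follows. This is an illustration under stated assumptions, not the application's implementation: `selection_frequency` and `toy_selector` are hypothetical names, the selector is a simple mean-difference rule standing in for an actual Lasso fit, and the sample data (including the peak name `M/Z_9999.99`) are invented for the demonstration.

```python
import random
from collections import Counter

def selection_frequency(samples, select_fn, fraction=0.8, repeats=200, seed=0):
    """Fraction of subsampling rounds in which each peak is selected by select_fn."""
    rng = random.Random(seed)
    k = max(1, round(fraction * len(samples)))
    counts = Counter()
    for _ in range(repeats):
        subset = rng.sample(samples, k)          # random 80% of the samples
        counts.update(set(select_fn(subset)))    # count each peak once per round
    return {peak: n / repeats for peak, n in counts.items()}

def toy_selector(subset):
    """Hypothetical stand-in for a Lasso fit: keep peaks whose group means differ by > 1."""
    selected = []
    for peak in subset[0]["peaks"]:
        ckd = [s["peaks"][peak] for s in subset if s["label"] == 1]
        healthy = [s["peaks"][peak] for s in subset if s["label"] == 0]
        if ckd and healthy:
            diff = abs(sum(ckd) / len(ckd) - sum(healthy) / len(healthy))
            if diff > 1.0:
                selected.append(peak)
    return selected

# Invented toy data: one informative peak, one uninformative peak.
samples = (
    [{"label": 1, "peaks": {"M/Z_1948.19": 5.0 + i % 3, "M/Z_9999.99": 1.0}} for i in range(10)]
    + [{"label": 0, "peaks": {"M/Z_1948.19": 1.0 + i % 3, "M/Z_9999.99": 1.0}} for i in range(10)]
)
freq = selection_frequency(samples, toy_selector)
candidates = [peak for peak, f in freq.items() if f > 0.10]   # >10% threshold from step i)
print(candidates)
```

Passing the selector in as a function keeps the subsampling loop independent of the model, so a real Lasso fit could be substituted without changing the frequency calculation.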
In certain embodiments, the differential characteristic peaks between the CKD population and the healthy population are selected from peaks having the following mass-to-charge ratios: M/Z_1049.48, M/Z_1157.14, M/Z_1637.03, M/Z_1948.19, M/Z_4744.24, M/Z_6215.39, M/Z_1089.32, M/Z_2265.63, M/Z_1594.15, M/Z_1900.89, M/Z_3040.02, M/Z_1108.1, M/Z_2279.32, M/Z_2925.93, M/Z_1563.44, M/Z_2078.53, M/Z_1396.23, M/Z_1909.6, M/Z_1932.05, M/Z_1250.48, M/Z_1212.03, M/Z_1892.39, M/Z_2037.78, M/Z_1752.24, M/Z_6233.15, M/Z_1557.53, M/Z_2124.48, M/Z_1130.79, M/Z_1975.09, M/Z_1283.46, M/Z_1401.5, M/Z_1233.81, M/Z_6194.59, M/Z_1291.05, M/Z_1267.75, M/Z_6133.13, M/Z_1077.66, M/Z_2427.4, M/Z_2412.91, M/Z_2540.75, M/Z_3279.24 and M/Z_1734.95; wherein the calculated relative error of M/Z (ppm) is within 0.002.
In certain embodiments, the peaks in i) that meet any one of the following criteria are selected as candidate characteristic peaks: in the Lasso selection, 80% of the samples of the test data set are randomly selected, the selection is repeated 200 times, and peaks selected in more than 20% of the repetitions are taken as differential characteristic peaks of CKD; in the PLS-DA selection, peaks with a VIP score ranked in the top 20 are taken as differential characteristic peaks of CKD; in the RFECV selection, peaks with an importance score ranked in the top 20 are taken as differential characteristic peaks of CKD.
In certain embodiments, the differential characteristic peaks between the CKD population and the healthy population are selected from peaks having the following mass-to-charge ratios: M/Z_1049.48, M/Z_1157.14, M/Z_1637.03, M/Z_1948.19, M/Z_4744.24, M/Z_6215.39, M/Z_1089.32, M/Z_2265.63, M/Z_1594.15, M/Z_1900.89, M/Z_3040.02, M/Z_1108.1, M/Z_2279.32, M/Z_2925.93, M/Z_1563.44, M/Z_2078.53, M/Z_1396.23, M/Z_1909.6, M/Z_1932.05, M/Z_1250.48, M/Z_1212.03, M/Z_1892.39, M/Z_2037.78, M/Z_1752.24, M/Z_6233.15, M/Z_1557.53, M/Z_2124.48, M/Z_1130.79, M/Z_1975.09, M/Z_1283.46 and M/Z_2427.4; wherein the calculated relative error of M/Z (ppm) is within 0.002.
In certain embodiments, the differential characteristic peaks between the CKD population and the healthy population are selected from peaks having the following mass-to-charge ratios: one or more of M/Z_1049.48, M/Z_1089.32, M/Z_1130.79, M/Z_1157.14, M/Z_1212.03, M/Z_1250.48, M/Z_1557.53, M/Z_1594.15, M/Z_1637.03, M/Z_1752.24, M/Z_1892.39, M/Z_1900.89, M/Z_1909.6, M/Z_1932.05, M/Z_1948.19, M/Z_2037.78, M/Z_2124.48, M/Z_2265.63, M/Z_2427.4, M/Z_3040.02, M/Z_4744.24, M/Z_6215.39 and M/Z_6233.15; wherein the calculated relative error of M/Z (ppm) is within 0.002.
In certain embodiments, the peaks in i) that meet any one of the following criteria are selected as candidate characteristic peaks: in the Lasso selection, 80% of the samples of the test data set are randomly selected, the selection is repeated 200 times, and peaks selected in more than 30% of the repetitions are taken as differential characteristic peaks of CKD; in the PLS-DA selection, peaks with a VIP score ranked in the top 10 are taken as differential characteristic peaks of CKD; in the RFECV selection, peaks with an importance score ranked in the top 10 are taken as differential characteristic peaks of CKD.
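The three threshold tiers described in these embodiments (Lasso frequency above 10%, 20% or 30%, combined with a top-30, top-20 or top-10 rank cut-off) differ only in two parameters. A minimal sketch of the "any one of the three methods" rule follows; the function names, the tier labels and the toy scores are illustrative assumptions rather than values from the application.

```python
def top_k(scores, k):
    """Names of the k peaks with the highest score."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])

def candidate_peaks(lasso_freq, vip, importance, min_freq, k):
    """Union of peaks that pass the Lasso, PLS-DA (VIP) or RFECV (importance) criterion."""
    by_lasso = {peak for peak, f in lasso_freq.items() if f > min_freq}
    return by_lasso | top_k(vip, k) | top_k(importance, k)

# (Lasso frequency threshold, rank cut-off) for the three tiers in the text;
# the tier names themselves are hypothetical labels.
TIERS = {"loose": (0.10, 30), "medium": (0.20, 20), "strict": (0.30, 10)}

# Invented toy scores for four hypothetical peaks, with a rank cut-off of 1
# so that the effect of the union is visible on a tiny example.
lasso_freq = {"peak_a": 0.25, "peak_b": 0.05}
vip = {"peak_a": 2.0, "peak_b": 1.0, "peak_c": 0.5}
importance = {"peak_c": 0.9, "peak_d": 0.8, "peak_a": 0.1}
print(sorted(candidate_peaks(lasso_freq, vip, importance, min_freq=0.20, k=1)))
```

Because the criteria are combined with a set union, tightening a tier can only shrink the candidate set, which is consistent with the shorter peak lists given for the stricter embodiments.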
In certain embodiments, the differential characteristic peaks between the CKD population and the healthy population are selected from peaks having the following mass-to-charge ratios: one or more of M/Z_1049.48, M/Z_1157.14, M/Z_1637.03, M/Z_1948.19, M/Z_4744.24, M/Z_6215.39, M/Z_1089.32, M/Z_2265.63, M/Z_1594.15, M/Z_1900.89, M/Z_3040.02, M/Z_1909.6, M/Z_1932.05, M/Z_1250.48, M/Z_1212.03, M/Z_1892.39, M/Z_2124.48 and M/Z_2427.4; wherein the calculated relative error of M/Z (ppm) is within 0.002.
In some embodiments, establishing the CKD diagnosis model using the machine learning method comprises building the model with 10-fold cross-validation repeated 5 times.
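Repeating 10-fold cross-validation 5 times yields 50 train/test splits, each repetition using a fresh shuffle of the samples. The sketch below generates such splits by index; it is a plain, unstratified illustration (an assumption, since the text does not say whether folds are stratified by class), and `repeated_kfold` is a hypothetical helper name.

```python
import random

def repeated_kfold(n_samples, n_splits=10, n_repeats=5, seed=0):
    """Yield (train_idx, test_idx) pairs for n_repeats rounds of shuffled k-fold CV."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)                                    # new partition each repetition
        folds = [order[i::n_splits] for i in range(n_splits)]
        for i, test_idx in enumerate(folds):
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            yield train_idx, test_idx

splits = list(repeated_kfold(100))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Averaging a performance measure over all 50 splits reduces the dependence of the estimate on any single random partition.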
In some embodiments, the machine learning method comprises: support vector machine (SVM), random forest (RF), naive Bayes (NB), gradient boosting machine (GBM), K-nearest neighbors (KNN), conditional inference tree (ctree), and/or adaptive boosting (AdaBoost).
In some embodiments, before the MALDI-ToF MS data of the urine samples of the CKD population and the healthy population are screened, the method further comprises quality control processing and standardization of the data.
In some embodiments, the quality control processing comprises: i) quality control, ii) variance stabilization, iii) smoothing and baseline correction, and iv) intensity correction.
In certain embodiments, the quality control comprises verifying that all spectra contain the same number of data points and no NA values.
In some embodiments, the variance stabilization comprises applying a square root transform to the raw mass spectral data.
In certain embodiments, the smoothing and baseline correction comprises smoothing the spectra with a 21-point Savitzky-Golay filter and then removing baseline noise with the SNIP algorithm.
In some embodiments, the intensity correction comprises normalizing the intensity values using total ion current (TIC) calibration.
In certain embodiments, the preprocessing comprises:
i) quality control, comprising verifying that all spectra contain the same number of data points and no NA values;
ii) variance stabilization, comprising applying a square root transform to the raw mass spectral data;
iii) smoothing and baseline correction, comprising smoothing the spectra with a 21-point Savitzky-Golay filter and then removing baseline noise with the SNIP algorithm; and
iv) intensity correction, comprising normalizing the intensity values using total ion current (TIC) calibration.
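The preprocessing steps listed above can be sketched for a single spectrum as follows. This is a simplified illustration under stated assumptions: the Savitzky-Golay smoothing step is omitted for brevity, the SNIP baseline is a basic clipping variant with an assumed iteration count of 20, and all function names are hypothetical.

```python
import math

def quality_ok(spectra):
    """True if all spectra have the same number of points and no missing values."""
    same_length = len({len(s) for s in spectra}) == 1
    no_missing = all(v is not None and not math.isnan(v) for s in spectra for v in s)
    return same_length and no_missing

def sqrt_transform(intensities):
    """Square root transform of the raw intensities."""
    return [math.sqrt(max(v, 0.0)) for v in intensities]

def snip_baseline(intensities, iterations=20):
    """Simplified SNIP: clip each point to the mean of its two neighbours at distance p."""
    base = list(intensities)
    n = len(base)
    for p in range(1, iterations + 1):
        prev = list(base)
        for i in range(p, n - p):
            base[i] = min(prev[i], 0.5 * (prev[i - p] + prev[i + p]))
    return base

def tic_normalize(intensities):
    """Divide by the total ion current so that intensities sum to 1."""
    total = sum(intensities)
    return [v / total for v in intensities] if total > 0 else list(intensities)

def preprocess(intensities):
    y = sqrt_transform(intensities)
    baseline = snip_baseline(y)
    y = [max(a - b, 0.0) for a, b in zip(y, baseline)]   # baseline-corrected signal
    return tic_normalize(y)

# Invented toy spectrum: flat background of 4.0 with one peak at index 10.
raw = [4.0] * 10 + [104.0] + [4.0] * 10
out = preprocess(raw)
print(out.index(max(out)), round(sum(out), 6))
```

After these steps the flat background is removed and the remaining intensities are on a common scale, so spectra from different acquisitions can be compared.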
In some embodiments, the standardization process comprises: i) peak mass alignment, ii) peak identification and iii) peak merging;
wherein the peak mass alignment comprises first identifying landmark peaks present in most of the spectra, and then calculating a nonlinear warping function for each spectrum by fitting a local regression to the matched reference peaks;
the peak identification comprises identifying a peak whose intensity is at least twice the noise level (signal-to-noise ratio (SNR) ≥ 2) as a signal peak;
the peak merging comprises merging signal peaks within a tolerance of 0.002 ppm into a single signal peak.
In certain embodiments, the standardization process further comprises removing false-positive peaks whose frequency within a group is below 25%.
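Peak identification, peak merging and the within-group frequency filter can be sketched as below. The sketch makes several assumptions that are not stated in the text: the noise level is passed in as a single number, a peak is taken to be a local maximum, and the merging tolerance 0.002 is treated as a plain relative fraction of the m/z value. All function names and toy values are hypothetical.

```python
def signal_peaks(mz, intensity, noise, min_snr=2.0):
    """m/z values of local maxima whose intensity is at least min_snr times the noise."""
    peaks = []
    for i in range(1, len(mz) - 1):
        is_local_max = intensity[i - 1] < intensity[i] >= intensity[i + 1]
        if is_local_max and intensity[i] >= min_snr * noise:
            peaks.append(mz[i])
    return peaks

def merge_peaks(all_mz, tolerance=0.002):
    """Merge sorted m/z values lying within a relative tolerance of the running bin mean."""
    bins = []
    for value in sorted(all_mz):
        if bins:
            mean = sum(bins[-1]) / len(bins[-1])
            if abs(value - mean) / mean <= tolerance:
                bins[-1].append(value)
                continue
        bins.append([value])
    return [sum(b) / len(b) for b in bins]

def frequent_peaks(presence_count, group_size, min_frequency=0.25):
    """Keep peaks detected in at least min_frequency of the samples of a group."""
    return [p for p, n in presence_count.items() if n / group_size >= min_frequency]

# Invented toy values.
print(signal_peaks([1, 2, 3, 4, 5], [0, 5, 0, 1, 0], noise=1.0))
print(merge_peaks([1000.0, 1001.0, 1010.0]))
print(frequent_peaks({"peak_a": 3, "peak_b": 1}, group_size=10))
```

Merging puts peaks from different spectra onto shared m/z positions, which is what allows a per-group detection frequency to be counted for each peak.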
In certain embodiments, the method comprises:
i) performing MALDI-ToF MS detection on urine samples from a CKD population and a healthy population to obtain the urinary polypeptide fingerprints of the two groups;
ii) performing quality control processing and standardization on the urinary polypeptide fingerprints of the CKD population and the healthy population;
iii) screening characteristic peaks of the urinary polypeptide fingerprints of the CKD population and the healthy population using three machine learning methods, namely the least absolute shrinkage and selection operator (Lasso), partial least squares discriminant analysis (PLS-DA) and recursive feature elimination with cross-validation (RFECV), and selecting a peak that satisfies any one of the three methods as a candidate characteristic peak;
in the Lasso selection, 80% of the samples of the test data set are randomly selected, the selection is repeated 200 times, and peaks selected in more than 10% of the repetitions are taken as differential characteristic peaks of CKD; in the PLS-DA selection, peaks with a VIP score ranked in the top 30 are taken as differential characteristic peaks of CKD; in the RFECV selection, peaks with an importance score ranked in the top 30 are taken as differential characteristic peaks of CKD;
iv) screening, from the candidate characteristic peaks in iii), those that simultaneously satisfy frequency >30% and AUC >60% in each group as characteristic peaks;
v) establishing a CKD diagnosis model from the identified differential characteristic peaks using 7 machine learning methods, namely support vector machine (SVM), random forest (RF), naive Bayes (NB), gradient boosting machine (GBM), K-nearest neighbors (KNN), conditional inference tree (ctree) and adaptive boosting (AdaBoost).
In certain embodiments, the method further comprises evaluating the disease diagnosis model using the AUC metric.
In certain embodiments, evaluating the disease diagnosis model using the AUC metric comprises validating the model on MALDI-ToF MS data from an independent CKD population and healthy population.
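The AUC (area under the ROC curve) equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. A minimal sketch of this pairwise calculation follows; the function name and the example scores are illustrative assumptions, with label 1 standing for CKD and 0 for healthy.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute AUC")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented model outputs for four validation samples.
print(auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))
```

The same function applies both to a model's predicted probabilities on an independent validation set and to the intensity of a single peak when screening individual features.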
In another aspect, the present application provides a method for constructing a Chronic Kidney Disease (CKD) diagnosis model based on MALDI-ToF MS data, the method including:
i) screening MALDI-ToF MS data of urine samples from a CKD population and a healthy population to obtain one or more differential characteristic peaks between the CKD population and the healthy population having the following mass-to-charge ratios: M/Z_1049.48, M/Z_1157.14, M/Z_1637.03, M/Z_1948.19, M/Z_4744.24, M/Z_6215.39, M/Z_1089.32, M/Z_2265.63, M/Z_1594.15, M/Z_1900.89, M/Z_3040.02, M/Z_1909.6, M/Z_1932.05, M/Z_1250.48, M/Z_1212.03, M/Z_1892.39, M/Z_2124.48 and M/Z_2427.4; wherein the calculated relative error of M/Z (ppm) is within 0.002;
ii) establishing a CKD disease diagnosis model by using the identified difference characteristic peak and adopting a machine learning method.
In certain embodiments, the differential characteristic peaks between the CKD population and the healthy population are selected from the group consisting of:
M/Z_1049.48, M/Z_1157.14, M/Z_1637.03, M/Z_1948.19, M/Z_4744.24, M/Z_6215.39, M/Z_1089.32, M/Z_2265.63, M/Z_1594.15, M/Z_1900.89, M/Z_3040.02, M/Z_1108.1, M/Z_2279.32, M/Z_2925.93, M/Z_1563.44, M/Z_2078.53, M/Z_1396.23, M/Z_1909.6, M/Z_1932.05, M/Z_1250.48, M/Z_1212.03, M/Z_1892.39, M/Z_2037.78, M/Z_1752.24, M/Z_6233.15, M/Z_1557.53, M/Z_2124.48, M/Z_1130.79, M/Z_1975.09, M/Z_1283.46 and M/Z_2427.4; wherein the calculated relative error of M/Z (ppm) is within 0.002.
In certain embodiments, the differential characteristic peaks between the CKD population and the healthy population are selected from the group consisting of:
M/Z_1049.48, M/Z_1157.14, M/Z_1637.03, M/Z_1948.19, M/Z_4744.24, M/Z_6215.39, M/Z_1089.32, M/Z_2265.63, M/Z_1594.15, M/Z_1900.89, M/Z_3040.02, M/Z_1108.1, M/Z_2279.32, M/Z_2925.93, M/Z_1563.44, M/Z_2078.53, M/Z_1396.23, M/Z_1909.6, M/Z_1932.05, M/Z_1250.48, M/Z_1212.03, M/Z_1892.39, M/Z_2037.78, M/Z_1752.24, M/Z_6233.15, M/Z_1557.53, M/Z_2124.48, M/Z_1130.79, M/Z_1975.09, M/Z_1283.46, M/Z_1401.5, M/Z_1233.81, M/Z_6194.59, M/Z_1291.05, M/Z_1267.75, M/Z_6133.13, M/Z_1077.66, M/Z_2427.4, M/Z_2412.91, M/Z_2540.75, M/Z_3279.24 and M/Z_1734.95; wherein the calculated relative error of M/Z (ppm) is within 0.002.
In certain embodiments, the differential characteristic peaks between the CKD population and the healthy population comprise: M/Z_1948.19, M/Z_1909.6, M/Z_1932.05, M/Z_2427.4 and M/Z_1637.03; wherein the calculated relative error of M/Z (ppm) is within 0.002.
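To use a fixed list of characteristic peaks as model inputs, each observed peak has to be matched to a listed mass-to-charge ratio within the stated relative error. The sketch below builds a feature vector for the five peaks named above. It treats 0.002 as a plain relative fraction, which is an assumption about how the tolerance is meant; `feature_vector` and the observed values are hypothetical.

```python
# The five characteristic peaks named in this embodiment.
REFERENCE_MZ = [1637.03, 1909.6, 1932.05, 1948.19, 2427.4]

def feature_vector(observed, reference=REFERENCE_MZ, tolerance=0.002):
    """observed: list of (mz, intensity) pairs. Returns one intensity per reference peak.

    A reference peak with no observed peak inside the relative tolerance gets 0.0;
    if several observed peaks match, the one with the smallest relative error is used.
    """
    features = []
    for ref in reference:
        matches = [
            (abs(mz - ref) / ref, intensity)
            for mz, intensity in observed
            if abs(mz - ref) / ref <= tolerance
        ]
        features.append(min(matches)[1] if matches else 0.0)
    return features

# Invented observed peaks for one urine sample.
observed = [(1637.5, 0.12), (1948.0, 0.30), (5000.0, 0.9)]
print(feature_vector(observed))
```

The result always has one entry per reference peak in a fixed order, so vectors from different samples can be stacked into a matrix for any of the listed classifiers.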
In certain embodiments, the differential characteristic peaks between the CKD population and the healthy population comprise: M/Z_1049.48, M/Z_1157.14, M/Z_1637.03, M/Z_1948.19, M/Z_4744.24, M/Z_6215.39, M/Z_1089.32, M/Z_2265.63, M/Z_1594.15, M/Z_1900.89, M/Z_3040.02, M/Z_1909.6, M/Z_1932.05, M/Z_1250.48, M/Z_1212.03, M/Z_1892.39, M/Z_2124.48 and M/Z_2427.4; wherein the calculated relative error of M/Z (ppm) is within 0.002.
In certain embodiments, the differential characteristic peaks between the CKD population and the healthy population comprise: M/Z_1049.48, M/Z_1089.32, M/Z_1130.79, M/Z_1157.14, M/Z_1212.03, M/Z_1250.48, M/Z_1557.53, M/Z_1594.15, M/Z_1637.03, M/Z_1752.24, M/Z_1892.39, M/Z_1900.89, M/Z_1909.6, M/Z_1932.05, M/Z_1948.19, M/Z_2037.78, M/Z_2124.48, M/Z_2265.63, M/Z_2427.4, M/Z_3040.02, M/Z_4744.24, M/Z_6215.39 and M/Z_6233.15; wherein the calculated relative error of M/Z (ppm) is within 0.002.
In certain embodiments, the differential characteristic peaks between the CKD population and the healthy population comprise: M/Z_1049.48, M/Z_1157.14, M/Z_1637.03, M/Z_1948.19, M/Z_4744.24, M/Z_6215.39, M/Z_1089.32, M/Z_2265.63, M/Z_1594.15, M/Z_1900.89, M/Z_3040.02, M/Z_1108.1, M/Z_2279.32, M/Z_2925.93, M/Z_1563.44, M/Z_2078.53, M/Z_1396.23, M/Z_1909.6, M/Z_1932.05, M/Z_1250.48, M/Z_1212.03, M/Z_1892.39, M/Z_2037.78, M/Z_1752.24, M/Z_6233.15, M/Z_1557.53, M/Z_2124.48, M/Z_1130.79, M/Z_1975.09, M/Z_1283.46 and M/Z_2427.4; wherein the calculated relative error of M/Z (ppm) is within 0.002.
In certain embodiments, the differential characteristic peaks between the CKD population and the healthy population comprise: M/Z_1049.48, M/Z_1157.14, M/Z_1637.03, M/Z_1948.19, M/Z_4744.24, M/Z_6215.39, M/Z_1089.32, M/Z_2265.63, M/Z_1594.15, M/Z_1900.89, M/Z_3040.02, M/Z_1108.1, M/Z_2279.32, M/Z_2925.93, M/Z_1563.44, M/Z_2078.53, M/Z_1396.23, M/Z_1909.6, M/Z_1932.05, M/Z_1250.48, M/Z_1212.03, M/Z_1892.39, M/Z_2037.78, M/Z_1752.24, M/Z_6233.15, M/Z_1557.53, M/Z_2124.48, M/Z_1130.79, M/Z_1975.09, M/Z_1283.46, M/Z_1401.5, M/Z_1233.81, M/Z_6194.59, M/Z_1291.05, M/Z_1267.75, M/Z_6133.13, M/Z_1077.66, M/Z_2427.4, M/Z_2412.91, M/Z_2540.75, M/Z_3279.24 and M/Z_1734.95; wherein the calculated relative error of M/Z (ppm) is within 0.002.
In certain embodiments, the screening for distinct characteristic peaks between CKD and healthy populations comprises:
i) performing characteristic peak screening on MALDI-ToF MS data of urine samples from a CKD population and a healthy population using three machine learning methods, namely the least absolute shrinkage and selection operator (Lasso), partial least squares discriminant analysis (PLS-DA) and recursive feature elimination with cross-validation (RFECV), and selecting a peak that satisfies any one of the three methods as a candidate characteristic peak;
in the Lasso selection, 80% of the samples of the test data set are randomly selected, the selection is repeated 200 times, and peaks selected in more than 10% of the repetitions are taken as differential characteristic peaks of CKD; in the PLS-DA selection, peaks with a VIP score ranked in the top 30 are taken as differential characteristic peaks of CKD; in the RFECV selection, peaks with an importance score ranked in the top 30 are taken as differential characteristic peaks of CKD;
ii) screening, from the candidate characteristic peaks in i), those that simultaneously satisfy frequency >30% and AUC >60% in each group as characteristic peaks.
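The second screening step keeps only candidates that pass both a frequency threshold and an AUC threshold. One possible reading, assumed here, is that a peak must be detected in more than 30% of the samples of each group and have a single-peak AUC above 0.60. The sketch below applies that reading to precomputed values; the function name, peak names and numbers are invented for illustration.

```python
def final_peaks(candidates, group_frequency, peak_auc, min_freq=0.30, min_auc=0.60):
    """Keep candidates whose detection frequency exceeds min_freq in every group
    and whose single-peak AUC exceeds min_auc."""
    kept = []
    for peak in candidates:
        freq_ok = all(f > min_freq for f in group_frequency[peak].values())
        if freq_ok and peak_auc[peak] > min_auc:
            kept.append(peak)
    return kept

# Invented values for three hypothetical candidate peaks.
candidates = ["peak_a", "peak_b", "peak_c"]
group_frequency = {
    "peak_a": {"CKD": 0.80, "healthy": 0.55},
    "peak_b": {"CKD": 0.90, "healthy": 0.10},   # rarely detected in healthy samples
    "peak_c": {"CKD": 0.70, "healthy": 0.65},
}
peak_auc = {"peak_a": 0.82, "peak_b": 0.91, "peak_c": 0.55}
print(final_peaks(candidates, group_frequency, peak_auc))
```

Requiring both conditions removes peaks that discriminate well only because they are rarely detected in one group, as well as peaks that are common but carry little discriminative information.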
In some embodiments, establishing the CKD diagnosis model using the machine learning method comprises building the model with 10-fold cross-validation repeated 5 times.
In some embodiments, the machine learning method comprises: support vector machine (SVM), random forest (RF), naive Bayes (NB), gradient boosting machine (GBM), K-nearest neighbors (KNN), conditional inference tree (ctree), and/or adaptive boosting (AdaBoost).
In certain embodiments, when the characteristic polypeptide peaks M/Z_1948.19, M/Z_1909.6, M/Z_1932.05, M/Z_2427.4 and M/Z_1637.03 are up-regulated, the urine sample is a positive sample, i.e., the patient is a CKD patient, and the 10-fold cross-validation accuracy is not less than 90%.
In certain embodiments, wherein the CKD comprises IgA nephropathy and Non-IgA nephropathy.
In another aspect, the present application provides a use of characteristic peaks based on MALDI-ToF MS data in preparing a diagnosis model of CKD, wherein the CKD diagnosis model is a machine learning model, and the characteristic peaks are selected from peaks with the following mass-to-charge ratios: M/Z1049.48, M/Z1157.14, M/Z1637.03, M/Z1948.19, M/Z4744.24, M/Z6215.39, M/Z1089.32, M/Z2265.63, M/Z1594.15, M/Z1900.89, M/Z3040.02, M/Z1108.1, M/Z2279.32, M/Z2925.93, M/Z1563.44, M/Z2078.53, M/Z1396.23, M/Z1909.6, M/Z1932.05, M/Z6248, M/Z1212.03, M/Z2031892.39, M/Z7.78, M/Z1752.24, M/Z6233.48, M/Z6246, M/Z1554.9, M/Z1287.46, M/Z1283.79, M/Z1284.9.9, M/Z1283.9.05, M/Z1283.9.3.3.3, M/Z1283.3.3.3, M/Z6423, M/Z-D; wherein the calculated relative error of M/Z (ppm) is within 0.002.
In another aspect, the present application provides a diagnostic model of CKD based on MALDI-ToF MS data, the diagnostic model of CKD being a machine learning model and having a plurality of characteristic peaks of CKD selected from: one or more of M/Z_1049.48, M/Z_1089.32, M/Z_1130.79, M/Z_1157.14, M/Z_1212.03, M/Z_1250.48, M/Z_1557.53, M/Z_1594.15, M/Z_1637.03, M/Z_1752.24, M/Z_1892.39, M/Z_1900.89, M/Z_1909.6, M/Z_1932.05, M/Z_1948.19, M/Z_2037.78, M/Z_2124.48, M/Z_2265.63, M/Z_2427.4, M/Z_3040.02, M/Z_4744.24, M/Z_6215.39 and M/Z_6233.15; wherein the calculated relative error of M/Z (ppm) is within 0.002.
In certain embodiments, the differential characteristic peaks between the CKD population and the healthy population comprise: M/Z_1948.19, M/Z_1909.6, M/Z_1932.05, M/Z_2427.4 and M/Z_1637.03; wherein the calculated relative error of M/Z (ppm) is within 0.002.
In certain embodiments, the differential characteristic peaks between the CKD population and the healthy population comprise: M/Z_1049.48, M/Z_1157.14, M/Z_1637.03, M/Z_1948.19, M/Z_4744.24, M/Z_6215.39, M/Z_1089.32, M/Z_2265.63, M/Z_1594.15, M/Z_1900.89, M/Z_3040.02, M/Z_1909.6, M/Z_1932.05, M/Z_1250.48, M/Z_1212.03, M/Z_1892.39, M/Z_2124.48 and M/Z_2427.4; wherein the calculated relative error of M/Z (ppm) is within 0.002.
In certain embodiments, the differential characteristic peaks between the CKD population and the healthy population comprise: M/Z_1049.48, M/Z_1089.32, M/Z_1130.79, M/Z_1157.14, M/Z_1212.03, M/Z_1250.48, M/Z_1557.53, M/Z_1594.15, M/Z_1637.03, M/Z_1752.24, M/Z_1892.39, M/Z_1900.89, M/Z_1909.6, M/Z_1932.05, M/Z_1948.19, M/Z_2037.78, M/Z_2124.48, M/Z_2265.63, M/Z_2427.4, M/Z_3040.02, M/Z_4744.24, M/Z_6215.39 and M/Z_6233.15; wherein the calculated relative error of M/Z (ppm) is within 0.002.
In certain embodiments, the differential characteristic peaks between the CKD population and the healthy population comprise: M/Z_1049.48, M/Z_1157.14, M/Z_1637.03, M/Z_1948.19, M/Z_4744.24, M/Z_6215.39, M/Z_1089.32, M/Z_2265.63, M/Z_1594.15, M/Z_1900.89, M/Z_3040.02, M/Z_1108.1, M/Z_2279.32, M/Z_2925.93, M/Z_1563.44, M/Z_2078.53, M/Z_1396.23, M/Z_1909.6, M/Z_1932.05, M/Z_1250.48, M/Z_1212.03, M/Z_1892.39, M/Z_2037.78, M/Z_1752.24, M/Z_6233.15, M/Z_1557.53, M/Z_2124.48, M/Z_1130.79, M/Z_1975.09, M/Z_1283.46 and M/Z_2427.4; wherein the calculated relative error of M/Z (ppm) is within 0.002.
In certain embodiments, the differential characteristic peaks between the CKD population and the healthy population comprise: M/Z_1049.48, M/Z_1157.14, M/Z_1637.03, M/Z_1948.19, M/Z_4744.24, M/Z_6215.39, M/Z_1089.32, M/Z_2265.63, M/Z_1594.15, M/Z_1900.89, M/Z_3040.02, M/Z_1108.1, M/Z_2279.32, M/Z_2925.93, M/Z_1563.44, M/Z_2078.53, M/Z_1396.23, M/Z_1909.6, M/Z_1932.05, M/Z_1250.48, M/Z_1212.03, M/Z_1892.39, M/Z_2037.78, M/Z_1752.24, M/Z_6233.15, M/Z_1557.53, M/Z_2124.48, M/Z_1130.79, M/Z_1975.09, M/Z_1283.46, M/Z_1401.5, M/Z_1233.81, M/Z_6194.59, M/Z_1291.05, M/Z_1267.75, M/Z_6133.13, M/Z_1077.66, M/Z_2427.4, M/Z_2412.91, M/Z_2540.75, M/Z_3279.24 and M/Z_1734.95; wherein the calculated relative error of M/Z (ppm) is within 0.002.
In some embodiments, the machine learning model comprises: support vector machine (SVM), random forest (RF), naive Bayes (NB), gradient boosting machine (GBM), K-nearest neighbors (KNN), conditional inference tree (ctree), and/or adaptive boosting (AdaBoost).
In another aspect, the present application provides a method of diagnosing CKD, the method comprising: i) obtaining the urinary polypeptide fingerprint of a urine sample of a subject; and ii) selecting, in the fingerprint, the differential characteristic peaks between the CKD population and the healthy population, and inputting them into the CKD diagnosis model to obtain the probability that the subject suffers from CKD.
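The final scoring step, in which the selected peak intensities of a new sample are passed to a trained model to obtain a probability, can be illustrated with K-nearest neighbors, one of the listed classifier types. The sketch below is an illustration only: the training vectors are invented two-feature toy data, and k = 3 and the Euclidean distance are assumed choices.

```python
import math

def knn_probability(train_x, train_y, x, k=3):
    """Fraction of the k nearest training samples that are labelled CKD (1)."""
    by_distance = sorted((math.dist(row, x), y) for row, y in zip(train_x, train_y))
    nearest = by_distance[:k]
    return sum(y for _, y in nearest) / len(nearest)

# Invented training data: two peak intensities per sample, label 1 = CKD, 0 = healthy.
train_x = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]]
train_y = [0, 0, 1, 1, 1]

print(knn_probability(train_x, train_y, [0.95, 0.95]))
print(round(knn_probability(train_x, train_y, [0.0, 0.05]), 3))
```

A decision threshold on the returned probability, chosen for example from the ROC curve on validation data, would turn the score into a positive or negative call.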
In certain embodiments, the method comprises using the AUC indicator to determine the probability of whether a subject suffers from CKD.
In another aspect, the present application provides a diagnostic system for CKD, comprising a computing unit executing the diagnostic model for CKD described herein.
In another aspect, the present application provides a method for constructing a diagnostic model for IgA nephropathy (IgAN) based on MALDI-ToF MS data, the method comprising:
i) screening MALDI-ToF MS data of urine samples of the IgAN population and the Non-IgAN population to obtain one or more difference characteristic peaks having the following mass-to-charge ratios: M/Z _1049.48, M/Z _1212.03, M/Z _1948.19, M/Z _6215.39, M/Z _1594.15, M/Z _2941.82, M/Z _3279.24, M/Z _2265.63, M/Z _1637.03, M/Z _1089.32, M/Z _2427.4, M/Z _1734.95, M/Z _3040.02, M/Z _1267.75, M/Z _1909.6, M/Z _1932.05, M/Z _1250.48, M/Z _2037.78, M/Z _1157.14, M/Z _6233.15, M/Z _1892.39, M/Z _1130.79, M/Z _1900.89, M/Z _1233.81, M/Z _1557.53, M/Z _4744.24, M/Z _2124.48, M/Z _6133.13, M/Z _1394.04, M/Z _1608.59, M/Z _1629.53, M/Z _1686.93, M/Z _1752.24, M/Z _1803.01, M/Z _2412.91, M/Z _2585.93, M/Z _2601.72, M/Z _2726.14, M/Z _2733.04, M/Z _2999.97, M/Z _3021.7, M/Z _3208.35, M/Z _3286.12, M/Z _3324.23 and M/Z _6177.18; wherein the calculated relative error of M/Z (ppm) is within 0.002;
ii) establishing an IgAN disease diagnosis model by using the identified difference characteristic peak and adopting a machine learning method.
In certain embodiments, wherein the Non-IgAN population comprises a healthy population and a CKD population of Non-IgA nephropathy.
In another aspect, the present application provides a method for constructing a model for identifying an IgAN and Non-IgAN, the method comprising:
i) Screening MALDI-ToF MS data of urine samples of IgAN people and Non-IgAN people to obtain one or more different characteristic peaks between CKD people and healthy people with the following mass-to-charge ratios: M/Z _1394.04, M/Z _1608.59, M/Z _1629.53, M/Z _1637.03, M/Z _1686.93, M/Z _1752.24, M/Z _1803.01, M/Z _2412.91, M/Z _2585.93, M/Z _2601.72, M/Z _2726.14, M/Z _2733.04, M/Z _2941.82, M/Z _2999.97, M/Z _3021.7, M/Z _3040.02, M/Z _3208.35, M/Z _ 79.24, M/Z _3286.12, M/Z _3324.23, M/Z _6177.18, M/Z _ 323, M/Z _1714.8, M/Z _ 3273.68, M/Z _ 3228.191, M/Z _ 161263.191, M/Z _ 17163.63; wherein the calculated relative error of M/Z (ppm) is within 0.002;
ii) establishing a model for identifying the IgAN and the Non-IgAN by using the identified difference characteristic peak and adopting a machine learning method.
In another aspect, the present application provides an IgA nephropathy (IgAN) diagnostic model based on MALDI-ToF MS data, the IgAN diagnostic model being a machine learning based model and having a plurality of characteristic peaks of IgAN selected from: M/Z _1049.48, M/Z _1212.03, M/Z _1948.19, M/Z _6215.39, M/Z _1594.15, M/Z _2941.82, M/Z _3279.24, M/Z _2265.63, M/Z _1637.03, M/Z _1089.32, M/Z _2427.4, M/Z _1734.95, M/Z _3040.02, M/Z _1267.75, M/Z _1909.6, M/Z _1932.05, M/Z _1250.48, M/Z _2037.78, M/Z _1157.14, M/Z _6233.15, M/Z _1892.39, M/Z _1130.79, M/Z _1900.89, M/Z _1233.81, M/Z _1557.53, M/Z _4744.24, M/Z _2124.48, M/Z _6133.13, M/Z _1394.04, M/Z _1608.59, M/Z _1629.53, M/Z _1686.93, M/Z _1752.24, M/Z _1803.01, M/Z _2412.91, M/Z _2585.93, M/Z _2601.72, M/Z _2726.14, M/Z _2733.04, M/Z _2999.97, M/Z _3021.7, M/Z _3208.35, M/Z _3286.12, M/Z _3324.23 and M/Z _6177.18; wherein the calculated relative error of M/Z (ppm) is within 0.002.
In another aspect, the present application provides a diagnostic system for IgAN, which includes a computing unit executing the diagnostic model for IgAN as described in the present application.
In another aspect, the present application provides a method for diagnosing IgA nephropathy, comprising: the MALDI-ToF MS data of the subject urine sample is input into the IgA nephropathy diagnostic model described herein, and the probability of whether the patient suffers from IgAN is obtained.
In certain embodiments, the method comprises using the AUC indicator to determine the probability of whether a subject suffers from IgAN.
In another aspect, the present application provides a model for identifying IgAN and Non-IgAN, wherein the model for identifying IgAN and Non-IgAN is a machine learning model, and the model for identifying IgAN and Non-IgAN has a plurality of characteristic peaks selected from the group consisting of: M/Z _1394.04, M/Z _1608.59, M/Z _1629.53, M/Z _1637.03, M/Z _1686.93, M/Z _1752.24, M/Z _1803.01, M/Z _2412.91, M/Z _2585.93, M/Z _2601.72, M/Z _2726.14, M/Z _2733.04, M/Z _2941.82, M/Z _2999.97, M/Z _3021.7, M/Z _3040.02, M/Z _3208.35, M/Z _3279.24, M/Z _3286.12, M/Z _3324.23, M/Z _6177.18, M/Z _ 1713, M/Z _1714.8, M/Z _ 3273.68, M/Z _ 3263.191, M/Z _ 2863.8; wherein the calculated relative error of M/Z (ppm) is within 0.002.
In another aspect, the present application provides a marker for diagnosing CKD, the marker comprising one or more polypeptides selected from the group consisting of polypeptides having mass-to-charge ratios as follows: M/Z _1212.03, M/Z _1948.19, M/Z _6215.39, M/Z _1594.15, M/Z _2941.82, M/Z _3279.24, M/Z _2265.63, M/Z _1637.03, M/Z _1089.32, M/Z _2427.4, M/Z _1734.95, M/Z _3040.02, M/Z _1267.75, M/Z _1909.6, M/Z _1932.05, M/Z _1250.48, M/Z _2037.78, M/Z _1157.14, M/Z _6233.15, M/Z _1892.39, M/Z _1130.79, M/Z _1900.89, M/Z _1233.81, M/Z _1557.53, M/Z _4744.24, M/Z _2124.48, M/Z _6133.13, M/Z _1394.04, M/Z _1608.59, M/Z _1629.53, M/Z _1686.93, M/Z _1752.24, M/Z _1803.01, M/Z _2412.91, M/Z _2585.93, M/Z _2601.72, M/Z _2726.14, M/Z _2733.04, M/Z _2999.97, M/Z _3021.7, M/Z _3208.35, M/Z _3286.12, M/Z _3324.23 and M/Z _6177.18; wherein the calculated relative error of M/Z (ppm) is within 0.002.
In certain embodiments, the marker comprises one or more polypeptides selected from the group consisting of polypeptides having mass-to-charge ratios as follows: M/Z _1049.48, M/Z _1089.32, M/Z _1108.1, M/Z _1130.79, M/Z _1157.14, M/Z _1212.03, M/Z _1250.48, M/Z _1283.46, M/Z _1396.23, M/Z _1557.53, M/Z _1563.44, M/Z _1594.15, M/Z _1637.03, M/Z _1752.24, M/Z _1892.39, M/Z _1900.89, M/Z _1909.6, M/Z _1932.05, M/Z _1948.19, M/Z _1975.09, M/Z _2037.78, M/Z _2078.53, M/Z _2265.63, M/Z _2279.32 and M/Z _2427.4; wherein the calculated relative error of M/Z (ppm) is within 0.002.
In certain embodiments, the marker comprises a polypeptide having a mass-to-charge ratio of: M/Z _1948.19, M/Z _1909.6, M/Z _1932.05, M/Z _2427.4 and/or M/Z _1637.03; wherein the calculated relative error of M/Z (ppm) is within 0.002.
In certain embodiments, the polypeptide with the mass-to-charge ratio of M/Z _1049.48 has an amino acid sequence shown in SEQ ID NO: 1;
the polypeptide with the mass-to-charge ratio of M/Z _1089.32 has an amino acid sequence shown as SEQ ID NO. 2;
the polypeptide with the mass-to-charge ratio of M/Z _1108.1 has an amino acid sequence shown in SEQ ID NO. 3;
the polypeptide with the mass-to-charge ratio of M/Z _1130.79 has an amino acid sequence shown as SEQ ID NO. 4;
the polypeptide with the mass-to-charge ratio of M/Z _1157.14 has an amino acid sequence shown in SEQ ID NO. 5;
the polypeptide with the mass-to-charge ratio of M/Z _1212.03 has an amino acid sequence shown in SEQ ID NO. 6;
the polypeptide with the mass-to-charge ratio of M/Z _1250.48 has an amino acid sequence shown in SEQ ID NO. 7;
the polypeptide with the mass-to-charge ratio of M/Z _1283.46 has an amino acid sequence shown as SEQ ID NO. 8;
the polypeptide with the mass-to-charge ratio of M/Z _1396.23 has an amino acid sequence shown in SEQ ID NO. 9;
the polypeptide with the mass-to-charge ratio of M/Z _1557.53 has an amino acid sequence shown in SEQ ID NO: 10;
the polypeptide with the mass-to-charge ratio of M/Z _1563.44 has an amino acid sequence shown as SEQ ID NO. 11;
the polypeptide with the mass-to-charge ratio of M/Z _1594.15 has an amino acid sequence shown as SEQ ID NO. 12;
the polypeptide with the mass-to-charge ratio of M/Z _1637.03 has an amino acid sequence shown in SEQ ID NO. 13;
the polypeptide with the mass-to-charge ratio of M/Z _1752.24 has an amino acid sequence shown in SEQ ID NO. 14;
the polypeptide with the mass-to-charge ratio of M/Z _1892.39 has an amino acid sequence shown in SEQ ID NO. 15;
the polypeptide with the mass-to-charge ratio of M/Z _1900.89 has an amino acid sequence shown as SEQ ID NO: 16;
the polypeptide with the mass-to-charge ratio of M/Z _1909.6 has an amino acid sequence shown as SEQ ID NO: 17;
the polypeptide with the mass-to-charge ratio of M/Z _1932.05 has an amino acid sequence shown by SEQ ID NO. 18;
the polypeptide with the mass-to-charge ratio of M/Z _1948.19 has an amino acid sequence shown as SEQ ID NO: 19;
the polypeptide with the mass-to-charge ratio of M/Z _1975.09 has an amino acid sequence shown in SEQ ID NO. 20;
the polypeptide with the mass-to-charge ratio of M/Z _2037.78 has an amino acid sequence shown as SEQ ID NO: 21;
the polypeptide with the mass-to-charge ratio of M/Z _2078.53 has an amino acid sequence shown in SEQ ID NO. 22;
the polypeptide with the mass-to-charge ratio of M/Z _2265.63 has an amino acid sequence shown as SEQ ID NO. 23;
the polypeptide with the mass-to-charge ratio of M/Z _2279.32 has an amino acid sequence shown as SEQ ID NO: 24;
the polypeptide with the mass-to-charge ratio of M/Z _2427.4 has an amino acid sequence shown as SEQ ID NO. 25;
in certain embodiments, the agent is for diagnosing CKD.
In another aspect, the present application provides a system for diagnosing CKD comprising a reagent or device that detects a marker described herein.
In another aspect, the present application provides a method of diagnosing CKD, comprising detecting the presence or amount of a marker described herein in a urine sample from a subject.
In another aspect, the present application provides a marker for diagnosing IgAN, the marker comprising one or more polypeptides selected from the group consisting of polypeptides having mass-to-charge ratios as follows: M/Z _1049.48, M/Z _1212.03, M/Z _1948.19, M/Z _6215.39, M/Z _1594.15, M/Z _2941.82, M/Z _3279.24, M/Z _2265.63, M/Z _1637.03, M/Z _1089.32, M/Z _2427.4, M/Z _1734.95, M/Z _3040.02, M/Z _1267.75, M/Z _1909.6, M/Z _1932.05, M/Z _1250.48, M/Z _2037.78, M/Z _1157.14, M/Z _6233.15, M/Z _1892.39, M/Z _1130.79, M/Z _1900.89, M/Z _1233.81, M/Z _1557.53, M/Z _4744.24, M/Z _2124.48, M/Z _6133.13, M/Z _1394.04, M/Z _1608.59, M/Z _1629.53, M/Z _1686.93, M/Z _1752.24, M/Z _1803.01, M/Z _2412.91, M/Z _2585.93, M/Z _2601.72, M/Z _2726.14, M/Z _2733.04, M/Z _2999.97, M/Z _3021.7, M/Z _3208.35, M/Z _3286.12, M/Z _3324.23 and M/Z _6177.18.
In certain embodiments, the marker comprises one or more polypeptides selected from the group consisting of polypeptides having mass-to-charge ratios as follows: M/Z _1049.48, M/Z _1089.32, M/Z _1130.79, M/Z _1157.14, M/Z _1212.03, M/Z _1233.81, M/Z _1250.48, M/Z _1267.75, M/Z _1394.04, M/Z _1557.53, M/Z _1594.15, M/Z _1608.59, M/Z _1629.53, M/Z _1637.03, M/Z _1686.93, M/Z _1734.95, M/Z _1752.24, M/Z _1803.01, M/Z _1892.39, M/Z _1900.89, M/Z _1909.6, M/Z _1932.05, M/Z _1948.19, M/Z _2037.78, M/Z _2265.63, M/Z _2412.91, M/Z _2427.4, M/Z _2726.14, M/Z _2733.04, M/Z _2941.82 and M/Z _3324.23.
In certain embodiments, the polypeptide with the mass-to-charge ratio of M/Z _1049.48 has an amino acid sequence shown in SEQ ID NO: 1;
the polypeptide with the mass-to-charge ratio of M/Z _1089.32 has an amino acid sequence shown as SEQ ID NO. 2;
the polypeptide with the mass-to-charge ratio of M/Z _1130.79 has an amino acid sequence shown as SEQ ID NO. 4;
the polypeptide with the mass-to-charge ratio of M/Z _1157.14 has an amino acid sequence shown in SEQ ID NO. 5;
the polypeptide with the mass-to-charge ratio of M/Z _1212.03 has an amino acid sequence shown in SEQ ID NO. 6;
the polypeptide with the mass-to-charge ratio of M/Z _1233.81 has an amino acid sequence shown as SEQ ID NO: 26;
the polypeptide with the mass-to-charge ratio of M/Z _1250.48 has an amino acid sequence shown in SEQ ID NO. 7;
the polypeptide with the mass-to-charge ratio of M/Z _1267.75 has an amino acid sequence shown in SEQ ID NO. 27;
the polypeptide with the mass-to-charge ratio of M/Z _1394.04 has an amino acid sequence shown in SEQ ID NO. 28;
the polypeptide with the mass-to-charge ratio of M/Z _1557.53 has an amino acid sequence shown in SEQ ID NO: 10;
the polypeptide with the mass-to-charge ratio of M/Z _1594.15 has an amino acid sequence shown as SEQ ID NO. 12;
the polypeptide with the mass-to-charge ratio of M/Z _1608.59 has an amino acid sequence shown as SEQ ID NO. 29;
the polypeptide with the mass-to-charge ratio of M/Z _1629.53 has an amino acid sequence shown in SEQ ID NO. 30;
the polypeptide with the mass-to-charge ratio of M/Z _1637.03 has an amino acid sequence shown in SEQ ID NO. 13;
the polypeptide with the mass-to-charge ratio of M/Z _1686.93 has an amino acid sequence shown as SEQ ID NO. 31;
the polypeptide with the mass-to-charge ratio of M/Z _1734.95 has an amino acid sequence shown as SEQ ID NO. 32;
the polypeptide with the mass-to-charge ratio of M/Z _1752.24 has an amino acid sequence shown in SEQ ID NO. 14;
the polypeptide with the mass-to-charge ratio of M/Z _1803.01 has an amino acid sequence shown in SEQ ID NO. 33;
the polypeptide with the mass-to-charge ratio of M/Z _1892.39 has an amino acid sequence shown in SEQ ID NO. 15;
the polypeptide with the mass-to-charge ratio of M/Z _1900.89 has an amino acid sequence shown as SEQ ID NO: 16;
the polypeptide with the mass-to-charge ratio of M/Z _1909.6 has an amino acid sequence shown as SEQ ID NO: 17;
the polypeptide with the mass-to-charge ratio of M/Z _1932.05 has an amino acid sequence shown in SEQ ID NO: 18;
the polypeptide with the mass-to-charge ratio of M/Z _1948.19 has an amino acid sequence shown as SEQ ID NO. 19;
the polypeptide with the mass-to-charge ratio of M/Z _2037.78 has an amino acid sequence shown as SEQ ID NO: 21;
the polypeptide with the mass-to-charge ratio of M/Z _2265.63 has an amino acid sequence shown in SEQ ID NO. 23;
the polypeptide with the mass-to-charge ratio of M/Z _2412.91 has an amino acid sequence shown as SEQ ID NO: 34;
the polypeptide with the mass-to-charge ratio of M/Z _2427.4 has an amino acid sequence shown as SEQ ID NO. 25;
the polypeptide with the mass-to-charge ratio of M/Z _2726.14 has an amino acid sequence shown as SEQ ID NO. 35;
the polypeptide with the mass-to-charge ratio of M/Z _2733.04 has an amino acid sequence shown in SEQ ID NO: 36;
the polypeptide with the mass-to-charge ratio of M/Z _2941.82 has an amino acid sequence shown as SEQ ID NO. 37;
the polypeptide with the mass-to-charge ratio of M/Z _3324.23 has an amino acid sequence shown in SEQ ID NO: 38.
In another aspect, the present application also provides a use of the marker for preparing an agent for diagnosing IgAN.
In another aspect, the present application provides a system for diagnosing IgAN comprising a reagent or device for detecting a marker described herein.
In another aspect, the present application provides a method for diagnosing IgAN, comprising detecting the presence or amount of a marker described herein in a urine sample from a subject.
The present application provides diagnostic markers for chronic kidney disease (CKD) and IgAN. The sample tested is urine, which only needs to be collected from the patient, so the test causes no injury or risk to the patient and is safe and reliable. The detection method is simple to carry out and noninvasive, greatly broadens the applicable population, has no obvious contraindications, can be repeated, and is suitable for all patients. The screened polypeptides can serve as novel biomarkers for evaluating the progression of kidney function in patients with CKD or IgA nephropathy.
Other aspects and advantages of the present application will be readily apparent to those skilled in the art from the following detailed description. Only exemplary embodiments of the present application are shown and described in the following detailed description. As those skilled in the art will recognize, the disclosure of the present application enables those skilled in the art to make changes to the specific embodiments disclosed without departing from the spirit and scope of the invention to which this application relates. Accordingly, the drawings and the description in the specification of the present application are illustrative only and not limiting.
Drawings
The specific features of the invention to which this application relates are set forth in the appended claims. The features and advantages of the invention to which this application relates will be better understood by reference to the exemplary embodiments described in detail below and the accompanying drawings. The brief description of the drawings is as follows:
FIG. 1 shows a method for constructing a disease diagnosis model according to the present application.
FIG. 2 shows the screening method for disease characteristic peaks according to the Medium criteria described in the present application.
FIG. 3 shows the AUC results of the ROC curve of the present application using 7 machine learning methods in the CKD diagnostic model.
FIG. 4 shows the AUC results of the ROC curve of the present application in the IgAN diagnostic model using 7 machine learning methods.
FIG. 5 shows the peak intensity variation of 31 characteristic CKD peaks in this application among healthy control, eGFR grades 1-2, eGFR grades 3-4 groups.
FIG. 6 shows that the intensities of the 6 characteristic CKD peaks M/Z _1250.48, M/Z _1900.89, M/Z _1909.6, M/Z _1975.09, M/Z _2925.93 and M/Z _6215.39 in this application are significantly increased/decreased between healthy control, eGFR 1-2 and eGFR 3-4, respectively.
FIG. 7 shows the peak intensity variation of the 45 characteristic peaks of IgAN in this application among the healthy control, Lee grade 1-2 and Lee grade 3-4 groups.
FIG. 8 shows that the intensities of the 4 characteristic peaks M/Z _1752.24, M/Z _1932.05, M/Z _2427.4 and M/Z _6215.39 of IgAN in the present application are respectively and significantly increased/decreased between the healthy control, lee grade 1-2 and Lee grade 3-4.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification.
LASSO stands for least absolute shrinkage and selection operator and is a shrinkage estimation method. It obtains a more parsimonious model by constructing a penalty function that shrinks some regression coefficients, i.e., forces the sum of the absolute values of the coefficients to be less than a fixed value, while setting other regression coefficients exactly to zero. It thus retains the advantage of subset selection and is a biased estimation method for handling data with complex collinearity.
Partial least squares discriminant analysis (PLS-DA) is a multivariate statistical analysis method used for discriminant analysis. Discriminant analysis is a common statistical method that assigns a study object to a class based on the observed or measured values of several variables. In principle, a model is trained on the characteristics of the different sample groups (for example, observation samples and control samples) to produce a training set, and the reliability of the training set is then tested.
RFECV refers to finding the optimal number of features through cross-validation. RFE (recursive feature elimination) ranks features by importance; CV (cross-validation) means that, after the features have been ranked, the optimal number of features is selected by cross-validation.
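The way an absolute-value penalty sets some coefficients exactly to zero can be illustrated with the soft-thresholding operator, which is the closed-form Lasso update for a single standardized coefficient. The following is a minimal sketch only; the peak names, coefficient values and penalty are hypothetical and are not taken from the present application.

```python
def soft_threshold(value, penalty):
    """Shrink a coefficient toward zero; return exactly 0.0 if |value| <= penalty."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0

# Hypothetical unpenalized coefficients for four peaks (illustrative values only).
coefficients = {"peak_a": 0.90, "peak_b": -0.05, "peak_c": 0.30, "peak_d": -0.60}

# Apply a hypothetical penalty of 0.25 to every coefficient.
shrunk = {name: soft_threshold(c, 0.25) for name, c in coefficients.items()}

# Peaks whose coefficient survives the penalty are the ones the model keeps.
selected = [name for name, c in shrunk.items() if c != 0.0]
print(selected)
```

Larger penalties zero out more coefficients, which is why the Lasso can act as a feature selector as well as a regression method.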
The machine learning method may include regression, classification, or a combination thereof. The term "machine learning" generally refers to algorithms that impart learning capabilities to a computer without explicit programming, including algorithms that learn from and make predictions about data. Machine learning algorithms used by embodiments disclosed herein may include, but are not limited to, random forest ("RF"), least absolute shrinkage and selection operator ("LASSO") logistic regression, regularized logistic regression, XGBoost, decision tree learning, artificial neural networks ("ANN"), deep neural networks ("DNN"), support vector machines, rule-based machine learning, and the like. Algorithms such as linear regression or logistic regression may be used as part of the machine learning process. However, it will be appreciated that the use of linear regression or another algorithm as part of the machine learning process is different from the implementation of statistical analysis, such as regression using a spreadsheet program.
Ten-fold cross-validation (10-fold cross-validation) is a commonly used method for testing the accuracy of an algorithm. The data set is divided into ten parts; in turn, nine parts are used as training data and one part as test data. Each trial yields an accuracy (or error rate), and the average over the 10 trials is used as an estimate of the accuracy of the algorithm. In general, 10-fold cross-validation is performed several times (for example, 10 rounds of 10-fold cross-validation) and the results are averaged to estimate the accuracy of the algorithm. It should be noted that the ten-fold cross-validation accuracy is correlated with, but not equivalent to, the actual detection accuracy (or sensitivity). When evaluating the algorithm, if the ten-fold cross-validation accuracy falls within the confidence interval, varies in a correlated manner with the number of characteristic polypeptides, and reaches a value feasible for clinical diagnosis, the mass spectrometry model constructed from those polypeptides is considered to meet the requirements of clinical diagnosis.
Unless otherwise indicated, when reference is made herein to a mass spectrum or to data presented in graphical form (e.g., MALDI-ToF MS data), the term "peak" refers to a peak or other feature that is not caused by background noise and that is recognizable to one of ordinary skill in the art.
As is well known to those skilled in the art, for any given molecule (e.g., a polypeptide), some variability in the appearance, intensity, and position of peaks in the mass spectrum may be caused by the apparatus used to obtain the mass spectrum (e.g., the type of ion source, the spatial and temporal characteristics of the ion beam), humidity, temperature, orientation, and other parameters. In the present case, a variability of ± 0.002 ppm in peak position takes these possible variations into account without hindering the unambiguous identification of the indicated molecules. The identification of the molecules can be based on any unique distinct peak or combination thereof, typically the more prominent peaks. Depending on the instrumentation used in this application, a margin of error of ± 0.002 ppm may exist for the characteristic peak positions.
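The fold rotation described above can be sketched in a few lines. This is an illustrative sketch, not the implementation used in the present application: the helper names, the toy threshold classifier and the synthetic intensities are all assumptions introduced for the example.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validated_accuracy(samples, labels, train_and_predict, k=10, seed=0):
    """Use each fold once as test data, train on the rest, and average the accuracies."""
    accuracies = []
    for test_idx in k_fold_indices(len(samples), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(samples)) if i not in held_out]
        predictions = train_and_predict(
            [samples[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [samples[i] for i in test_idx],
        )
        correct = sum(p == labels[i] for p, i in zip(predictions, test_idx))
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k

def threshold_classifier(train_x, train_y, test_x):
    """Toy classifier (assumption): cut at the midpoint between the two class means."""
    pos = [x for x, y in zip(train_x, train_y) if y == 1]
    neg = [x for x, y in zip(train_x, train_y) if y == 0]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return [1 if x > cut else 0 for x in test_x]

# Synthetic, well-separated single-peak intensities for 20 controls and 20 cases.
intensities = [0.1 * i for i in range(20)] + [5 + 0.1 * i for i in range(20)]
labels = [0] * 20 + [1] * 20
accuracy = cross_validated_accuracy(intensities, labels, threshold_classifier)
print(accuracy)
```

Repeating the procedure with different seeds and averaging the results corresponds to running 10-fold cross-validation several times.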
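A relative error expressed in ppm is the absolute difference between an observed and a reference m/z value, divided by the reference value and multiplied by one million. The sketch below shows how an observed peak might be assigned to a reference peak under such a tolerance; the function names are hypothetical, the tolerance is simply passed in by the caller, and the reference values are the five peaks listed for the CKD marker above.

```python
def relative_error_ppm(observed_mz, reference_mz):
    """Relative deviation of an observed m/z from a reference m/z, in parts per million."""
    return abs(observed_mz - reference_mz) / reference_mz * 1e6

def match_peak(observed_mz, reference_peaks, tolerance_ppm):
    """Return the closest reference peak if it lies within the tolerance, else None."""
    best = min(reference_peaks, key=lambda ref: relative_error_ppm(observed_mz, ref))
    if relative_error_ppm(observed_mz, best) <= tolerance_ppm:
        return best
    return None

reference_peaks = [1948.19, 1909.6, 1932.05, 2427.4, 1637.03]

# An exact hit is assigned; a peak about 1.8 m/z units away is rejected.
print(match_peak(1948.19, reference_peaks, tolerance_ppm=0.002))
print(match_peak(1950.0, reference_peaks, tolerance_ppm=0.002))
```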
The application also discloses the following specific embodiments:
1. a method of constructing a diagnostic model for Chronic Kidney Disease (CKD) based on MALDI-ToF MS data, the method comprising:
i) performing characteristic peak screening on MALDI-ToF MS data of urine samples of the CKD population and the healthy population using three machine learning methods, namely the least absolute shrinkage and selection operator (Lasso), partial least squares discriminant analysis (PLS-DA) and recursive feature elimination with cross-validation (RFECV), and selecting peaks that satisfy any one of the three methods as candidate characteristic peaks;
in the Lasso selection, 80% of the samples of the test data set are randomly drawn, the draw is repeated 200 times, and peaks selected in more than 10% of the repetitions are taken as difference characteristic peaks of CKD; in the PLS-DA selection, peaks whose VIP score ranks in the top 30 are taken as difference characteristic peaks of CKD; in the RFECV selection, peaks whose importance score ranks in the top 30 are taken as difference characteristic peaks of CKD;
ii) screening candidate characteristic peaks satisfying both frequency >30% and AUC >60% of each group in i) as characteristic peaks;
iii) establishing a CKD diagnosis model by applying a machine learning method to the identified difference characteristic peaks.
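The filtering step ii) of embodiment 1 can be sketched as follows. This is one possible reading, stated as an assumption: "frequency" is taken to mean the fraction of samples in each group in which the peak is detected (intensity above zero), and the AUC is computed from the peak intensity alone and made direction-independent, so that peaks that decrease in CKD are not discarded. The function names and the intensity values are hypothetical.

```python
def auc(scores_pos, scores_neg):
    """Probability that a random positive outranks a random negative (ties count half)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

def detection_frequency(intensities):
    """Fraction of samples in which the peak was detected (assumption: intensity > 0)."""
    return sum(1 for v in intensities if v > 0) / len(intensities)

def passes_filter(ckd, healthy, min_freq=0.30, min_auc=0.60):
    """Keep a candidate peak only if it is frequent in both groups and discriminative."""
    a = auc(ckd, healthy)
    direction_free_auc = max(a, 1.0 - a)  # assumption: either direction of change counts
    return (
        detection_frequency(ckd) > min_freq
        and detection_frequency(healthy) > min_freq
        and direction_free_auc > min_auc
    )

# Hypothetical intensities of two candidate peaks in five CKD and five healthy samples.
peak_x = ([5.0, 6.0, 0.0, 7.0, 8.0], [1.0, 2.0, 0.0, 3.0, 2.5])
peak_y = ([0.0, 0.0, 0.0, 0.0, 4.0], [0.0, 3.0, 0.0, 0.0, 0.0])

print(passes_filter(*peak_x))  # frequent in both groups and well separated
print(passes_filter(*peak_y))  # detected in only 20% of each group
```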
2. The method according to embodiment 1, wherein the difference characteristic peaks between the CKD population and the healthy population are selected from peaks having the following mass-to-charge ratios: M/Z _1049.48, M/Z _1157.14, M/Z _1637.03, M/Z _1948.19, M/Z _4744.24, M/Z _6215.39, M/Z _1089.32, M/Z _2265.63, M/Z _1594.15, M/Z _1900.89, M/Z _3040.02, M/Z _1108.1, M/Z _2279.32, M/Z _2925.93, M/Z _1563.44, M/Z _2078.53, M/Z _1396.23, M/Z _1909.6, M/Z _1932.05, M/Z _1250.48, M/Z _1212.03, M/Z _1892.39, M/Z _2037.78, M/Z _1752.24, M/Z _6233.15, M/Z _1557.53, M/Z _2124.48, M/Z _1130.79, M/Z _1975.09, M/Z _1283.46, M/Z _1401.5, M/Z _1233.81, M/Z _6194.59, M/Z _1291.05, M/Z _1267.75, M/Z _6133.13, M/Z _1077.66, M/Z _2427.4, M/Z _2412.91, M/Z _2540.75, M/Z _3279.24 and M/Z _1734.95; wherein the calculated relative error of M/Z (ppm) is within 0.002.
3. The method according to embodiment 1, wherein the peaks in i) that meet any one of the following criteria are selected as candidate characteristic peaks: in the Lasso selection, 80% of the samples of the test data set are randomly drawn, the draw is repeated 200 times, and peaks selected in more than 20% of the repetitions are taken as difference characteristic peaks of CKD; in the PLS-DA selection, peaks whose VIP score ranks in the top 20 are taken as difference characteristic peaks of CKD; in the RFECV selection, peaks whose importance score ranks in the top 20 are taken as difference characteristic peaks of CKD.
4. The method according to embodiment 3, wherein the differential characteristic peaks between the CKD population and the healthy population are selected from peaks having the following mass to charge ratios: M/Z1049.48, M/Z1157.14, M/Z1637.03, M/Z1948.19, M/Z4744.24, M/Z6215.39, M/Z1089.32, M/Z2265.63, M/Z1594.15, M/Z1900.89, M/Z3040.02, M/Z1108.1, M/Z2279.32, M/Z2925.93, M/Z1563.44, M/Z2078.53, M/Z1396.23, M/Z1909.6, M/Z1932.05, M/Z6248, M/Z1212.03, M/Z2031892.39, M/Z7.78, M/Z1752.24, M/Z6233.1250, M/Z627.46, M/Z1554, M/Z1554.9.46, M/Z1283.79, M/Z1554.9.3.3.53, M/Z1288.53, M/Z649.3.53, M/Z649.3.3.3, M/Z3.3.3.3, M/Z3.3.3.53, M/Z-D; wherein the calculated relative error of M/Z (ppm) is within 0.002.
5. The method according to any one of embodiments 3-4, wherein the characteristic peaks of difference between the CKD population and the healthy population are selected from the group consisting of peaks with the following mass-to-charge ratios: one or more of M/Z _1049.48, M/Z _1089.32, M/Z _1130.79, M/Z _1157.14, M/Z _1212.03, M/Z _1250.48, M/Z _1557.53, M/Z _1594.15, M/Z _1637.03, M/Z _1752.24, M/Z _1892.39, M/Z _1900.89, M/Z _1909.6, M/Z _1932.05, M/Z _1948.19, M/Z _2037.78, M/Z _2124.48, M/Z _2265.63, M/Z _2427.4, M/Z _3040.02, M/Z _4744.24, M/Z _6215.39 and M/Z _6233.15; wherein the calculated relative error of M/Z (ppm) is within 0.002.
6. The method according to embodiment 1, wherein the peaks in i) that meet any one of the following criteria are selected as candidate characteristic peaks: in the Lasso selection, 80% of the samples of the test data set are randomly drawn, the draw is repeated 200 times, and peaks selected in more than 30% of the repetitions are taken as difference characteristic peaks of CKD; in the PLS-DA selection, peaks whose VIP score ranks in the top 10 are taken as difference characteristic peaks of CKD; in the RFECV selection, peaks whose importance score ranks in the top 10 are taken as difference characteristic peaks of CKD.
7. The method of embodiment 6, wherein the differential characteristic peaks between the CKD population and the healthy population are selected from peaks having the following mass to charge ratios: one or more of M/Z _1049.48, M/Z _1157.14, M/Z _1637.03, M/Z _1948.19, M/Z _4744.24, M/Z _6215.39, M/Z _1089.32, M/Z _2265.63, M/Z _1594.15, M/Z _1900.89, M/Z _3040.02, M/Z _1909.6, M/Z _1932.05, M/Z _1250.48, M/Z _1212.03, M/Z _1892.39, M/Z _2124.48, and M/Z _2427.4; wherein the calculated relative error of M/Z (ppm) is within 0.002.
8. The method according to any one of embodiments 1-7, wherein establishing the CKD diagnosis model using a machine learning method comprises building the model with 10-fold cross-validation repeated 5 times.
9. The method of embodiment 8, wherein the machine learning method comprises: support vector machine (SVM), random forest (RF), naive Bayes (NB), gradient boosting machine (GBM), K-nearest neighbors (KNN), conditional inference tree (ctree), and/or adaptive boosting (AdaBoost).
10. The method according to any one of embodiments 1 to 9, further comprising performing quality control processing and standardization processing on the MALDI-ToF MS data of the urine samples of the CKD population and the healthy population before screening the MALDI-ToF MS data of the urine samples of the CKD population and the healthy population.
11. The method according to embodiment 10, wherein the quality control processing comprises: i) quality control, ii) variance smoothing, iii) smoothing and baseline correction, and iv) intensity correction.
12. The method of embodiment 10, wherein the quality control comprises checking that all spectra contain the same number of data points and contain no NA values.
13. The method of any of embodiments 11-12, wherein the variance smoothing comprises using a square root transform on raw mass spectral data.
14. The method of any of embodiments 11-13, wherein the smoothing and baseline correction comprises smoothing the spectra with a 21-point Savitzky-Golay filter and then removing baseline noise with the SNIP algorithm.
15. The method according to any of embodiments 11-14, wherein the intensity correction comprises equalizing the intensity values using total ion current (TIC) calibration.
16. The method according to any one of embodiments 11-15, wherein the quality control processing comprises:
i) quality control, comprising checking that all spectra contain the same number of data points and contain no NA values;
ii) variance smoothing, comprising applying a square-root transform to the raw mass spectral data;
iii) smoothing and baseline correction, comprising smoothing the spectra with a 21-point Savitzky-Golay filter and then removing baseline noise with the SNIP algorithm; and
iv) intensity correction, comprising equalizing the intensity values using total ion current (TIC) calibration.
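Two of the preprocessing steps in embodiment 16, the square-root transform and total ion current calibration, are simple enough to sketch directly. The sketch below is illustrative only: it assumes that TIC calibration means dividing each intensity by the spectrum's summed intensity, and it omits Savitzky-Golay smoothing and SNIP baseline removal, which would normally come from a signal-processing or mass-spectrometry library. The function names and the four-point spectrum are hypothetical.

```python
import math

def sqrt_transform(intensities):
    """Square-root transform of raw intensities (negative values are clipped to 0)."""
    return [math.sqrt(max(v, 0.0)) for v in intensities]

def tic_calibrate(intensities):
    """Scale a spectrum so that its total ion current (sum of intensities) equals 1."""
    total = sum(intensities)
    if total == 0:
        return list(intensities)  # an all-zero spectrum is returned unchanged
    return [v / total for v in intensities]

# Hypothetical raw spectrum with four data points.
raw = [4.0, 9.0, 16.0, 0.0]
processed = tic_calibrate(sqrt_transform(raw))
print(processed)
```

After this scaling, spectra acquired with different overall signal levels can be compared on a common intensity scale.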
17. The method according to any of embodiments 10-16, wherein the standardization processing comprises: i) peak mass alignment, ii) peak identification and iii) peak merging;
wherein the peak mass alignment comprises first identifying landmark peaks that are present in a majority of the spectra, and then calculating a non-linear warping function for each spectrum by fitting a local regression to the matched reference peaks;
the peak identification comprises identifying peaks whose intensity is at least twice the noise value (signal-to-noise ratio, SNR ≥ 2) as signal peaks; and
the peak merging comprises merging signal peaks that lie within a tolerance of 0.002 ppm into a single signal peak.
18. The method of any one of embodiments 10-17, wherein the standardization processing further comprises removing, as false positives, peaks that occur in fewer than 25% of the spectra within a group.
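The signal-to-noise criterion and the within-group frequency filter of embodiments 17 and 18 can be sketched as below. The sketch assumes that a noise estimate is already available for each spectrum and that peaks have already been aligned and merged, so that identical m/z values denote the same peak across samples; the function names and example values are hypothetical.

```python
def signal_peaks(peaks, noise):
    """Keep (mz, intensity) pairs whose signal-to-noise ratio is at least 2."""
    return [(mz, intensity) for mz, intensity in peaks if intensity / noise >= 2]

def drop_rare_peaks(group_spectra, min_fraction=0.25):
    """Remove peaks found in fewer than min_fraction of the spectra of one group.

    group_spectra is a list of sets of aligned m/z values, one set per sample.
    """
    counts = {}
    for spectrum in group_spectra:
        for mz in spectrum:
            counts[mz] = counts.get(mz, 0) + 1
    n = len(group_spectra)
    keep = {mz for mz, c in counts.items() if c / n >= min_fraction}
    return [spectrum & keep for spectrum in group_spectra]

# One hypothetical spectrum: the second peak is below twice the noise level.
detected = signal_peaks([(1049.48, 12.0), (1212.03, 1.5), (1948.19, 4.0)], noise=1.0)

# Five hypothetical samples of one group; 9999.0 occurs in only 1 of 5 spectra (20%).
group = [
    {1049.48, 1212.03},
    {1049.48, 1212.03},
    {1049.48},
    {1049.48, 9999.0},
    {1049.48},
]
filtered = drop_rare_peaks(group)
print(detected)
print(filtered)
```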
19. The method according to any one of embodiments 1-18, comprising:
i) MALDI-ToF MS detection reading is carried out on urine samples of CKD people and healthy people to obtain fingerprint spectra of two groups of urine polypeptides;
ii) performing quality control treatment and standardized treatment on the urine polypeptide fingerprints of CKD (CKD disease) population and healthy population;
iii) Screening characteristic peaks of the urine polypeptide fingerprints of the CKD population and the healthy population using three machine learning methods, namely the least absolute shrinkage and selection operator (Lasso), partial least squares discriminant analysis (PLS-DA) and recursive feature elimination with cross-validation (RFECV), and selecting a peak that satisfies any one of the three machine learning methods as a candidate characteristic peak;
in the Lasso selection, 80% of the samples of the test data set are randomly selected, this is repeated 200 times, and a peak selected in more than 10% of the repetitions is taken as a difference characteristic peak of CKD; in the PLS-DA selection, a peak whose VIP score ranks in the top 30 is taken as a difference characteristic peak of CKD; in the RFECV selection, a peak whose importance score ranks in the top 30 is taken as a difference characteristic peak of CKD;
iv) Screening the candidate characteristic peaks from iii) that satisfy both frequency >30% and AUC >60% in each group as characteristic peaks;
v) Establishing a CKD disease diagnosis model from the identified difference characteristic peaks using 7 machine learning methods: support vector machine (SVM), random forest (RF), naive Bayes (NB), gradient boosting (GBM), K-nearest neighbors (KNN), conditional inference decision tree (ctree) and adaptive boosting (AdaBoost).
20. The method of embodiment 19, further comprising using the AUC indicator to evaluate a disease diagnostic model.
21. The method of embodiment 20, wherein said using an AUC measure to evaluate a disease diagnosis model comprises validating the disease diagnosis model using MALDI-ToF MS data of an independent CKD population and a healthy population.
22. A method of constructing a diagnostic model for Chronic Kidney Disease (CKD) based on MALDI-ToF MS data, the method comprising:
i) Screening MALDI-ToF MS data of urine samples of a CKD population and a healthy population to obtain one or more difference characteristic peaks between the CKD population and the healthy population having the following mass-to-charge ratios: M/Z_1049.48, M/Z_1157.14, M/Z_1637.03, M/Z_1948.19, M/Z_4744.24, M/Z_6215.39, M/Z_1089.32, M/Z_2265.63, M/Z_1594.15, M/Z_1900.89, M/Z_3040.02, M/Z_1909.6, M/Z_1932.05, M/Z_1250.48, M/Z_1212.03, M/Z_1892.39, M/Z_2124.48 and M/Z_2427.4; wherein the calculated relative error of M/Z (ppm) is within 0.002;
ii) Establishing a CKD disease diagnosis model from the identified difference characteristic peaks using a machine learning method.
23. The method of embodiment 22, wherein the characteristic peak of difference between the CKD population and the healthy population is selected from the group consisting of: M/Z_1049.48, M/Z_1157.14, M/Z_1637.03, M/Z_1948.19, M/Z_4744.24, M/Z_6215.39, M/Z_1089.32, M/Z_2265.63, M/Z_1594.15, M/Z_1900.89, M/Z_3040.02, M/Z_1108.1, M/Z_2279.32, M/Z_2925.93, M/Z_1563.44, M/Z_2078.53, M/Z_1396.23, M/Z_1909.6, M/Z_1932.05, M/Z_1250.48, M/Z_1212.03, M/Z_1892.39, M/Z_2037.78, M/Z_1752.24, M/Z_6233.15, M/Z_1557.53, M/Z_2124.48, M/Z_1130.79, M/Z_1975.09, M/Z_1283.46 and M/Z_2427.4; wherein the calculated relative error of M/Z (ppm) is within 0.002.
24. The method according to any one of embodiments 22-23, wherein the characteristic peak of difference between the CKD population and the healthy population is selected from the group consisting of: M/Z_1049.48, M/Z_1157.14, M/Z_1637.03, M/Z_1948.19, M/Z_4744.24, M/Z_6215.39, M/Z_1089.32, M/Z_2265.63, M/Z_1594.15, M/Z_1900.89, M/Z_3040.02, M/Z_1108.1, M/Z_2279.32, M/Z_2925.93, M/Z_1563.44, M/Z_2078.53, M/Z_1396.23, M/Z_1909.6, M/Z_1932.05, M/Z_1250.48, M/Z_1212.03, M/Z_1892.39, M/Z_2037.78, M/Z_1752.24, M/Z_6233.15, M/Z_1557.53, M/Z_2124.48, M/Z_1130.79, M/Z_1975.09, M/Z_1283.46, M/Z_1401.5, M/Z_1233.81, M/Z_6194.59, M/Z_1291.05, M/Z_1267.75, M/Z_6133.13, M/Z_1077.66, M/Z_2427.4, M/Z_2412.91, M/Z_2540.75, M/Z_3279.24 and M/Z_1734.95; wherein the calculated relative error of M/Z (ppm) is within 0.002.
25. The method according to any one of embodiments 22-24, wherein the characteristic peaks of difference between the CKD population and the healthy population comprise: M/Z_1948.19, M/Z_1909.6, M/Z_1932.05, M/Z_2427.4 and M/Z_1637.03; wherein the calculated relative error of M/Z (ppm) is within 0.002.
26. The method according to any one of embodiments 22-25, wherein the characteristic peaks of difference between the CKD population and the healthy population comprise: M/Z_1049.48, M/Z_1157.14, M/Z_1637.03, M/Z_1948.19, M/Z_4744.24, M/Z_6215.39, M/Z_1089.32, M/Z_2265.63, M/Z_1594.15, M/Z_1900.89, M/Z_3040.02, M/Z_1909.6, M/Z_1932.05, M/Z_1250.48, M/Z_1212.03, M/Z_1892.39, M/Z_2124.48 and M/Z_2427.4; wherein the calculated relative error of M/Z (ppm) is within 0.002.
27. The method according to any one of embodiments 22-26, wherein the characteristic peaks of difference between the CKD population and the healthy population comprise: M/Z_1049.48, M/Z_1089.32, M/Z_1130.79, M/Z_1157.14, M/Z_1212.03, M/Z_1250.48, M/Z_1557.53, M/Z_1594.15, M/Z_1637.03, M/Z_1752.24, M/Z_1892.39, M/Z_1900.89, M/Z_1909.6, M/Z_1932.05, M/Z_1948.19, M/Z_2037.78, M/Z_2124.48, M/Z_2265.63, M/Z_2427.4, M/Z_3040.02, M/Z_4744.24, M/Z_6215.39 and M/Z_6233.15; wherein the calculated relative error of M/Z (ppm) is within 0.002.
28. The method according to any one of embodiments 22-27, wherein the characteristic peaks of difference between the CKD population and the healthy population comprise: M/Z_1049.48, M/Z_1157.14, M/Z_1637.03, M/Z_1948.19, M/Z_4744.24, M/Z_6215.39, M/Z_1089.32, M/Z_2265.63, M/Z_1594.15, M/Z_1900.89, M/Z_3040.02, M/Z_1108.1, M/Z_2279.32, M/Z_2925.93, M/Z_1563.44, M/Z_2078.53, M/Z_1396.23, M/Z_1909.6, M/Z_1932.05, M/Z_1250.48, M/Z_1212.03, M/Z_1892.39, M/Z_2037.78, M/Z_1752.24, M/Z_6233.15, M/Z_1557.53, M/Z_2124.48, M/Z_1130.79, M/Z_1975.09, M/Z_1283.46 and M/Z_2427.4; wherein the calculated relative error of M/Z (ppm) is within 0.002.
29. The method according to any one of embodiments 22-28, wherein the characteristic peaks of difference between the CKD population and the healthy population comprise: M/Z_1049.48, M/Z_1157.14, M/Z_1637.03, M/Z_1948.19, M/Z_4744.24, M/Z_6215.39, M/Z_1089.32, M/Z_2265.63, M/Z_1594.15, M/Z_1900.89, M/Z_3040.02, M/Z_1108.1, M/Z_2279.32, M/Z_2925.93, M/Z_1563.44, M/Z_2078.53, M/Z_1396.23, M/Z_1909.6, M/Z_1932.05, M/Z_1250.48, M/Z_1212.03, M/Z_1892.39, M/Z_2037.78, M/Z_1752.24, M/Z_6233.15, M/Z_1557.53, M/Z_2124.48, M/Z_1130.79, M/Z_1975.09, M/Z_1283.46, M/Z_1401.5, M/Z_1233.81, M/Z_6194.59, M/Z_1291.05, M/Z_1267.75, M/Z_6133.13, M/Z_1077.66, M/Z_2427.4, M/Z_2412.91, M/Z_2540.75, M/Z_3279.24 and M/Z_1734.95; wherein the calculated relative error of M/Z (ppm) is within 0.002.
30. The method according to any one of embodiments 22-29, wherein said screening for characteristic peaks of difference between CKD and healthy populations comprises:
i) Performing characteristic peak screening on MALDI-ToF MS data of urine samples of the CKD population and the healthy population using three machine learning methods, namely the least absolute shrinkage and selection operator (Lasso), partial least squares discriminant analysis (PLS-DA) and recursive feature elimination with cross-validation (RFECV), and selecting a peak that satisfies any one of the three machine learning methods as a candidate characteristic peak;
in the Lasso selection, 80% of the samples of the test data set are randomly selected, this is repeated 200 times, and a peak selected in more than 10% of the repetitions is taken as a difference characteristic peak of CKD; in the PLS-DA selection, a peak whose VIP score ranks in the top 30 is taken as a difference characteristic peak of CKD; in the RFECV selection, a peak whose importance score ranks in the top 30 is taken as a difference characteristic peak of CKD;
ii) Screening the candidate characteristic peaks from i) that satisfy both frequency >30% and AUC >60% in each group as characteristic peaks.
31. The method according to any one of embodiments 22-30, wherein establishing the CKD disease diagnosis model using a machine learning method comprises building the model with 5 repetitions of 10-fold cross-validation.
32. The method of embodiment 31, wherein the machine learning method comprises: support vector machine (SVM), random forest (RF), naive Bayes (NB), gradient boosting (GBM), K-nearest neighbors (KNN), conditional inference decision tree (ctree), and/or adaptive boosting (AdaBoost).
33. The method according to any one of embodiments 22-32, wherein up-regulation of the characteristic polypeptide peaks M/Z_1948.19, M/Z_1909.6, M/Z_1932.05 and M/Z_2427.4 indicates that the urine sample is a positive sample, i.e., the patient is a CKD patient, and the 10-fold cross-validation accuracy is not less than 90%.
34. The method of any one of embodiments 22-33, wherein the CKD comprises IgA nephropathy and Non-IgA nephropathy.
35. Use of characteristic peaks based on MALDI-ToF MS data for the preparation of a diagnostic model of CKD, said diagnostic model of CKD being a machine learning type model, wherein said characteristic peaks are selected from peaks having the following mass-to-charge ratios: M/Z_1049.48, M/Z_1157.14, M/Z_1637.03, M/Z_1948.19, M/Z_4744.24, M/Z_6215.39, M/Z_1089.32, M/Z_2265.63, M/Z_1594.15, M/Z_1900.89, M/Z_3040.02, M/Z_1108.1, M/Z_2279.32, M/Z_2925.93, M/Z_1563.44, M/Z_2078.53, M/Z_1396.23, M/Z_1909.6, M/Z_1932.05, M/Z_1250.48, M/Z_1212.03, M/Z_1892.39, M/Z_2037.78, M/Z_1752.24, M/Z_6233.15, M/Z_1557.53, M/Z_2124.48, M/Z_1130.79, M/Z_1975.09, M/Z_1283.46 and M/Z_2427.4; wherein the calculated relative error of M/Z (ppm) is within 0.002.
36. A diagnostic model of CKD based on MALDI-ToF MS data, the diagnostic model of CKD being a machine learning type model, the diagnostic model of CKD having a plurality of characteristic peaks of CKD selected from one or more of: M/Z_1049.48, M/Z_1089.32, M/Z_1130.79, M/Z_1157.14, M/Z_1212.03, M/Z_1250.48, M/Z_1557.53, M/Z_1594.15, M/Z_1637.03, M/Z_1752.24, M/Z_1892.39, M/Z_1900.89, M/Z_1909.6, M/Z_1932.05, M/Z_1948.19, M/Z_2037.78, M/Z_2124.48, M/Z_2265.63, M/Z_2427.4, M/Z_3040.02, M/Z_4744.24, M/Z_6215.39 and M/Z_6233.15.
37. The diagnostic model of embodiment 36, wherein the characteristic peaks of difference between the CKD population and the healthy population comprise: M/Z_1948.19, M/Z_1909.6, M/Z_1932.05, M/Z_2427.4 and M/Z_1637.03; wherein the calculated relative error of M/Z (ppm) is within 0.002.
38. The diagnostic model of any one of embodiments 36-37, wherein the characteristic peaks of difference between the CKD population and the healthy population comprise: M/Z_1049.48, M/Z_1157.14, M/Z_1637.03, M/Z_1948.19, M/Z_4744.24, M/Z_6215.39, M/Z_1089.32, M/Z_2265.63, M/Z_1594.15, M/Z_1900.89, M/Z_3040.02, M/Z_1909.6, M/Z_1932.05, M/Z_1250.48, M/Z_1212.03, M/Z_1892.39, M/Z_2124.48 and M/Z_2427.4; wherein the calculated relative error of M/Z (ppm) is within 0.002.
39. The diagnostic model of any one of embodiments 36-38, wherein the characteristic peaks of difference between the CKD population and the healthy population comprise: M/Z_1049.48, M/Z_1089.32, M/Z_1130.79, M/Z_1157.14, M/Z_1212.03, M/Z_1250.48, M/Z_1557.53, M/Z_1594.15, M/Z_1637.03, M/Z_1752.24, M/Z_1892.39, M/Z_1900.89, M/Z_1909.6, M/Z_1932.05, M/Z_1948.19, M/Z_2037.78, M/Z_2124.48, M/Z_2265.63, M/Z_2427.4, M/Z_3040.02, M/Z_4744.24, M/Z_6215.39 and M/Z_6233.15; wherein the calculated relative error of M/Z (ppm) is within 0.002.
40. The diagnostic model of any one of embodiments 36-39, wherein the characteristic peaks of difference between the CKD population and the healthy population comprise: M/Z_1049.48, M/Z_1157.14, M/Z_1637.03, M/Z_1948.19, M/Z_4744.24, M/Z_6215.39, M/Z_1089.32, M/Z_2265.63, M/Z_1594.15, M/Z_1900.89, M/Z_3040.02, M/Z_1108.1, M/Z_2279.32, M/Z_2925.93, M/Z_1563.44, M/Z_2078.53, M/Z_1396.23, M/Z_1909.6, M/Z_1932.05, M/Z_1250.48, M/Z_1212.03, M/Z_1892.39, M/Z_2037.78, M/Z_1752.24, M/Z_6233.15, M/Z_1557.53, M/Z_2124.48, M/Z_1130.79, M/Z_1975.09, M/Z_1283.46 and M/Z_2427.4; wherein the calculated relative error of M/Z (ppm) is within 0.002.
41. The diagnostic model of any one of embodiments 36-40, wherein the characteristic peaks of difference between the CKD population and the healthy population comprise: M/Z_1049.48, M/Z_1157.14, M/Z_1637.03, M/Z_1948.19, M/Z_4744.24, M/Z_6215.39, M/Z_1089.32, M/Z_2265.63, M/Z_1594.15, M/Z_1900.89, M/Z_3040.02, M/Z_1108.1, M/Z_2279.32, M/Z_2925.93, M/Z_1563.44, M/Z_2078.53, M/Z_1396.23, M/Z_1909.6, M/Z_1932.05, M/Z_1250.48, M/Z_1212.03, M/Z_1892.39, M/Z_2037.78, M/Z_1752.24, M/Z_6233.15, M/Z_1557.53, M/Z_2124.48, M/Z_1130.79, M/Z_1975.09, M/Z_1283.46, M/Z_1401.5, M/Z_1233.81, M/Z_6194.59, M/Z_1291.05, M/Z_1267.75, M/Z_6133.13, M/Z_1077.66, M/Z_2427.4, M/Z_2412.91, M/Z_2540.75, M/Z_3279.24 and M/Z_1734.95; wherein the calculated relative error of M/Z (ppm) is within 0.002.
42. The diagnostic model of any of embodiments 36-41, wherein the machine learning type model comprises: support vector machine (SVM), random forest (RF), naive Bayes (NB), gradient boosting (GBM), K-nearest neighbors (KNN), conditional inference decision tree (ctree), and/or adaptive boosting (AdaBoost).
43. A method of diagnosing CKD, the method comprising: i) Obtaining a fingerprint of urine polypeptides from a urine sample from a subject, ii) selecting characteristic peaks of differences between the CKD population and the healthy population in the fingerprint, and inputting the characteristic peaks of differences into a diagnostic model for CKD as described in any one of embodiments 36-42, to obtain a probability of whether the patient suffers from CKD.
44. The method of embodiment 43, comprising using the AUC indicator to determine the probability of whether a subject suffers from CKD.
45. A diagnostic system for CKD comprising a computing unit that executes a diagnostic model of CKD according to any one of embodiments 36-42.
46. A method for constructing an IgA nephropathy (IgAN) diagnostic model based on MALDI-ToF MS data, the method comprising:
i) Screening MALDI-ToF MS data of urine samples of an IgAN population and a Non-IgAN population to obtain one or more difference characteristic peaks having the following mass-to-charge ratios: M/Z_1049.48, M/Z_1212.03, M/Z_1948.19, M/Z_6215.39, M/Z_1594.15, M/Z_2941.82, M/Z_3279.24, M/Z_2265.63, M/Z_1637.03, M/Z_1089.32, M/Z_2427.4, M/Z_1734.95, M/Z_3040.02, M/Z_1267.75, M/Z_1909.6, M/Z_1932.05, M/Z_1250.48, M/Z_2037.78, M/Z_1157.14, M/Z_6233.15, M/Z_1892.39, M/Z_1130.79, M/Z_1900.89, M/Z_1233.81, M/Z_1557.53, M/Z_4744.24, M/Z_2124.48, M/Z_6133.13, M/Z_1394.04, M/Z_1608.59, M/Z_1629.53, M/Z_1686.93, M/Z_1752.24, M/Z_1803.01, M/Z_2412.91, M/Z_2585.93, M/Z_2601.72, M/Z_2726.14, M/Z_2733.04, M/Z_2999.97, M/Z_3021.7, M/Z_3208.35, M/Z_3286.12, M/Z_3324.23 and M/Z_6177.18; wherein the calculated relative error of M/Z (ppm) is within 0.002;
ii) Establishing an IgAN disease diagnosis model from the identified difference characteristic peaks using a machine learning method.
47. The method of embodiment 46, wherein said Non-IgAN population comprises a healthy population and a CKD population of Non-IgA nephropathy.
48. A method of constructing a model for identifying IgAN and Non-IgAN, the method comprising:
i) MALDI-ToF MS data of urine samples of IgAN people and Non-IgAN people are screened to obtain one or more different characteristic peaks between CKD people and healthy people with the following mass-to-charge ratios: M/Z _1394.04, M/Z _1608.59, M/Z _1629.53, M/Z _1637.03, M/Z _1686.93, M/Z _1752.24, M/Z _1803.01, M/Z _2412.91, M/Z _2585.93, M/Z _2601.72, M/Z _2726.14, M/Z _2733.04, M/Z _2941.82, M/Z _2999.97, M/Z _3021.7, M/Z _3040.02, M/Z _3208.35, M/Z _ 79.24, M/Z _3286.12, M/Z _3324.23, M/Z _6177.18, M/Z _ 323, M/Z _1714.8, M/Z _ 3273.68, M/Z _ 3228.191, M/Z _ 161263.191, M/Z _ 17163.63; wherein the calculated relative error of M/Z (ppm) is within 0.002;
ii) establishing a model for identifying the IgAN and the Non-IgAN by using the identified difference characteristic peak and adopting a machine learning method.
49. Use of characteristic peaks based on MALDI-ToF MS data for the preparation of a diagnostic model of CKD, said diagnostic model of CKD being a machine learning type model, wherein said characteristic peaks are selected from peaks having the following mass-to-charge ratios: M/Z_1049.48, M/Z_1212.03, M/Z_1948.19, M/Z_6215.39, M/Z_1594.15, M/Z_2941.82, M/Z_3279.24, M/Z_2265.63, M/Z_1637.03, M/Z_1089.32, M/Z_2427.4, M/Z_1734.95, M/Z_3040.02, M/Z_1267.75, M/Z_1909.6, M/Z_1932.05, M/Z_1250.48, M/Z_2037.78, M/Z_1157.14, M/Z_6233.15, M/Z_1892.39, M/Z_1130.79, M/Z_1900.89, M/Z_1233.81, M/Z_1557.53, M/Z_4744.24, M/Z_2124.48, M/Z_6133.13, M/Z_1394.04, M/Z_1608.59, M/Z_1629.53, M/Z_1686.93, M/Z_1752.24, M/Z_1803.01, M/Z_2412.91, M/Z_2585.93, M/Z_2601.72, M/Z_2726.14, M/Z_2733.04, M/Z_2999.97, M/Z_3021.7, M/Z_3208.35, M/Z_3286.12, M/Z_3324.23 and M/Z_6177.18; wherein the calculated relative error of M/Z (ppm) is within 0.002.
50. A MALDI-ToF MS data-based IgA nephropathy (IgAN) diagnostic model, the IgAN diagnostic model being a machine learning type model having a plurality of characteristic peaks of IgAN selected from the group consisting of: M/Z_1049.48, M/Z_1212.03, M/Z_1948.19, M/Z_6215.39, M/Z_1594.15, M/Z_2941.82, M/Z_3279.24, M/Z_2265.63, M/Z_1637.03, M/Z_1089.32, M/Z_2427.4, M/Z_1734.95, M/Z_3040.02, M/Z_1267.75, M/Z_1909.6, M/Z_1932.05, M/Z_1250.48, M/Z_2037.78, M/Z_1157.14, M/Z_6233.15, M/Z_1892.39, M/Z_1130.79, M/Z_1900.89, M/Z_1233.81, M/Z_1557.53, M/Z_4744.24, M/Z_2124.48, M/Z_6133.13, M/Z_1394.04, M/Z_1608.59, M/Z_1629.53, M/Z_1686.93, M/Z_1752.24, M/Z_1803.01, M/Z_2412.91, M/Z_2585.93, M/Z_2601.72, M/Z_2726.14, M/Z_2733.04, M/Z_2999.97, M/Z_3021.7, M/Z_3208.35, M/Z_3286.12, M/Z_3324.23 and M/Z_6177.18; wherein the calculated relative error of M/Z (ppm) is within 0.002.
51. The diagnostic model of embodiment 50, wherein the machine learning type model comprises: support vector machine (SVM), random forest (RF), naive Bayes (NB), gradient boosting (GBM), K-nearest neighbors (KNN), conditional inference decision tree (ctree), and/or adaptive boosting (AdaBoost).
52. A diagnostic system for IgAN comprising a calculation unit executing a diagnostic model for IgAN according to any of embodiments 50-51.
53. A method for diagnosing IgA nephropathy, comprising: inputting MALDI-ToF MS data of a subject urine sample into the IgA nephropathy diagnostic model according to any one of embodiments 50-51, and obtaining the probability of whether the patient suffers from IgAN.
54. The method of embodiment 53, comprising using the AUC indicator to determine the probability of whether a subject suffers from IgAN.
55. A model for identifying IgAN and Non-IgAN, wherein the model for identifying IgAN and Non-IgAN is a machine learning type model, and the model for identifying IgAN and Non-IgAN has a plurality of characteristic peaks, and the characteristic peaks are selected from the group consisting of: M/Z _1394.04, M/Z _1608.59, M/Z _1629.53, M/Z _1637.03, M/Z _1686.93, M/Z _1752.24, M/Z _1803.01, M/Z _2412.91, M/Z _2585.93, M/Z _2601.72, M/Z _2726.14, M/Z _2733.04, M/Z _2941.82, M/Z _2999.97, M/Z _3021.7, M/Z _3040.02, M/Z _3208.35, M/Z _3279.24, M/Z _3286.12, M/Z _3324.23, M/Z _6177.18, M/Z _ 1713, M/Z _1714.8, M/Z _ 3273.68, M/Z _ 3228.191, M/Z _ 2863.63; wherein the calculated relative error of M/Z (ppm) is within 0.002.
Without intending to be bound by any theory, the following examples are intended only to illustrate the methods, uses, and the like of the present application, and are not intended to limit the scope of the invention of the present application.
Examples
Example 1
1.1 urine sample pretreatment
Control group: 10 ml of midstream urine from each healthy subject (48 persons) was collected in a collection tube and centrifuged at 3000 rpm for 10 min, and the supernatant was taken, shaken well and stored at 4 °C until use. (Healthy control screening conditions: age- and sex-matched, normal eGFR, normal microalbumin.)
Experimental group: 10 ml of midstream urine from each patient with chronic kidney disease diagnosed by pathological biopsy (194 persons) was collected in a collection tube and centrifuged at 3000 rpm for 10 min, and the supernatant was taken, shaken well and stored at 4 °C until use.
1.2 MALDI-TOF detection of urine
1) Solution preparation: 2 µl of urine supernatant + 18 µl of CHCA matrix solution (concentration: 10 mg/ml).
2) Sample application: the solution was mixed, and 2 µl was immediately spotted onto a clean target plate well. The target plate was placed on a 40 °C hot plate so that the spotted solution evaporated evenly to dryness on the target plate.
3) Sample detection:
Instrument: new-generation Quanteof (MALDI-TOF) of Qingdao Huazhi Bio-Corp
Detected M/Z range: 1000-10,000 Da
Calibration standards: 757.40 Da, 1045.00 Da, 2465.20 Da, 3494.65 Da and 5734.50 Da.
1.3 Mass Spectrometry data processing
1) After the data were corrected with the calibrator, the raw data were exported as mzML files.
2) Software: the raw mass spectral data were processed using R (version 4.0.3) and the R packages "MALDIquantForeign" (version 0.12) and "MALDIquant" (version 1.19.3). The processing comprised the following steps:
(1) Data reading: the mzML raw data were read in using the importMzMl function of the "MALDIquantForeign" R package.
(2) Quality control: we tested whether all spectra read in contained the same number of data points and contained no NA values.
(3) Variance stabilization: we used a square root transform to simplify graphical visualization and to overcome the potential dependency of the variance on the mean.
(4) Smoothing and baseline correction: the spectra were smoothed using a 21-point Savitzky-Golay filter [1], and baseline noise was then removed using the SNIP algorithm [2] (Ryan et al., 1988).
(5) Intensity correction: for better comparability and to overcome (very) small batch effects, we balanced the intensity values using total ion current (TIC) calibration.
(6) Peak mass alignment: alignment is necessary to compare peaks across different spectra. To match peaks belonging to the same mass, we used a statistical regression-based approach combining the algorithms of He et al. (2011) [3] and Wang et al. (2010) [4]. Specifically, landmark peaks that appear in most spectra are identified first. Then, a non-linear warping function is calculated for each spectrum by fitting a local regression to the matched reference peaks.
(7) Peak identification: we chose a low SNR (signal-to-noise ratio) threshold of 2 to retain as many signal peaks as possible. A signal is identified as a peak when its intensity is greater than twice the noise level.
(8) Peak merging: after alignment, the peak positions (masses) are very similar but not exactly the same. Binning is required to make similar peak mass values identical: peaks whose mass values are very close but slightly different, within a tolerance of 0.002 ppm, are merged into one mass. For example, for a peak at 1000 m/z with the tolerance set at 0.002 ppm, peaks at 998-1002 m/z are all represented as 1000 m/z.
(9) Peak table (feature matrix): we obtain a peak table in which each column represents a sample, each row represents an identified peak mass, and each value is the relative intensity of the peak. Missing values (undetected peaks) are interpolated from the corresponding spectra. False positive peaks with a within-group frequency of less than 25% are then removed, reducing false positives among the identified peaks.
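Several of the processing steps above can be illustrated with a minimal Python sketch. It is not the MALDIquant implementation used in this example: the function names are hypothetical, the TIC is approximated by a plain sum of intensities, the binning is a simple greedy grouping, and the tolerance is treated as a relative deviation of 0.002, which is what the 998-1002 m/z worked example implies. The frequency filter keeps a peak if it is detected in at least 25% of the samples of at least one group, which is an assumed reading of the within-group criterion.

```python
import math


def sqrt_transform(intensities):
    """Variance stabilization: square root of each (non-negative) intensity."""
    return [math.sqrt(max(v, 0.0)) for v in intensities]


def tic_calibrate(intensities):
    """Scale a spectrum so that its summed intensity (a simplified TIC) is 1."""
    total = sum(intensities)
    return [v / total for v in intensities] if total > 0 else list(intensities)


def bin_peaks(masses, tolerance=0.002):
    """Greedy binning: grow a bin while every member stays within the
    relative tolerance of the bin mean, then report one mean mass per bin."""
    bins, current = [], []
    for m in sorted(masses):
        candidate = current + [m]
        center = sum(candidate) / len(candidate)
        if all(abs(x - center) / center <= tolerance for x in candidate):
            current = candidate
        else:
            bins.append(current)
            current = [m]
    if current:
        bins.append(current)
    return [sum(b) / len(b) for b in bins]


def filter_by_group_frequency(detected, labels, min_frequency=0.25):
    """Keep peaks detected in at least min_frequency of the samples of at
    least one group. `detected` maps a peak mass to one boolean per sample."""
    kept = []
    for mass, flags in detected.items():
        for group in set(labels):
            members = [f for f, lab in zip(flags, labels) if lab == group]
            if members and sum(members) / len(members) >= min_frequency:
                kept.append(mass)
                break
    return sorted(kept)
```

With these definitions, masses of 999, 1000 and 1001 m/z from three spectra fall into a single bin reported at 1000 m/z, while a peak at 1500 m/z stays in its own bin.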
As shown in fig. 1, the samples of each group (healthy population and CKD population) were randomly divided at a ratio of 2:1 into a training set and a validation set, and the training set samples were used to identify characteristic peaks. After the characteristic peaks were identified, models were built from them using 5 repetitions of 10-fold cross-validation (7 machine learning methods), and model performance was evaluated using the AUC. The models were also applied to the independent validation set to verify their discriminative power, which was compared with that of commonly used clinical indicators.
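The sampling scheme described above can be sketched as follows. This is an illustrative Python sketch only: the function names, the fixed random seed and the integer rounding of the 2:1 split are assumptions, and the source does not state how the original split or folds were drawn, so the group sizes produced here need not equal those reported.

```python
import random


def stratified_split(labels, train_parts=2, valid_parts=1, seed=0):
    """Randomly split each group train_parts:valid_parts into a training set
    and a validation set. Returns two sorted lists of sample indices."""
    rng = random.Random(seed)
    train, valid = [], []
    for group in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == group]
        rng.shuffle(idx)
        n_train = (train_parts * len(idx)) // (train_parts + valid_parts)
        train.extend(idx[:n_train])
        valid.extend(idx[n_train:])
    return sorted(train), sorted(valid)


def repeated_kfold(n_samples, n_folds=10, n_repeats=5, seed=0):
    """Index pairs (fit, held_out) for n_repeats repetitions of n_folds-fold
    cross-validation; each repetition reshuffles the samples."""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        folds = [order[i::n_folds] for i in range(n_folds)]
        for held_out in folds:
            held = set(held_out)
            fit = [i for i in order if i not in held]
            splits.append((fit, held_out))
    return splits
```

Five repetitions of 10-fold cross-validation yield 50 fit/held-out pairs per machine learning method; averaging the held-out AUC over these pairs gives one performance estimate per method.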
1.4 characteristic Peak identification
1) The test set (160 samples: 32 healthy controls, 128 patients with chronic kidney disease) were subjected to characteristic peak screening. Candidate signature peaks were screened according to the Strict standard, the Medium standard (shown in fig. 2) and the Loose standard, respectively.
Standard | Lasso selection | PLS-DA selection | RFECV selection
Strict | Frequency of repeated occurrence >30% | VIP score in top 10 | Importance score in top 10
Medium | Frequency of repeated occurrence >20% | VIP score in top 20 | Importance score in top 20
Loose | Frequency of repeated occurrence >10% | VIP score in top 30 | Importance score in top 30
2) Candidate characteristic peaks from 1) that simultaneously satisfied the between-group frequency criterion (frequency >30%) and AUC >60% were retained as characteristic peaks.
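The two-stage screening can be sketched in Python as below. The sketch assumes that the Lasso selection frequencies, the PLS-DA VIP ranks and the RFECV importance ranks have already been computed elsewhere and are passed in as dictionaries keyed by peak mass; all function and variable names are hypothetical. The single-peak AUC is computed from pairwise comparisons and is made direction-independent, which is an assumption, since the source does not say how down-regulated peaks were scored.

```python
# (minimum Lasso selection frequency, top-N rank for VIP / importance score)
THRESHOLDS = {"Strict": (0.30, 10), "Medium": (0.20, 20), "Loose": (0.10, 30)}


def candidate_peaks(lasso_frequency, vip_rank, rfecv_rank, standard="Medium"):
    """A peak is a candidate if it passes ANY one of the three selections."""
    min_freq, top_n = THRESHOLDS[standard]
    peaks = set(lasso_frequency) | set(vip_rank) | set(rfecv_rank)
    selected = []
    for peak in sorted(peaks):
        by_lasso = lasso_frequency.get(peak, 0.0) > min_freq
        by_plsda = vip_rank.get(peak, top_n + 1) <= top_n
        by_rfecv = rfecv_rank.get(peak, top_n + 1) <= top_n
        if by_lasso or by_plsda or by_rfecv:
            selected.append(peak)
    return selected


def single_peak_auc(case_values, control_values):
    """AUC of one peak as the fraction of case/control pairs ranked correctly
    (ties count one half), reported independently of direction."""
    wins = 0.0
    for c in case_values:
        for h in control_values:
            wins += 1.0 if c > h else 0.5 if c == h else 0.0
    auc = wins / (len(case_values) * len(control_values))
    return max(auc, 1.0 - auc)


def screen(candidates, frequency, auc_by_peak, min_frequency=0.30, min_auc=0.60):
    """Second stage: keep candidates with frequency >30% and AUC >60%."""
    return [p for p in candidates
            if frequency[p] > min_frequency and auc_by_peak[p] > min_auc]
```

Because a peak only needs to pass one of the three selections, loosening the standard from Strict to Loose can only enlarge the candidate set, which is consistent with the 18, 31 and 42 peaks reported for the three standards.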
As shown in Table 1-1, a total of 42 peaks were screened as characteristic peaks of the difference between chronic kidney disease and healthy control group according to the Loose standard.
TABLE 1-1
Figure BDA0003779946000000291
Figure BDA0003779946000000301
As shown in Table 1-2, a total of 18 peaks were selected as characteristic peaks of the difference between chronic kidney disease and healthy controls according to the Strict standard.
TABLE 1-2
Figure BDA0003779946000000302
Figure BDA0003779946000000311
As shown in Table 1-3, a total of 31 peaks were selected as characteristic peaks of the difference between chronic kidney disease and healthy control group according to the Medium standard.
TABLE 1-3 Characteristic peaks of differences between chronic kidney disease and healthy controls
Figure BDA0003779946000000312
Figure BDA0003779946000000321
As shown in Tables 2 and 3, 28 peaks were screened as the difference peaks between IgAN nephropathy and the healthy control, and 26 peaks were screened as the difference peaks between IgAN nephropathy and Non-IgAN nephropathy.
TABLE 2 peaks of differences between IgAN nephropathy and healthy controls
Figure BDA0003779946000000322
Figure BDA0003779946000000331
TABLE 3 peaks of differences between IgAN nephropathy and Non-IgAN nephropathy
Figure BDA0003779946000000332
Figure BDA0003779946000000341
The differential peaks in Table 2 and Table 3 were combined and defined as IgAN-related differential peaks (Table 4); 45 peaks in total met the screening conditions. The peaks of difference between IgAN nephropathy and healthy controls were included in the IgAN model so that, in addition to distinguishing IgAN from Non-IgAN, the model can also distinguish healthy individuals from IgAN patients. The severity of IgAN can be further graded according to the 45 combined peaks.
TABLE 4 IgAN-related Difference peaks
Figure BDA0003779946000000342
Figure BDA0003779946000000351
Figure BDA0003779946000000361
1.5 building multiple diagnostic models
Using 7 machine learning methods (svm, nb, ctree, rf, gbm, adaboost and knn), a classification model for screening and diagnosing chronic kidney disease and a differential diagnosis model for IgAN nephropathy were constructed from the characteristic peaks related to chronic kidney disease and the 45 characteristic peaks related to IgAN, respectively. ROC curves were drawn for the different machine learning models, and the area under the curve (AUC) was used to evaluate classifier performance.
As shown in table 5, the peak sets included under the different criteria all show good disease evaluation potential in CKD diagnostic models constructed by the different machine learning methods, indicating that the characteristic peaks included within this range are reasonable.
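The ROC-based evaluation can be illustrated with a short Python sketch that takes the class probabilities predicted by any of the fitted models. The function names are hypothetical. The cutoff rule shown, maximizing Youden's J (sensitivity + specificity - 1), is an assumption: the source refers to an optimal cutoff value but does not state how it was chosen.

```python
def roc_points(labels, scores):
    """ROC points as (false positive rate, true positive rate, threshold).
    labels: 1 = disease, 0 = control; scores: predicted disease probability."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0, float("inf"))]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= t)
        points.append((fp / n_neg, tp / n_pos, t))
    return points


def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0, _), (x1, y1, _) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def youden_cutoff(points):
    """Threshold that maximizes TPR - FPR (assumed cutoff rule)."""
    best = max(points[1:], key=lambda p: p[1] - p[0])
    return best[2]
```

For example, for labels [1, 0, 1, 0] and scores [0.9, 0.8, 0.7, 0.1], three of the four case/control pairs are ranked correctly, and the trapezoidal AUC is accordingly 0.75.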
TABLE 5
Figure BDA0003779946000000362
Figure BDA0003779946000000371
As shown in fig. 3, when the Medium peak inclusion criterion was used, the AUC was higher than 90% for all machine learning methods in the chronic kidney disease model. The 7 models were validated with the validation set (82 samples: 12 healthy controls, 66 patients with chronic kidney disease), and the AUC was higher than 97% for all models except ctree.
As shown in FIG. 4, the AUCs of the svm, nb and gbm models in the IgAN differential diagnosis model are close to 80%; in the cohort, the AUC of the svm model is 81.39% and that of the gbm model is 85.06%.
1.5.1 Using the constructed model to test new samples
First, MALDI-TOF detection is performed on the sample to be tested according to the method in Example 1 to obtain the urine polypeptide profile, and the spectrum is subjected to mass spectrometry data processing to obtain the peak table of the new sample. The mass values in the peak table of the new sample may differ slightly from those used for modeling (table 2 or table 4), so the two sets of masses are matched with tolerance = 0.002 ppm. The 31 CKD-related peaks and the 45 IgAN-related peaks of the new sample are then input into the established classification models to obtain the classification probability. The class of the new sample is determined according to the optimal cutoff value of the established model, and the subject's risk of CKD or IgAN is evaluated accordingly.
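The mass matching step for a new sample can be sketched in Python as follows. The function name and the example intensities are hypothetical, the tolerance is treated as a relative deviation of 0.002, and a model peak with no match is filled with 0.0 purely as a placeholder; the processing described in Example 1 interpolates missing values from the spectrum instead.

```python
def match_to_model(sample_peaks, model_masses, tolerance=0.002):
    """Build the model's feature vector from a new sample's peak table.
    sample_peaks maps a detected mass to its relative intensity. For each
    model mass, the closest sample peak within the relative tolerance is used."""
    features = []
    for ref in model_masses:
        best = None
        for mass in sample_peaks:
            if abs(mass - ref) / ref <= tolerance:
                if best is None or abs(mass - ref) < abs(best - ref):
                    best = mass
        features.append(sample_peaks[best] if best is not None else 0.0)
    return features


# Hypothetical new-sample peak table matched against three model peaks.
new_sample = {1948.9: 0.12, 1909.0: 0.05, 3000.0: 0.30}
vector = match_to_model(new_sample, [1948.19, 1909.6, 2427.4])
```

The resulting vector has one entry per model peak, in the order of the model's peak list, and can be passed to the trained classifier to obtain the classification probability.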
Example 2
2.1 differential characterization between Chronic nephropathy and healthy controls
The 31 characteristic peaks of difference between chronic kidney disease and healthy controls were further annotated by LC-MS/MS, and 25 of the 31 characteristic peaks were identified as protein fragments (distinguishing healthy controls from CKD patients), as shown in table 6.
TABLE 6 protein fragment identification of CKD-related difference peaks
Figure BDA0003779946000000372
Figure BDA0003779946000000381
Figure BDA0003779946000000391
Figure BDA0003779946000000401
Figure BDA0003779946000000411
Figure BDA0003779946000000421
Figure BDA0003779946000000431
Figure BDA0003779946000000441
Here, AUC denotes the performance of the peak in discriminating CKD vs. control, and MH+ denotes the identification result obtained by conventional tandem mass spectrometry.
Figs. 5-6 show that the 31 selected CKD characteristic peaks change significantly in peak intensity among the control, eGFR grade 1-2 and eGFR grade 3-4 groups, and can be used as characteristic peaks to distinguish CKD from the healthy population.
2.2 Identification of distinct characteristic peaks between IgAN and Non-IgAN
Further, the 45 IgAN-related differential peaks (distinguishing IgAN from Non-IgAN) were annotated by LC-MS/MS, and 31 of the 45 characteristic peaks were identified as protein fragments, as shown in Table 7.
TABLE 7 protein fragment identification of IgAN-related differential peaks
Figure BDA0003779946000000442
Figure BDA0003779946000000451
Figure BDA0003779946000000461
Figure BDA0003779946000000471
Figure BDA0003779946000000481
Figure BDA0003779946000000491
Figure BDA0003779946000000501
Figure BDA0003779946000000511
Figure BDA0003779946000000521
Figure BDA0003779946000000531
Figure BDA0003779946000000541
Figure BDA0003779946000000551
Figure BDA0003779946000000561
In the table, AUC_Control-IgA denotes the AUC of the peak for distinguishing IgA patients from controls, AUC_IgAN-other CKD denotes the AUC of the peak for distinguishing IgA nephropathy patients from other CKD patients, and MH+ denotes the identification result obtained by conventional tandem (secondary) mass spectrometry.
FIGS. 7-8 show that the peak intensities of the 45 selected IgAN characteristic peaks change significantly among the control, Lee grade 1-2, and Lee grade 3-4 groups, so these peaks can be used as characteristic peaks for distinguishing IgAN from Non-IgAN.
References
[1] Savitzky, A. and Golay, M.J.E. (1964). Smoothing and differentiation of data by simplified least squares procedures. Analytical Chemistry, 36(8):1627–1639.
[2] Ryan, C., Clayton, E., Griffin, W., Sie, S., and Cousens, D. (1988). SNIP, a statistics-sensitive background treatment for the quantitative analysis of PIXE spectra in geoscience applications. Nuclear Instruments and Methods in Physics Research Section B: Beam Interactions with Materials and Atoms, 34(3):396–402.
[3] He, Q.P., et al. (2011). Self-calibrated warping for mass spectra alignment. Cancer Informatics, 10:65–82.
[4] Wang, B., et al. (2010). DISCO: distance and spectrum correlation optimization alignment for two-dimensional gas chromatography time-of-flight mass spectrometry-based metabolomics. Analytical Chemistry, 82:5069–5081.

Claims (10)

1. A method of constructing a diagnostic model for chronic kidney disease (CKD) based on MALDI-ToF MS data, the method comprising:
i) screening characteristic peaks in MALDI-ToF MS data of urine samples from a CKD population and a healthy population using three machine learning methods, namely least absolute shrinkage and selection operator (Lasso), partial least squares discriminant analysis (PLS-DA), and recursive feature elimination with cross-validation (RFECV), and selecting any peak that meets the criterion of any one of the three methods as a candidate characteristic peak;
wherein, in the Lasso selection, 80% of the samples of the test data set are randomly selected, the selection is repeated 200 times, and peaks selected in more than 10% of the repetitions are taken as differential characteristic peaks of CKD; in the PLS-DA selection, peaks whose VIP score ranks in the top 30 are taken as differential characteristic peaks of CKD; and in the RFECV selection, peaks whose importance score ranks in the top 30 are taken as differential characteristic peaks of CKD;
ii) screening, as characteristic peaks, the candidate characteristic peaks from i) that satisfy both frequency > 30% and AUC > 60% for each group;
iii) establishing a CKD diagnosis model from the identified differential characteristic peaks using a machine learning method.
2. The method of claim 1, wherein peaks in i) that meet any one of the following criteria are selected as candidate characteristic peaks: in the Lasso selection, 80% of the samples of the test data set are randomly selected, the selection is repeated 200 times, and peaks selected in more than 20% of the repetitions are taken as differential characteristic peaks of CKD; in the PLS-DA selection, peaks whose VIP score ranks in the top 20 are taken as differential characteristic peaks of CKD; and in the RFECV selection, peaks whose importance score ranks in the top 20 are taken as differential characteristic peaks of CKD.
3. The method of claim 1, wherein peaks in i) that meet any one of the following criteria are selected as candidate characteristic peaks: in the Lasso selection, 80% of the samples of the test data set are randomly selected, the selection is repeated 200 times, and peaks selected in more than 30% of the repetitions are taken as differential characteristic peaks of CKD; in the PLS-DA selection, peaks whose VIP score ranks in the top 10 are taken as differential characteristic peaks of CKD; and in the RFECV selection, peaks whose importance score ranks in the top 10 are taken as differential characteristic peaks of CKD.
4. The method according to any one of claims 1-3, wherein establishing the CKD diagnosis model using a machine learning method comprises modeling with 5 repetitions of 10-fold cross-validation.
5. The method of claim 4, wherein the machine learning method comprises: support vector machine (SVM), random forest (RF), naive Bayes (NB), gradient boosting (GBM), K-nearest neighbors (KNN), conditional inference decision tree (ctree), and/or adaptive boosting (AdaBoost).
6. The method according to any one of claims 1-5, the method comprising:
i) performing MALDI-ToF MS detection on urine samples from a CKD population and a healthy population to obtain urine polypeptide fingerprints of the two groups;
ii) performing quality control and standardization on the urine polypeptide fingerprints of the CKD population and the healthy population;
iii) screening characteristic peaks in the urine polypeptide fingerprints of the CKD population and the healthy population using three machine learning methods, namely least absolute shrinkage and selection operator (Lasso), partial least squares discriminant analysis (PLS-DA), and recursive feature elimination with cross-validation (RFECV), and selecting any peak that meets the criterion of one of the three methods as a candidate characteristic peak;
wherein, in the Lasso selection, 80% of the samples of the test data set are randomly selected, the selection is repeated 200 times, and peaks selected in more than 10% of the repetitions are taken as differential characteristic peaks of CKD; in the PLS-DA selection, peaks whose VIP score ranks in the top 30 are taken as differential characteristic peaks of CKD; and in the RFECV selection, peaks whose importance score ranks in the top 30 are taken as differential characteristic peaks of CKD;
iv) screening, as characteristic peaks, the candidate characteristic peaks from iii) that satisfy both frequency > 30% and AUC > 60% for each group;
v) establishing a CKD diagnosis model from the identified differential characteristic peaks using 7 machine learning methods: support vector machine (SVM), random forest (RF), naive Bayes (NB), gradient boosting (GBM), K-nearest neighbors (KNN), conditional inference decision tree (ctree), and adaptive boosting (AdaBoost).
7. The method of claim 6, further comprising using the AUC indicator to evaluate a disease diagnostic model.
8. The method of claim 7, wherein using the AUC indicator to evaluate a disease diagnosis model comprises validating the disease diagnosis model using MALDI-ToF MS data of an independent CKD population and a healthy population.
9. A method of constructing a diagnostic model for IgA nephropathy (IgAN) based on MALDI-ToF MS data, the method comprising:
i) screening MALDI-ToF MS data of urine samples from an IgAN population and a Non-IgAN population to obtain one or more differential characteristic peaks with the following mass-to-charge ratios: M/Z_1049.48, M/Z_1212.03, M/Z_1948.19, M/Z_6215.39, M/Z_1594.15, M/Z_2941.82, M/Z_3279.24, M/Z_2265.63, M/Z_1637.03, M/Z_1089.32, M/Z_2427.4, M/Z_1734.95, M/Z_3040.02, M/Z_1267.75, M/Z_1909.6, M/Z_1932.05, M/Z_1250.48, M/Z_2037.78, M/Z_1157.14, M/Z_6233.15, M/Z_1892.39, M/Z_1130.79, M/Z_1900.89, M/Z_1233.81, M/Z_1557.53, M/Z_4744.24, M/Z_2124.48, M/Z_6133.13, M/Z_1394.04, M/Z_1608.59, M/Z_1629.53, M/Z_1686.93, M/Z_1752.24, M/Z_1803.01, M/Z_2412.91, M/Z_2585.93, M/Z_2601.72, M/Z_2726.14, M/Z_2733.04, M/Z_2999.97, M/Z_3021.7, M/Z_3208.35, M/Z_3286.12, M/Z_3324.23, and M/Z_6177.18;
wherein the calculated relative error of each M/Z value (ppm) is within 0.002;
ii) establishing an IgAN diagnosis model from the identified differential characteristic peaks using a machine learning method.
10. A model for distinguishing IgAN from Non-IgAN, wherein the model is a machine learning model having a plurality of characteristic peaks selected from the group consisting of: M/Z_1394.04, M/Z_1608.59, M/Z_1629.53, M/Z_1637.03, M/Z_1686.93, M/Z_1752.24, M/Z_1803.01, M/Z_2412.91, M/Z_2585.93, M/Z_2601.72, M/Z_2726.14, M/Z_2733.04, M/Z_2941.82, M/Z_2999.97, M/Z_3021.7, M/Z_3040.02, M/Z_3208.35, M/Z_3279.24, M/Z_3286.12, M/Z_3324.23, M/Z_6177.18, M/Z_323, M/Z_1714.8, M/Z_3273.68, M/Z_3228.191, M/Z_23563.191, and M/Z_6163; wherein the calculated relative error of each M/Z value (ppm) is within 0.002.
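The repeated-subsampling step recited for the Lasso selection in claim 1 (80% of the samples, 200 repetitions, peaks kept when selected in more than a given share of repetitions) can be illustrated with the following sketch. It is an assumption-laden illustration rather than the claimed method: `selection_frequency` and `mean_difference_selector` are hypothetical names, and the simple mean-difference selector stands in for the Lasso fit, which is not reproduced here. Any function that returns the peaks chosen on a subsample could be passed in its place.

```python
import random


def mean_difference_selector(rows, labels, min_gap=1.0):
    """Placeholder selector (NOT Lasso): keep peaks whose group means differ."""
    cases = [r for r, y in zip(rows, labels) if y == 1]
    controls = [r for r, y in zip(rows, labels) if y == 0]
    if not cases or not controls:
        return []
    selected = []
    for j in range(len(rows[0])):
        case_mean = sum(r[j] for r in cases) / len(cases)
        control_mean = sum(r[j] for r in controls) / len(controls)
        if abs(case_mean - control_mean) > min_gap:
            selected.append(j)
    return selected


def selection_frequency(samples, labels, select_fn,
                        n_repeats=200, fraction=0.8, seed=0):
    """Share of random subsamples in which each peak is selected."""
    rng = random.Random(seed)
    n = len(samples)
    k = max(1, round(fraction * n))
    counts = {}
    for _ in range(n_repeats):
        idx = rng.sample(range(n), k)  # subsample without replacement
        chosen = select_fn([samples[i] for i in idx],
                           [labels[i] for i in idx])
        for peak in set(chosen):
            counts[peak] = counts.get(peak, 0) + 1
    return {peak: c / n_repeats for peak, c in counts.items()}


# Hypothetical data: peak 0 differs between groups, peak 1 does not.
samples = [[5.0, 1.0]] * 5 + [[1.0, 1.0]] * 5
labels = [1] * 5 + [0] * 5
freq = selection_frequency(samples, labels, mean_difference_selector)
candidates = sorted(p for p, f in freq.items() if f > 0.10)
```

With these toy data every subsample contains both groups, so peak 0 is selected in all repetitions and peak 1 in none; the 10% threshold in the last line corresponds to the value recited in claim 1.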
CN202210926734.7A 2022-08-03 2022-08-03 Disease diagnosis model based on MALDI-ToF MS data Pending CN115966299A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210926734.7A CN115966299A (en) 2022-08-03 2022-08-03 Disease diagnosis model based on MALDI-ToF MS data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210926734.7A CN115966299A (en) 2022-08-03 2022-08-03 Disease diagnosis model based on MALDI-ToF MS data

Publications (1)

Publication Number Publication Date
CN115966299A true CN115966299A (en) 2023-04-14

Family

ID=87360525

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210926734.7A Pending CN115966299A (en) 2022-08-03 2022-08-03 Disease diagnosis model based on MALDI-ToF MS data

Country Status (1)

Country Link
CN (1) CN115966299A (en)

Similar Documents

Publication Publication Date Title
US11315774B2 (en) Big-data analyzing Method and mass spectrometric system using the same method
US10713590B2 (en) Bagged filtering method for selection and deselection of features for classification
US10037874B2 (en) Early detection of hepatocellular carcinoma in high risk populations using MALDI-TOF mass spectrometry
US20130238251A1 (en) Method and system for detecting discriminatory data patterns in multiple sets of data
US20080086272A1 (en) Identification and use of biomarkers for the diagnosis and the prognosis of inflammatory diseases
US20120004854A1 (en) Metabolic biomarkers for ovarian cancer and methods of use thereof
Boskamp et al. A new classification method for MALDI imaging mass spectrometry data acquired on formalin-fixed paraffin-embedded tissue samples
JP6715451B2 (en) Mass spectrum analysis system, method and program
JP2004522980A (en) How to analyze a mass spectrum
Cuevas-Delgado et al. Data-dependent normalization strategies for untargeted metabolomics—A case study
CN114414704B (en) System, model and kit for evaluating malignancy degree or probability of thyroid nodule
CN112201356B (en) Construction method of oral squamous cell carcinoma diagnosis model, marker and application thereof
CN113514530A (en) Thyroid malignant tumor diagnosis system based on open ion source
WO2012107786A1 (en) System and method for blind extraction of features from measurement data
WO2024082581A1 (en) M protein detection method
Karimi et al. Identification of discriminatory variables in proteomics data analysis by clustering of variables
CN115966299A (en) Disease diagnosis model based on MALDI-ToF MS data
CN116106398A (en) Markers for diagnosing CKD
Webb-Robertson et al. A Bayesian integration model of high-throughput proteomics and metabolomics data for improved early detection of microbial infections
EP4337784A1 (en) Salivary metabolites are non-invasive biomarkers of hcc
Hajduk et al. The application of fuzzy statistics and linear discriminant analysis as criteria for optimizing the preparation of plasma for matrix-assisted laser desorption/ionization mass spectrometry peptide profiling
CN113960130A (en) Machine learning method for diagnosing thyroid cancer by adopting open ion source
CN117153384A (en) Urine polypeptide detection method
CN112037852A (en) Method and system for predicting lymph node metastasis of colorectal cancer at stage T1
CN117347643B (en) Metabolic marker combination for judging benign and malignant pulmonary nodule, screening method and application thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination