CN116087482B

CN116087482B - Biomarkers for severity typing of course of patients with 2019 novel coronavirus infection

Info

Publication number: CN116087482B
Application number: CN202310172516.3A
Authority: CN
Inventors: 周钢桥; 曹鹏博; 杜振华; 高成明; 李亦学
Original assignee: Academy of Military Medical Sciences AMMS of PLA; Guangzhou National Laboratory
Current assignee: Academy of Military Medical Sciences AMMS of PLA; Guangzhou National Laboratory
Priority date: 2023-02-24
Filing date: 2023-02-24
Publication date: 2023-07-11
Anticipated expiration: 2043-02-24
Also published as: CN116087482A

Abstract

The present invention proposes a biomarker for typing the severity of the disease course of patients with 2019 novel coronavirus infection, said biomarker comprising at least one selected from serum characteristic proteins and serum characteristic metabolites. The biomarker has good predictive performance, can effectively evaluate the severity of COVID-19 in a patient, can reflect the patient's recovery, and helps doctors accurately predict disease progression and intervene clinically in time.

Description

Biomarkers for severity typing of course of patients with 2019 novel coronavirus infection

Technical Field

The invention relates to the technical field of biomedicine, in particular to a biomarker for typing the severity of the disease course of patients with 2019 novel coronavirus infection.

Background

2019 novel coronavirus disease (COVID-19) is a novel respiratory and systemic disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Accurately distinguishing critical patients from non-critical patients in clinical practice is therefore of great significance.

Disclosure of Invention

The present invention aims to solve at least one of the technical problems existing in the prior art to at least some extent.

Since the outbreak of COVID-19, a number of studies have built machine learning classifiers based on clinical characteristics and omics data, but these classifiers have been difficult to apply effectively because of limited sample sizes. The present study integrates human serum proteomics and metabonomics data and, using a machine learning model built on feature molecules (serum proteins and metabolites), identifies biomarkers that can be used to diagnose 2019 novel coronavirus infection or to judge the severity of infected patients. These biomarkers can be used for risk stratification of patients with COVID-19 and for monitoring the disease course.

Thus, in one aspect of the invention, the invention proposes a set of biomarkers. According to an embodiment of the invention, the biomarker comprises at least one selected from the group consisting of a serum characteristic protein and a serum characteristic metabolite, wherein the serum characteristic proteins comprise at least one of cortactin (SRC8), inositol polyphosphate phosphatase (MINP1), serum amyloid A4 (SAA4), RAS oncogene family member RAP1B (RAP1B), filamin A (FLNA), matrix Gla protein (MGP), platelet-derived growth factor subunit B (PDGFB), attractin (ATRN), talin 1 (TLN1), tropomyosin 4 (TPM4) and leukocyte cell-derived chemotaxin 2 (LECT2); the serum characteristic metabolites include at least one of gamma-glutamyl-beta-aminopropionitrile (gamma-Glutamyl-beta-aminopropiononitrile), 3-carboxythiomorpholine (Thiomorpholine 3-carboxylate), 2-(styryl)-1,3-dioxolane (2-(phenylethenyl)-1,3-dioxolane), 3-hexadecanoyl oleanolic acid (3-Hexadecanoyloleanolic acid), PC(16:1(9Z)/2:0), 8-hydroxyeicosatetraenoic acid (8-HETE), archalylglycerol (archettringol-myo-inositol), PC(O-16:0/O-18:0), phenylalanyl-isoleucine dipeptide (Phenylalanyl-Isoleucine), 4-acetamido-2-aminobutyric acid (4-Acetamido-2-aminobutanoic acid), N-undecanoylglycine (N-Undecanoylglycine), and desogestrel (etoposine). The biomarker has good predictive performance, can effectively diagnose COVID-19 or evaluate its severity in a patient, can reflect the patient's recovery, and helps doctors accurately predict disease progression and intervene clinically in time.

According to an embodiment of the present invention, the biomarker may further comprise at least one of the following additional technical features:

According to an embodiment of the invention, the serum characteristic proteins comprise at least one of cortactin (SRC8), inositol polyphosphate phosphatase (MINP1), serum amyloid A4 (SAA4), RAS oncogene family member RAP1B (RAP1B), filamin A (FLNA), matrix Gla protein (MGP), platelet-derived growth factor subunit B (PDGFB), attractin (ATRN), talin 1 (TLN1) and tropomyosin 4 (TPM4).

According to an embodiment of the invention, the serum characteristic metabolites include at least one of gamma-glutamyl-beta-aminopropionitrile (gamma-Glutamyl-beta-aminopropiononitrile), 3-carboxythiomorpholine (Thiomorpholine 3-carboxylate), 2-(styryl)-1,3-dioxolane (2-(phenylethenyl)-1,3-dioxolane), 3-hexadecanoyl oleanolic acid (3-Hexadecanoyloleanolic acid), PC(16:1(9Z)/2:0), 8-hydroxyeicosatetraenoic acid (8-HETE), gulethylglycero-inositol (archettylglycinol), PC(O-16:0/O-18:0), phenylalanyl-isoleucine dipeptide (Phenylalanyl-Isoleucine) and 4-acetamido-2-aminobutyric acid (4-Acetamido-2-aminobutanoic acid).

According to an embodiment of the invention, the biomarker comprises at least one of gamma-glutamyl-beta-aminopropionitrile (gamma-Glutamyl-beta-aminopropiononitrile), 3-carboxythiomorpholine (Thiomorpholine 3-carboxylate), 2-(styryl)-1,3-dioxolane (2-(phenylethenyl)-1,3-dioxolane), cortactin (SRC8), filamin A (FLNA), leukocyte cell-derived chemotaxin 2 (LECT2), N-undecanoylglycine (N-Undecanoylglycine), etonogestrel (Etonogestrel), attractin (ATRN) and desloratadine (Desloratadine).

According to an embodiment of the invention, the biomarkers are screened out through an algorithm and a machine learning model.

According to an embodiment of the invention, the algorithm is selected from at least one of a correlation algorithm, a recursive feature elimination algorithm, a genetic algorithm, a Boruta algorithm and an MMPC algorithm.

According to an embodiment of the invention, the algorithm is a correlation algorithm.

According to an embodiment of the invention, the machine learning model is selected from at least one of a random forest model, a k-nearest neighbor model, a single C5.0 decision tree model, and a partial least squares model.

According to an embodiment of the invention, the machine learning model is a random forest model. The inventors found that the biomarkers screened out by combining the correlation algorithm with the random forest model give the best precision and accuracy in predicting 2019 novel coronavirus infection.

According to an embodiment of the invention, the machine learning model is a random forest model, and the biomarkers include gamma-glutamyl-beta-aminopropionitrile (gamma-Glutamyl-beta-aminopropiononitrile), 3-carboxythiomorpholine (Thiomorpholine 3-carboxylate), 2-(styryl)-1,3-dioxolane (2-(phenylethenyl)-1,3-dioxolane), cortactin (SRC8), filamin A (FLNA), leukocyte cell-derived chemotaxin 2 (LECT2), N-undecanoylglycine (N-Undecanoylglycine), etonogestrel (Etonogestrel), attractin (ATRN) and desloratadine (Desloratadine). The model parameters are set as follows: method=rf, mtry=4; the trainControl parameters are method=cv, number=5. According to a specific embodiment of the invention, with these model parameters the 10 biomarkers can effectively predict or diagnose COVID-19 in a patient and evaluate its severity with high accuracy.

According to an embodiment of the invention, the machine learning model is a random forest model, and the biomarkers include gamma-glutamyl-beta-aminopropionitrile (gamma-Glutamyl-beta-aminopropiononitrile), 3-carboxythiomorpholine (Thiomorpholine 3-carboxylate), 2-(styryl)-1,3-dioxolane (2-(phenylethenyl)-1,3-dioxolane), 3-hexadecanoyl oleanolic acid (3-Hexadecanoyloleanolic acid), PC(16:1(9Z)/2:0), 8-hydroxyeicosatetraenoic acid (8-HETE), gulethylglycerol inositol (archaitidylerol-myo-inositol), PC(O-16:0/O-18:0), phenylalanyl-isoleucine dipeptide (Phenylalanyl-Isoleucine) and 4-acetamido-2-aminobutyric acid (4-Acetamido-2-aminobutanoic acid). The model parameters are set as follows: method=rf, mtry=3; the trainControl parameters are method=cv, number=5. According to a specific embodiment of the invention, with these model parameters the 10 biomarkers can effectively predict or diagnose COVID-19 in a patient and evaluate its severity with high accuracy.

According to an embodiment of the invention, the machine learning model is a random forest model, and the biomarkers comprise cortactin (SRC8), inositol polyphosphate phosphatase (MINP1), serum amyloid A4 (SAA4), RAS oncogene family member RAP1B (RAP1B), filamin A (FLNA), matrix Gla protein (MGP), platelet-derived growth factor subunit B (PDGFB), attractin (ATRN), talin 1 (TLN1) and tropomyosin 4 (TPM4). The model parameters are set as follows: method=rf, mtry=2; the trainControl parameters are method=cv, number=5. According to a specific embodiment of the invention, with these model parameters the 10 biomarkers can effectively predict or diagnose COVID-19 in a patient and evaluate its severity with high accuracy.

It should be noted that, in the present application, "method" refers to the modeling method and "rf" to random forest; "mtry" is the tuning parameter used when building the random forest machine learning model, that is, the number of variables considered for the binary split at each node; in the trainControl parameters, "method" refers to the model validation method, "cv" to cross-validation, and "number" to the number of cross-validation folds.
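As an illustration of what "method=cv, number=5" means, the following minimal Python sketch partitions sample indices into five folds and yields the corresponding training and held-out index sets. It is not the implementation used in the invention; the function name `kfold_indices`, the seed and the round-robin fold assignment are assumptions made only for this example.

```python
import random


def kfold_indices(n_samples, n_folds=5, seed=0):
    """Return a list of (train_indices, held_out_indices) pairs.

    Hypothetical helper: shuffles the sample indices once and deals them
    round-robin into n_folds folds; each fold is held out exactly once.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for i, held_out in enumerate(folds):
        # Training indices are all samples that are not in the held-out fold.
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        splits.append((train, held_out))
    return splits


# Example: 5-fold split of a 109-sample training set.
splits = kfold_indices(109, n_folds=5)
for train, held_out in splits:
    print(len(train), len(held_out))
```

In each round a model would be fitted on the training indices and scored on the held-out fold; candidate values of mtry can then be compared by their average score across the five folds.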

In another aspect, the invention provides a method of determining the source of a sample to be tested. According to an embodiment of the invention, the method comprises determining, based on the content of the aforementioned biomarkers in the sample to be tested and the aforementioned machine learning model, whether the sample originates from a patient with 2019 novel coronavirus infection or from a patient with severe 2019 novel coronavirus infection, the machine learning model being as defined above. The method has good predictive performance. The method according to the embodiment of the invention can effectively judge whether a sample is derived from a patient with 2019 novel coronavirus infection or from a patient with severe 2019 novel coronavirus infection, which lays a foundation for researchers to analyze the sample further; alternatively, the method can be used clinically to diagnose COVID-19 or evaluate its severity in a patient and to reflect the patient's recovery, which helps doctors accurately predict disease progression and intervene clinically in time.

According to an embodiment of the present invention, the method for determining a source of a sample to be tested may further include at least one of the following additional technical features:

according to an embodiment of the present invention, the sample to be tested is a serum sample.

In yet another aspect, the invention provides a system for determining the source of a sample to be tested. According to an embodiment of the invention, the system comprises an assay device for determining the content of the aforementioned biomarkers in the sample to be tested; and a determining device, connected to the assay device, for determining, based on the biomarker content obtained in the assay device and the machine learning model, whether the sample to be tested is derived from a patient with 2019 novel coronavirus infection or from a patient with severe 2019 novel coronavirus infection, the machine learning model being as defined above. The system according to the embodiment of the invention can run the above method for determining the source of a sample to be tested and can effectively judge whether a sample is derived from a patient with 2019 novel coronavirus infection or from a patient with severe 2019 novel coronavirus infection, which lays a foundation for researchers to analyze the sample further; alternatively, it can be used clinically to diagnose COVID-19 or evaluate its severity in a patient and to reflect the patient's recovery, which helps doctors accurately predict disease progression and intervene clinically in time.

In yet another aspect of the invention, the invention provides a method for classifying 2019 novel coronavirus infected patients. According to an embodiment of the invention, the method comprises classifying 2019 new coronavirus infected patients based on the content of the aforementioned biomarkers and the aforementioned model in the sample to be tested, which is derived from 2019 new coronavirus infected patients. The method has good prediction performance, can effectively evaluate the severity of the 2019 novel coronavirus infected patient, can reflect the recovery condition of the patient, and improves the accuracy of risk stratification for the 2019 novel coronavirus infected patient.

In a further aspect of the invention, the invention proposes the use of a biomarker as described above in the manufacture of a kit for diagnosing 2019 novel coronavirus infection or predicting the severity of 2019 novel coronavirus infection.

In a further aspect of the invention, the invention proposes the use of a reagent for detecting a biomarker as described above in the manufacture of a kit for diagnosing 2019 novel coronavirus infection or predicting the severity of 2019 novel coronavirus infection.

According to an embodiment of the invention, the reagent comprises a probe, an antibody or a small molecule compound that specifically recognizes the biomarker.

Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

The invention provides a set of biomarkers comprising serum characteristic proteins, serum characteristic metabolites and combinations of the two. By measuring the content of the biomarkers and applying the model of the invention, healthy people and COVID-19 patients can be distinguished rapidly and effectively, and the severity of COVID-19 patients can be classified accurately. This helps doctors accurately predict disease progression and intervene clinically in time, enables treatment appropriate to the patient's condition, and prevents genuinely severe patients from missing timely and effective treatment.

Drawings

The foregoing and/or additional aspects and advantages of the invention will become apparent and may be better understood from the following description of embodiments taken in conjunction with the accompanying drawings in which:

FIG. 1 is a system diagram of determining the source of a sample to be tested according to an embodiment of the present invention;

FIG. 2 is a flow chart of a sample set analysis according to an embodiment of the present invention (where H in cohort 1 is Healthy control, M is Mild patient, S is Severe patient, clinical diagnosis is Severe at time T1 in cohort 2, and time T2 and time T3 are sampling time points of patients during treatment);

FIG. 3 is a diagram of non-targeted proteomic and metabolomic detection of cohorts 1 and 2 according to an embodiment of the present invention (where UMAP is uniform manifold approximation and projection);

FIG. 4 is a machine learning model building flow chart according to an embodiment of the invention;

FIG. 5 is a comparison of the diagnostic performance in the training set of different "feature-machine learning" models based on multi-omics data according to an embodiment of the invention (where Accuracy is the accuracy; logLoss is the log-loss function; AUPR is the area under the precision-recall curve);

FIG. 6 shows the top10 feature molecules ranked by contribution in the multi-omics CA-RF model according to an embodiment of the invention;

FIG. 7 is a rasterized density plot of the precision-recall curve (left) and the receiver operating characteristic curve (right) for the multi-omics CA-RF model according to an embodiment of the invention;

FIG. 8 is a confusion matrix plot in the validation dataset for the rRF models based on multi-omics, proteomics and metabolomics according to an embodiment of the invention (where cells on the diagonal from top left to bottom right represent correct predictions and cells off that diagonal represent incorrect predictions; H: healthy control; M: mild patient; S: severe patient; Prop: the proportion of cases in each cell; light shading represents poor classification and dark shading represents good classification);

FIG. 9 is a bar graph of patient rRF model scores in the follow-up cohort according to an embodiment of the invention, including results from sampling at multiple time points for a total of 7 critically ill patients (where the clinical diagnosis is severe at time T1, and times T2 and T3 are sampling time points during treatment);

FIG. 10 is a top10 feature molecular graph of a CA-RF model based on proteomic data in accordance with an embodiment of the present invention;

FIG. 11 is a top10 feature molecular graph of a CA-RF model based on metabonomics data in accordance with an embodiment of the invention;

FIG. 12 is a comparison of the multi-omics, metabolomic and proteomic rRF models constructed from top10 feature molecules according to an embodiment of the invention (where Accuracy is the accuracy; Mean_F1 is the mean of the F1 scores; and logLoss is the log-loss function).

Detailed Description

The invention provides a system for determining the source of a sample to be tested. As shown in fig. 1, the system includes an assay device 100 and a determination device 200. After the sample to be tested, i.e., the serum sample, enters the measuring device 100, the biomarkers in the serum sample are determined through an algorithm and a machine learning model in the measuring device. The determining means 200 is then able to determine the origin of the serum sample based on the biomarkers obtained in the measuring means 100.

Embodiments of the present invention are described in detail below. The following examples are illustrative only and are not to be construed as limiting the invention. Where specific techniques or conditions are not indicated in the examples, they follow the techniques or conditions described in the literature in this field or the product specifications. Reagents or instruments for which no manufacturer is indicated are conventional commercially available products.

Examples

1. Collection of serum samples from COVID-19 patients and healthy controls, and collection and recording of clinical data:

Patients were divided into mild (M) and severe (S) groups according to the National Health Commission guidelines for the diagnosis and management of COVID-19 patients (7th edition): mild patients have mild clinical symptoms and no pneumonia; severe patients present with dyspnea, a respiratory rate ≥ 30 breaths per minute, blood oxygen saturation ≤ 93%, a ratio of arterial partial pressure of oxygen to fraction of inspired oxygen (PaO₂/FiO₂) < 300, and/or lung infiltrates greater than 50% on lung imaging within 24 to 48 hours.

At the same time, healthy subjects were enrolled as the healthy control group (H). In addition, a separate patient cohort was enrolled, and blood samples from these patients were collected dynamically as a follow-up cohort. Peripheral blood was collected 1-10 days after admission, and serum samples were inactivated at 56 ℃ for 30 minutes before the subsequent omics analysis.

2. Proteomics and metabonomics data of the subjects' serum samples were obtained by high performance liquid chromatography-mass spectrometry:

(1) Proteomic analysis:

1) High-abundance serum proteins were removed and low-abundance proteins were enriched using a ProteoMiner™ kit (Bio-Rad, USA).

2) The samples were sequentially subjected to reduction (10 mM dithiothreitol, 37 ℃, 60 min), alkylation (40 mM iodoacetamide, room temperature, 30 min) and enzymatic digestion (trypsin : sample = 1:50, 37 ℃, 12 h).

3) The digested peptides were desalted on C18 (The Nest Group, USA) and lyophilized.

4) Preparation of library samples: (1) the treated sample peptide solutions were mixed in equal amounts, divided into 6 fractions and vacuum freeze-dried; (2) the sample sera were mixed in equal amounts and, after removal of high-abundance proteins (High Select™ Top14 Abundant Protein Depletion Mini Spin Columns, Thermo Fisher, USA), the peptides were dissolved according to the previous method, separated into 18 fractions by a liquid chromatography system and vacuum freeze-dried.

5) Mass spectrometry: (1) Library construction: the digested peptide samples were placed in formic acid and acetonitrile solution at a concentration of 0.1% containing iRT standard peptides (Biognosys, Schlieren, Switzerland), loaded at 5 μL onto an EASY-nLC 1000 system (Thermo Fisher Scientific, USA) nanoflow liquid chromatograph at a flow rate of 300 nL/min, and separated on an analytical column. Detection used positive ion scanning for 65 min. Data-dependent acquisition (DDA) was performed on a Q Exactive™ HF-X mass spectrometer (Thermo Fisher Scientific, USA). Primary mass spectrum scan range: 300-1800 m/z; mass spectrum resolution: 60,000 (m/z 200); AGC target: 3e6; maximum IT: 50 ms. The DDA data were imported directly into Spectronaut (Version 14.10; Biognosys AG, USA) software to construct a spectral library. (2) Data-independent acquisition (DIA): 2 μg of peptides from each sample (spiked with an appropriate amount of iRT standard peptides) were subjected to DIA analysis on the same platform used in the library construction step. The DIA mode included 60 variable scanning windows. The mass spectrometry parameters were: 120,000 (m/z 350-1250); NCE = 27%; AGC target = 3e6; maximum IT = 60 ms. The cycle time was 3 seconds. The DIA data were imported into Spectronaut software for analysis.

(2) Metabonomic analysis:

1) Metabolite extraction: (1) Hydrophobic metabolite extraction. 40 μL of serum sample was added to 300 μL of methanol solution containing internal standards (PC [12:0-13:0], PE [12:0-13:0], Cer/Sph mixture I and FFA [19:0]); after shaking for 2 min, 1000 μL of methyl tert-butyl ether and 250 μL of deionized water were added sequentially to extract the hydrophobic metabolites, and the extract was lyophilized with liquid nitrogen and stored at -80 ℃. (2) Hydrophilic metabolite extraction. 100 μL of serum sample was taken, 300 μL of methanol with internal standard (TMAO-D9) was added, the mixture was centrifuged to obtain the supernatant, and the supernatant was lyophilized with liquid nitrogen and stored at -80 ℃.

2) Mass spectrometry: a Dionex™ UltiMate™ 3000 Rapid Separation LC (RSLC) liquid chromatography system (Thermo Scientific, USA) and a Q Exactive™ hybrid quadrupole Orbitrap mass spectrometer (Thermo Scientific, USA) were used in combination as the detection platform. Hydrophobic metabolites were separated on an ACQUITY UPLC BEH C8 column (1.7 μm, 2.1 mm × 100 mm; Waters, USA), and mass spectrometry was performed in negative ion mode and positive ion mode, respectively; hydrophilic metabolites were separated on a UPLC BEH Amide column (2.1 mm × 100 mm, 1.7 μm; Waters, USA), and mass spectrometry was performed in positive ion mode. The mass spectrometry parameters were: scan range = m/z 150-1500; mass spectrum resolution = 70,000; AGC = 3e6; maximum IT = 50 ms; secondary mass spectrum resolution = 17,500. Data processing was performed using Xcalibur 2.2 SP1.48 software (Thermo Fisher Scientific, USA).

3. Data processing and analysis followed the workflow of extracting feature variables, constructing and selecting a model, rebuilding the model with the top10 variables, and validating the model. In this way, machine learning models were established in turn based on proteomics data, on metabonomics data, and on the integration of the two datasets. In addition, the role of the machine learning model in disease course monitoring was evaluated in a follow-up cohort. Based on the serum proteomic data and metabonomic data, a machine learning model capable of risk-stratifying patients with COVID-19 was constructed. The specific process is as follows, as shown in FIG. 4:

(1) Log2 transformation is carried out on proteomics data and metabonomics data, and median normalization and minimum filling are carried out, so that the proteomics data and metabonomics data are used as a data set constructed by a machine learning model.

(2) The 154 subjects were randomly divided into a training set of 109 subjects (70%) and a test set of 45 subjects (30%).

(3) Five algorithms were used to screen feature variable combinations (Feature combination, FP): the correlation algorithm (Correlation algorithm, CA), recursive feature elimination (Recursive feature elimination, RFE), the genetic algorithm (Genetic algorithm, GA), the Boruta algorithm (BA) and the MMPC algorithm (Max-min parents and children algorithm, MMPC).

(4) For each feature variable combination, four machine learning models (ML) were constructed: the random forest model (Random forest, RF), the K-nearest neighbor model (K-nearest neighbors, KNN), the single C5.0 decision tree model (C5.0Tree) and the partial least squares model (Partial least squares, PLS). The parameters of each FP-ML combination were optimized using a basic grid search with 5-fold cross-validation. To overcome the imbalance in sample size between study groups when constructing the RF model, each study group was assigned a weight w_c = N/(k·N_c), where c is the group, N_c is the number of samples in group c, k is the number of groups and N is the total number of samples. The FP-ML combination with the best area under the precision-recall curve (area under the precision recall curve, AUPR) and the lowest log loss, namely CA-RF, was selected for further optimization.

(5) Model variables are prioritized based on the CA-RF model, and the top10 variables (top 10) are further selected to construct an RF model (i.e., rRF model).

(6) The rRF model was validated in the validation dataset, and its ability to monitor the disease course was evaluated in the follow-up cohort (cohort 2).
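Steps (1), (2) and (4) above can be sketched in Python as follows. This is only a sketch under stated assumptions, not the procedure actually used: the text does not say how median normalization or minimum filling were carried out, so the code assumes per-sample median centering in log2 space and filling each feature's missing values with that feature's minimum. The text also does not say whether the random split was stratified; a per-group split that rounds 70% up is assumed here because it happens to reproduce the reported 109/45 split for group sizes of 30, 42 and 82. All function names are hypothetical.

```python
import math
import random
from statistics import median


def preprocess(matrix):
    """Log2-transform, median-normalize each sample, fill gaps with feature minimum.

    matrix: list of samples, each a list of raw intensities (None = missing).
    Assumes every sample and every feature has at least one observed value.
    """
    logged = [[math.log2(v) if v is not None else None for v in row] for row in matrix]
    normalized = []
    for row in logged:
        m = median(v for v in row if v is not None)  # per-sample median (assumption)
        normalized.append([v - m if v is not None else None for v in row])
    for j in range(len(matrix[0])):
        col_min = min(row[j] for row in normalized if row[j] is not None)
        for row in normalized:
            if row[j] is None:
                row[j] = col_min  # per-feature minimum filling (assumption)
    return normalized


def stratified_split(labels, seed=0):
    """Split indices roughly 70/30 within each group, rounding the 70% part up."""
    rng = random.Random(seed)
    by_group = {}
    for idx, lab in enumerate(labels):
        by_group.setdefault(lab, []).append(idx)
    train, test = [], []
    for idxs in by_group.values():
        rng.shuffle(idxs)
        n_train = (len(idxs) * 7 + 9) // 10  # integer ceiling of 70%
        train.extend(idxs[:n_train])
        test.extend(idxs[n_train:])
    return sorted(train), sorted(test)


def group_weights(counts):
    """w_c = N / (k * N_c) for each group c, as defined in step (4)."""
    n_total, k = sum(counts.values()), len(counts)
    return {g: n_total / (k * n_c) for g, n_c in counts.items()}


labels = ["H"] * 30 + ["M"] * 42 + ["S"] * 82
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
print(group_weights({"H": 30, "M": 42, "S": 82}))
```

With this weighting each group contributes the same total weight N/k, so the smaller healthy and mild groups are not dominated by the larger severe group. The weights printed here use the full cohort counts for illustration; weights computed on the training set alone would differ slightly.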

Specific data and results are described below.

1. Enrollment of study cohorts. Study cohort 1 included healthy controls (H, n=30), mild patients (M, n=42) and severe patients (S, n=82). Study cohort 2 included 7 critically ill patients, from whom 15 peripheral blood samples were dynamically collected, as shown in FIG. 2.

2. Serum omics testing. 717 serum metabolites and 628 serum proteins (stably detected in 50% of patients in at least one study group) were identified by metabonomic analysis and proteomic analysis, respectively. Uniform manifold approximation and projection (UMAP) analysis showed that both omics datasets separated the H, M and S populations well, as shown in FIG. 3.
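The detection filter mentioned above ("stably detected in 50% of patients in at least one study group") can be illustrated with a short Python sketch. The interpretation is an assumption: "stably detected" is treated as a non-missing value, and the threshold is read as "at least 50%". The function name `keep_feature` is hypothetical.

```python
def keep_feature(values, groups, min_fraction=0.5):
    """Keep a feature if it is detected in >= min_fraction of samples in any group.

    values: one measurement per sample (None = not detected).
    groups: the study group label of each sample, in the same order.
    """
    total, detected = {}, {}
    for value, group in zip(values, groups):
        total[group] = total.get(group, 0) + 1
        if value is not None:
            detected[group] = detected.get(group, 0) + 1
    return any(detected.get(g, 0) / total[g] >= min_fraction for g in total)


groups = ["H", "H", "M", "M", "S", "S"]
print(keep_feature([None, None, 1.0, None, None, None], groups))   # detected in half of M
print(keep_feature([None, None, None, None, None, None], groups))  # never detected
```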

3. Screening of feature molecules and construction of machine learning models. First, study cohort 1 was divided into a training set (n=109) and a validation set (n=45). Then 5 feature molecule screening algorithms (correlation algorithm (Correlation algorithm, CA), recursive feature elimination (Recursive feature elimination, RFE), genetic algorithm (Genetic algorithm, GA), Boruta algorithm (BA), and MMPC algorithm (Max-min parents and children algorithm, MMPC)) were used to obtain 5 molecular feature combinations (FP): CA-FP (n=774 feature molecules), RFE-FP (n=4 feature molecules), GA-FP (n=281 feature molecules), BA-FP (n=162 feature molecules) and MMPC-FP (n=7 feature molecules), respectively. Thereafter, 24 "molecular feature-machine learning" models were constructed with 4 machine learning algorithms (random forest model (Random forest, RF), K-nearest neighbor model (K-nearest neighbors, KNN), single C5.0 decision tree model (C5.0Tree) and partial least squares model (Partial least squares, PLS)), as shown in FIG. 5. By comparison, the model with the best area under the precision-recall curve (area under the precision recall curve, AUPR) and the lowest log loss, namely the CA-RF model, was selected as the best model for optimization.

Further, according to the contribution of the model variables, as shown in FIG. 6, the top10 feature molecules of the CA-RF model (gamma-Glutamyl-beta-aminopropiononitrile, Thiomorpholine 3-carboxylate, 2-(phenylethenyl)-1,3-dioxolane, SRC8, FLNA, LECT2, N-Undecanoylglycine, Etonogestrel, ATRN and Desloratadine) were used to rebuild the RF model (that is, the rRF model; model construction again used the caret package with 5-fold cross-validation, the tuning parameter mtry=4, sample weights, and default values for the other parameters). The micro-average AUPR and micro-average AUROC of the rRF model were 0.7693 (95% CI, 0.7570-0.7708) and 0.9997 (95% CI, 0.9997-0.9998), respectively (as shown in FIG. 7). Finally, the accuracy of the rRF models was verified in the validation set, and the confusion matrices are shown in FIG. 8. The multi-omics rRF model and the metabonomics rRF model classified the 45 subjects in the test set with an accuracy of 100%, and the proteomics rRF model classified them with an accuracy of 91.84%, which indicates that all three models can accurately classify the samples in the validation set. Furthermore, analysis of the samples in cohort 2 found that the rRF model score (the probability that a sample is classified as severe minus the probability that it is classified as mild) for 71.42% (5/7) of the subjects showed a steadily decreasing trend as the patients recovered, suggesting that the rRF model also has potential as a means of clinical course monitoring (FIG. 9).
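The rRF model score used for course monitoring (probability of the severe class minus probability of the mild class) can be illustrated as follows. The class probabilities in this Python sketch are hypothetical values chosen only to show the calculation; they are not data from the study, and the helper names are assumptions.

```python
def rrf_score(class_probabilities):
    """Score = P(severe) - P(mild), using the group labels S and M."""
    return class_probabilities["S"] - class_probabilities["M"]


def is_steadily_decreasing(scores):
    """True if every score is strictly lower than the one before it."""
    return all(later < earlier for earlier, later in zip(scores, scores[1:]))


# Hypothetical predicted class probabilities for one patient at T1, T2 and T3.
timepoints = [
    {"H": 0.05, "M": 0.15, "S": 0.80},
    {"H": 0.10, "M": 0.40, "S": 0.50},
    {"H": 0.20, "M": 0.60, "S": 0.20},
]
scores = [rrf_score(p) for p in timepoints]
print(scores, is_steadily_decreasing(scores))
```

A score near 1 means the model places the sample firmly in the severe class, a score near -1 firmly in the mild class; a falling score over successive samples is what the text describes as tracking recovery.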

To further verify the above multi-omics rRF model, rRF models based on the proteomic top10 feature molecules (SRC8, MINP1, SAA4, RAP1B, FLNA, MGP, PDGFB, ATRN, TLN1 and TPM4) and on the metabonomic top10 feature molecules (gamma-Glutamyl-beta-aminopropiononitrile, Thiomorpholine 3-carboxylate, 2-(phenylethenyl)-1,3-dioxolane, 3-Hexadecanoyloleanolic acid, PC(16:1(9Z)/2:0), 8-HETE, archalidyl-myo-inositol, PC(O-16:0/O-18:0), Phenylalanyl-Isoleucine and 4-Acetamido-2-aminobutanoic acid) were constructed using the same analytical procedure. Comparison of the multi-omics, metabonomic and proteomic rRF models, as shown in FIG. 12, indicated that the multi-omics rRF model performed better than the single-omics rRF models.

Therefore, random forest models built on the top-10 characteristic molecules from proteomics, from metabolomics, or from the integration of the two omics layers can be used to distinguish severe from mild COVID-19 patients, show good discrimination on samples from such patients, and can improve the accuracy of risk stratification for COVID-19 patients.

In the description of the present specification, a description referring to the terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine the different embodiments or examples described in this specification, and the features of the different embodiments or examples, provided they do not contradict one another.

While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and are not to be construed as limiting the invention, and that changes, modifications, substitutions and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the invention.

Claims

1. A set of biomarkers for diagnosing the severity of 2019 novel coronavirus infection, said biomarkers comprising cortactin, multiple inositol polyphosphate phosphatase 1, serum amyloid A4, RAS oncogene family member RAP1B, filamin A, matrix Gla protein, platelet-derived growth factor subunit B, attractin, talin 1, and tropomyosin 4.

2. A set of biomarkers for diagnosing the severity of 2019 novel coronavirus infection, said biomarkers comprising gamma-glutamyl-beta-aminopropionitrile, 3-carboxythiomorpholine, 2-(styryl)-1,3-dioxolane, 3-hexadecanoyloleanolic acid, PC(16:1(9Z)/2:0), 8-hydroxyeicosatetraenoic acid, archaetidyl-myo-inositol, PC(O-16:0/O-18:0), phenylalanyl-isoleucine dipeptide, and 4-acetamido-2-aminobutyric acid.

3. A set of biomarkers for diagnosing the severity of 2019 novel coronavirus infection, said biomarkers comprising gamma-glutamyl-beta-aminopropionitrile, 3-carboxythiomorpholine, 2-(styryl)-1,3-dioxolane, cortactin, filamin A, leukocyte cell-derived chemotaxin 2, N-undecanoylglycine, etonogestrel, attractin, and desloratadine.

4. A biomarker according to any of claims 1 to 3, wherein the biomarker is screened out by an algorithm and a machine learning model.

5. The biomarker according to claim 4 wherein the algorithm is selected from at least one of a correlation algorithm, a recursive feature elimination algorithm, a genetic algorithm, a Boruta algorithm and an MMPC algorithm.

6. The biomarker of claim 4, wherein the algorithm is a correlation algorithm.

7. The biomarker of claim 4, wherein the machine learning model is selected from at least one of a random forest model, a k-nearest neighbor model, a single C5.0 decision tree model, and a partial least squares model.

8. The biomarker of claim 4, wherein the machine learning model is a random forest model.

9. The biomarker of claim 4, wherein the machine learning model is a random forest model, the biomarker comprises gamma-glutamyl-beta-aminopropionitrile, 3-carboxythiomorpholine, 2-(styryl)-1,3-dioxolane, cortactin, filamin A, leukocyte cell-derived chemotaxin 2, N-undecanoylglycine, etonogestrel, attractin, and desloratadine, and the model parameters are set as follows: method=rf, mtry=4; the trainControl parameters method=cv, number=5.

10. The biomarker of claim 4, wherein the machine learning model is a random forest model, the biomarker comprises gamma-glutamyl-beta-aminopropionitrile, 3-carboxythiomorpholine, 2-(styryl)-1,3-dioxolane, 3-hexadecanoyloleanolic acid, PC(16:1(9Z)/2:0), 8-hydroxyeicosatetraenoic acid, archaetidyl-myo-inositol, PC(O-16:0/O-18:0), phenylalanyl-isoleucine dipeptide, and 4-acetamido-2-aminobutyric acid, and the model parameters are set as follows: method=rf, mtry=3; the trainControl parameters method=cv, number=5.

11. The biomarker of claim 4, wherein the machine learning model is a random forest model, the biomarker comprises cortactin, multiple inositol polyphosphate phosphatase 1, serum amyloid A4, RAS oncogene family member RAP1B, filamin A, matrix Gla protein, platelet-derived growth factor subunit B, attractin, talin 1, and tropomyosin 4, and the model parameters are set as follows: method=rf, mtry=2; the trainControl parameters method=cv, number=5.

12. A system for determining the source of a sample to be tested, comprising:

a measuring device for determining the content of the biomarker according to any one of claims 1-3 in a sample to be tested;

a determining device, connected to the measuring device, for determining whether the sample to be tested is from a patient with 2019 novel coronavirus infection or from a patient with severe 2019 novel coronavirus infection, based on the content of the biomarker obtained by the measuring device and a machine learning model, wherein the machine learning model is as defined in any one of claims 9-11, and the sample to be tested is a serum sample.

13. Use of the biomarker of any of claims 1-3 in the preparation of a kit for diagnosing 2019 novel coronavirus infection or predicting the severity of 2019 novel coronavirus infection.

14. Use of a reagent for detecting a biomarker according to any of claims 1 to 3, in the preparation of a kit for diagnosing 2019 novel coronavirus infection or predicting 2019 novel coronavirus infection severity.

15. The use of claim 14, wherein the reagent comprises a probe, an antibody, or a small-molecule compound that specifically recognizes the biomarker.