WO2022166934A1 - Intestinal flora markers for cardiovascular disease onset risk assessment and their application - Google Patents

Intestinal flora markers for cardiovascular disease onset risk assessment and their application

Publication number
WO2022166934A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
bacteroides
unclassified
intestinal flora
heart disease
Prior art date
Application number
PCT/CN2022/075241
Other languages
English (en)
French (fr)
Inventor
杨跃进
董超然
杨进刚
朱海波
许靖
Original Assignee
中国医学科学院阜外医院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202110157590.9A (CN112509635A)
Priority claimed from CN202110157645.6A (CN112509701A)
Priority claimed from CN202110157644.1A (CN112509700A)
Application filed by 中国医学科学院阜外医院
Publication of WO2022166934A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/50 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for simulation or modelling of medical disorders

Definitions

  • the invention belongs to the field of biomedical technology, and specifically relates to a disease detection technology using intestinal flora as a marker for risk assessment of cardiovascular disease and its related applications.
  • Cardiovascular disease here mainly refers to coronary atherosclerotic heart disease, abbreviated as coronary heart disease (CAD).
  • the mainstream view is that cardiovascular disease, including coronary heart disease, is a type of immunometabolic disease, as well as a type of systemic, progressive, and inflammatory disease.
  • the main lesions are atherosclerotic plaque formation and inflammatory progression, whose essential features include lipid deposition and the accumulation of inflammatory cells, resulting in a nonbacterial inflammatory response known as metabolic inflammation. Throughout the formation and progression of atherosclerotic plaques, from lipid streaks to atherosclerotic plaques, to plaque rupture and the multiple steps leading to thrombosis, various inflammatory cells and a large number of inflammatory mediators are always involved.
  • total cholesterol (TC), high blood pressure (hypertension), diabetes, and age are all risk factors associated with cardiovascular disease and with its risk assessment.
  • the intestinal mucosa is the largest immune-active organ in the body.
  • the tens of billions of bacteria residing in the intestine are called the "gut microbiota", and the host provides the gut microbiota with an appropriate environment and necessary nutrients.
  • the gut microbiota is involved in regulating various functions of the human body, such as providing metabolic nutrients to the host, participating in growth promotion and immune regulation, eliminating pathogenic microorganisms, maintaining the integrity of the intestinal barrier and normal homeostasis.
  • the gut microbiota plays a source-regulatory role in human immune-inflammatory and metabolic diseases, and is closely associated with the presence of metabolic inflammation and insulin resistance, atherosclerosis, obesity, and diabetes.
  • An object of the present invention is to provide a set of markers associated with the risk of cardiovascular disease.
  • Another object of the present invention is to provide a method for establishing a cardiovascular disease risk assessment model.
  • Another object of the present invention is to provide a cardiovascular disease risk assessment model.
  • Another object of the present invention is to provide a cardiovascular disease risk assessment device.
  • Another object of the present invention is to provide a method for assessing the risk of cardiovascular disease.
  • the inventors of the present case have determined a set of biomarkers related to the risk of cardiovascular disease, which include multiple intestinal bacteria. By detecting information on these intestinal bacteria in samples from an individual, the individual's risk of cardiovascular disease can be assessed well.
  • the present invention provides an application of a reagent for detecting individual information in preparing a cardiovascular disease risk assessment device (assessment system), wherein the individual information includes intestinal flora information, the intestinal flora includes at least 10 kinds of intestinal bacteria, and the intestinal bacteria are differential bacteria screened based on the metagenomic data of the intestinal flora of patients with cardiovascular disease and healthy people.
  • the cardiovascular disease is stable coronary heart disease, acute coronary syndrome, or acute coronary syndrome in patients with stable coronary heart disease.
  • the intestinal flora includes: Bacteroides massiliensis, Eggerthella unclassified, Klebsiella pneumoniae, Oscillibacter unclassified, Paraprevotella unclassified, Lachnospiraceae bacterium_5_1_63FAA, Anaerostipes hadrus, Bilophila unclassified, Roseburia hominis, Eubacterium ventriosum, Prevotella copri, Barnesiella intestinihominis, Bacteroides xylanisolvens, Eubacterium hallii, Megamonas unclassified, Bacteroides plebeius, Parabacteroides distasonis, and Escherichia coli.
  • the intestinal flora includes: Bifidobacterium longum, Lachnospiraceae bacterium_5_1_63FAA, Alistipes onderdonkii, Collinsella aerofaciens, Eubacterium eligens, Faecalibacterium prausnitzii, Bacteroides vulgatus, Oscillibacter unclassified, Bacteroides ovatus, and Eubacterium ventriosum.
  • the intestinal flora includes: Bifidobacterium longum, Streptococcus anginosus, Coprococcus comes, Collinsella aerofaciens, Faecalibacterium prausnitzii, Bacteroides ovatus, Anaerotruncus colihominis, Bacteroides fragilis, Holdemania filiformis, Eubacterium rectale, and Streptococcus salivarius.
  • the weight of each bacterium in the intestinal flora is determined according to the following feature importance values, or the weight ratio of the bacteria in the intestinal flora is: Bacteroides massiliensis, 23; Eggerthella unclassified, 19; Klebsiella pneumoniae, 16; Unclassified Clostridium, 15; Paraprevotella unclassified, 15; Lachnospiraceae bacterium_5_1_63FAA, 13; Corynebacterium faecalis, 11; Bilophila unclassified, 10; Roseburia hominis, 8; Prevotella, 8; Pasteurella enterica, 6; Escherichia coli, 5; Eubacterium, 5.
  • each bacterium in the intestinal flora is weighted according to the following feature importance values: Bifidobacterium longum, 47; Lachnospiraceae bacterium_5_1_63FAA, 44; Alternaria, 43; Collinsella aerofaciens, 32; Eubacterium, 31; Faecalibacterium prausnitzii, 30; Bacteroides vulgatus, 28; Bacteroidetes, 20; Eubacterium phleiformis, 14.
  • the weight of each bacterium in the intestinal flora is determined according to the following feature importance values, or the weight ratio of the bacteria in the intestinal flora is: Bifidobacterium longum, 13; Streptococcus anginosus, 11; ; Faecalibacterium prausnitzii, 9; Bacteroides ovatus, 8; Clynebacterium anaerobes, 8; Bacteroides fragilis, 7; Holdemania filiformis, 6; Eubacterium rectale, 4; Streptococcus salivarius, 4.
  • each of the intestinal flora used as a marker in the present invention is a risk factor for the onset of cardiovascular disease.
  • the higher the abnormality of each risk factor (that is, the greater the difference in the abundance of each intestinal bacterium compared with healthy people), the higher the individual's risk of cardiovascular disease.
  • the individual information may further include one or more of total cholesterol level, hypertension, diabetes, and age.
  • the techniques of the present invention are particularly suitable for assessing the risk of developing stable coronary heart disease in individuals from East Asian populations.
  • the present invention provides a cardiovascular disease risk assessment device, which includes a detection unit and a data analysis unit, wherein:
  • the detection unit is used to detect individual information and obtain a detection result; wherein, the individual information is the same as the individual information described in any one of claims 1-5;
  • the data analysis unit is used for analyzing and processing the detection result of the detection unit.
  • the detection unit includes any reagent material that can obtain the information of each characteristic bacterium in the intestinal flora of the individual to be tested, and any feasible method in the prior art can be used to detect the information of each characteristic bacterium in the intestinal flora of the individual to be tested.
  • the detection unit includes reagent materials for detecting DNA data of stool samples.
  • the process that the data analysis unit is used to analyze and process the detection result of the detection unit includes:
  • the characteristic data of the intestinal flora is determined.
  • when the data analysis unit analyzes and processes the detection result of the detection unit, the processing includes: matching the detection result of the individual information with a weight coefficient to calculate the risk assessment score of the individual to be tested.
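The weight-matching step can be illustrated with a minimal sketch. The three weights are copied from entries of the weight list above; everything else is an assumption made for illustration: the `deviation` input (how far each marker's abundance differs from a healthy reference, scaled to 0..1), the function name `risk_score`, and the normalised weighted sum itself, which is not stated in the text as the scoring formula.

```python
from typing import Dict

# Feature-importance weights copied from three entries of the weight list in
# the text; the remaining entries are omitted to keep the sketch short.
WEIGHTS: Dict[str, float] = {
    "Bifidobacterium longum": 13,
    "Bacteroides fragilis": 7,
    "Streptococcus salivarius": 4,
}


def risk_score(deviation: Dict[str, float],
               weights: Dict[str, float] = WEIGHTS) -> float:
    """Return a weighted average of per-marker deviation values.

    `deviation` is a hypothetical input: for each marker, how abnormal its
    abundance is relative to healthy people, scaled to the range 0..1.
    The normalised weighted sum is an assumed formula for illustration.
    Markers missing from `deviation` contribute zero.
    """
    total_weight = sum(weights.values())
    weighted = sum(w * deviation.get(name, 0.0) for name, w in weights.items())
    return weighted / total_weight


example = {
    "Bifidobacterium longum": 1.0,
    "Bacteroides fragilis": 0.0,
    "Streptococcus salivarius": 0.5,
}
score = risk_score(example)  # (13*1.0 + 7*0.0 + 4*0.5) / 24
```

Dividing by the total weight keeps the score in 0..1 regardless of how many markers are used, which makes scores comparable across the three marker panels.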
  • the embodiments of the present invention provide a method for establishing an onset risk assessment (prediction) model of stable coronary heart disease, so that the established model can be used to assess the onset risk of stable coronary heart disease and improve evaluation accuracy. The method includes:
  • the characteristic data of intestinal flora is determined, and the biomarkers of stable coronary heart disease are pre-screened according to the historical information of the relative abundance of differential bacteria.
  • the relative abundance history information of the differential bacteria is obtained by performing differential analysis on the relative abundance history information of stable coronary heart disease patients and healthy people;
  • the intestinal flora characteristic data is input into a pre-established machine learning model for training to obtain a stable coronary heart disease risk assessment model.
  • the method for establishing an onset risk assessment model for stable coronary heart disease provided in the embodiment of the present invention further includes:
  • the performance of the machine learning model is evaluated using the AUROC indicator.
  • the present invention also provides a method for evaluating the onset risk of stable coronary heart disease using a stable coronary heart disease risk assessment model qualified for performance evaluation.
  • An embodiment of the present invention provides a device for establishing an onset risk assessment model of stable coronary heart disease, which is used to perform risk assessment on stable coronary heart disease, so as to improve the evaluation accuracy of the established model, and the device includes:
  • the DNA data acquisition module is used to obtain DNA data of stool samples from patients with stable coronary heart disease and healthy people;
  • a paired-end sequencing processing module is used to perform paired-end sequencing processing on the fecal sample DNA data to obtain intestinal flora metagenomic data;
  • An annotation analysis module for performing species annotation analysis and functional annotation analysis on the gut microbiota metagenomic data to obtain relative abundance information of patients with stable coronary heart disease and healthy people;
  • the characteristic data determination module is used to determine the characteristic data of intestinal flora according to the relative abundance information and the pre-screened biomarkers of stable coronary heart disease, where the biomarkers of stable coronary heart disease are pre-screened according to the relative abundance history information of differential bacteria, and the relative abundance history information of the differential bacteria is obtained by performing differential analysis on the relative abundance history information of stable coronary heart disease patients and healthy people;
  • the model training module is used for inputting the intestinal flora characteristic data into a pre-established machine learning model for training, so as to obtain a stable coronary heart disease risk assessment model.
  • the device for establishing an onset risk assessment model for stable coronary heart disease further includes:
  • a parameter adjustment module for adjusting the parameters of the machine learning model using the GridSearchCV algorithm and the Hyperopt algorithm
  • the model testing module is used to test the parameter-adjusted machine learning model using the test data
  • the performance evaluation module is used to evaluate the performance of the machine learning model by using the AUROC indicator according to the test results.
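As a rough illustration of the parameter adjustment module, the sketch below performs an exhaustive search over a small parameter grid, which is the idea behind grid search. It is a standard-library analogue, not the scikit-learn `GridSearchCV` API, it scores on a single toy validation set rather than by cross-validation, and it does not cover Hyperopt. The toy data, the threshold-rule "model", and all function names are hypothetical.

```python
from itertools import product
from typing import Callable, Dict, Iterable, Optional, Tuple


def grid_search(param_grid: Dict[str, Iterable],
                score_fn: Callable[..., float]) -> Tuple[Optional[dict], float]:
    """Score every parameter combination and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:  # keep the first combination that scores highest
            best_params, best_score = params, score
    return best_params, best_score


# Toy validation data: (relative abundance of one marker, label 1 = patient).
VALIDATION = [(0.02, 0), (0.05, 0), (0.11, 1), (0.20, 1), (0.08, 0), (0.15, 1)]


def accuracy(threshold: float, invert: bool) -> float:
    """Accuracy of a one-feature threshold rule on the toy data."""
    correct = 0
    for abundance, label in VALIDATION:
        predicted = int(abundance > threshold)
        if invert:
            predicted = 1 - predicted
        correct += int(predicted == label)
    return correct / len(VALIDATION)


best_params, best_score = grid_search(
    {"threshold": [0.03, 0.06, 0.10, 0.18], "invert": [False, True]},
    accuracy,
)
```

Exhaustive search is simple and reproducible but grows multiplicatively with each added parameter, which is one reason a guided search such as Hyperopt might be paired with it.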
  • the embodiments of the present invention provide a method for establishing a risk prediction model for acute coronary syndrome, so that the established model can be used to predict the risk of acute coronary syndrome and improve prediction accuracy. The method includes:
  • Species annotation analysis and functional annotation analysis were performed on the metagenomic data of gut microbiota qualified for quality assessment, and the relative abundance information of patients with acute coronary syndrome and healthy people was obtained;
  • the characteristic data of intestinal flora is determined, and the biomarkers of acute coronary syndrome are determined according to the historical information of relative abundance of differential bacteria.
  • the relative abundance history information of the differential bacteria is obtained by performing differential analysis on the relative abundance history information of patients with acute coronary syndrome and healthy people;
  • the intestinal flora characteristic data is input into a pre-established machine learning model for training, and an acute coronary syndrome risk prediction model is obtained.
  • the present invention also provides a method for predicting the risk of acute coronary syndrome using the acute coronary syndrome risk prediction model.
  • a device for establishing a risk prediction model for acute coronary syndrome is also provided, so that the established model is used to predict the risk of acute coronary syndrome, and the prediction accuracy is improved.
  • the device includes:
  • DNA data acquisition module for acquiring DNA data of stool samples from patients with acute coronary syndrome and healthy people
  • a paired-end sequencing processing module is used to perform paired-end sequencing processing on the fecal sample DNA data to obtain intestinal flora metagenomic data;
  • the data trimming module is used to remove the adapters from the gut microbiota metagenomic data using Trimmomatic software, and to trim the adapter-removed gut microbiota metagenomic data according to the preset base quality value;
  • Quality assessment module for quality assessment of trimmed gut microbiota metagenomic data using FastQC software
  • the annotation analysis module is used to perform species annotation analysis and functional annotation analysis on the qualified gut microbiota metagenomic data to obtain the relative abundance information of patients with acute coronary syndrome and healthy people;
  • the historical information of relative abundance of bacteria is pre-screened, and the historical information of relative abundance of differential bacteria is obtained by differential analysis of historical information of relative abundance of acute coronary syndrome patients and healthy people;
  • the model training module is used for inputting the intestinal flora characteristic data into a pre-established machine learning model for training to obtain an acute coronary syndrome risk prediction model.
  • the embodiments of the present invention provide a method for establishing an acute coronary syndrome risk prediction (assessment) model for stable coronary heart disease, so that the established model can be used to predict the risk of acute coronary syndrome and improve prediction accuracy.
  • the method includes:
  • Paired-end sequencing is performed on the DNA data of the screened fecal samples to obtain the metagenome data of intestinal flora;
  • the characteristic data of intestinal flora is determined, and the biomarkers of acute coronary syndrome are determined according to the historical information of relative abundance of differential bacteria.
  • the relative abundance history information of the differential bacteria is obtained by differentially analyzing the relative abundance history information of patients with acute coronary syndrome and patients with stable coronary heart disease;
  • the present invention also provides a method for predicting acute coronary syndrome risk for stable coronary heart disease using the acute coronary syndrome risk prediction model.
  • the embodiment of the present invention provides a device for establishing an acute coronary syndrome risk prediction model for stable coronary heart disease, so that the established model can be used to predict the risk of acute coronary syndrome and improve the prediction accuracy,
  • the device includes:
  • the DNA data acquisition module is used to obtain DNA data of stool samples from patients with acute coronary syndrome and patients with stable coronary heart disease;
  • a concentration data determination module used for determining the total amount data and the total concentration data of the DNA data of the stool sample by using an agarose gel method
  • a DNA data screening module configured to compare the total amount data and the total concentration data with a preset threshold, and screen the stool sample DNA data according to the comparison result;
  • the paired-end sequencing processing module is used to perform paired-end sequencing processing on the screened fecal sample DNA data to obtain intestinal flora metagenomic data;
  • An annotation analysis module for performing species annotation analysis and functional annotation analysis on the gut microbiota metagenomic data to obtain relative abundance information of patients with acute coronary syndrome and patients with stable coronary heart disease;
  • the historical information of relative abundance of bacteria is pre-screened, and the historical information of relative abundance of differential bacteria is obtained by differential analysis of the historical information of relative abundance of patients with acute coronary syndrome and patients with stable coronary heart disease;
  • the model training module is used for inputting the intestinal flora characteristic data into a pre-established machine learning model for training to obtain an acute coronary syndrome risk prediction model.
  • a cardiovascular disease risk assessment device, which includes: a risk assessment module for performing cardiovascular disease risk assessment using a cardiovascular disease risk assessment model that has passed performance evaluation.
  • the present invention also provides another computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements: obtaining an individual's cardiovascular disease risk assessment result based on the information of the individual to be tested;
  • the individual information is the same as the aforementioned individual information of the present invention.
  • the present invention also provides another computer-readable storage medium storing computer program instructions which, when executed, implement: obtaining an individual's cardiovascular disease risk assessment result based on the information of the individual to be tested;
  • the individual information is the same as the aforementioned individual information of the present invention.
  • the characteristics of the intestinal flora of patients with cardiovascular disease are fully considered, and a machine learning algorithm is used to screen, from complex and cumbersome biological big data, non-invasive biomarkers that can be used to assess and monitor the risk of cardiovascular disease, which improves the accuracy of assessment and fills the gap in clinical early warning of cardiovascular disease.
  • FIG. 1 is a schematic diagram of a risk assessment method for stable coronary heart disease in an embodiment of the present invention
  • FIG. 2 is an AUROC curve diagram in the training set in the embodiment of the present invention.
  • FIG. 3 is a schematic diagram of the biomarkers of stable coronary heart disease that play an important role in the model screened in the embodiment of the present invention
  • FIG. 4 is a structural diagram of a risk assessment device for stable coronary heart disease in an embodiment of the present invention.
  • FIG. 5 is an AUROC curve diagram of a stable coronary heart disease risk assessment model in another embodiment.
  • FIG. 6 is an AUROC curve diagram in a training set of risk prediction of acute coronary syndrome in a specific embodiment of the present invention.
  • FIG. 7 is a schematic diagram of the screened biomarkers of acute coronary syndrome that play an important role in the model according to a specific embodiment of the present invention.
  • FIG. 8 is an AUROC curve diagram of an acute coronary syndrome risk assessment model according to another specific embodiment of the present invention.
  • FIG. 9 is an AUROC curve diagram in a training set of acute coronary syndrome risk prediction for stable coronary heart disease in an embodiment of the present invention.
  • FIG. 10 is a schematic diagram of the biomarkers for acute coronary syndrome with stable coronary heart disease that are screened in the embodiment of the present invention and play an important role in the model.
  • FIG. 11 is an AUROC curve diagram of an acute coronary syndrome risk assessment model for stable coronary heart disease according to another specific embodiment of the present invention.
  • Embodiment 1 Risk assessment of stable coronary heart disease
  • an embodiment of the present invention provides a method for establishing a risk assessment model for stable coronary heart disease. As shown in FIG. 1 , the method may include:
  • Step 101 obtaining DNA data of stool samples of patients with stable coronary heart disease and healthy people;
  • Step 102 performing paired-end sequencing processing on the fecal sample DNA data to obtain intestinal flora metagenomic data
  • Step 103 performing species annotation analysis and functional annotation analysis on the intestinal flora metagenomic data to obtain relative abundance information of patients with stable coronary heart disease and healthy people;
  • Step 104 determining intestinal flora characteristic data according to the relative abundance information and the pre-screened biomarkers of stable coronary heart disease, where the biomarkers of stable coronary heart disease are pre-screened according to the relative abundance history information of differential bacteria, and the relative abundance history information of the differential bacteria is obtained by performing differential analysis on the relative abundance history information of stable coronary heart disease patients and healthy people;
  • Step 105 inputting the intestinal flora characteristic data into a pre-established machine learning model for training to obtain a stable coronary heart disease risk assessment model
  • Step 106 using the GridSearchCV algorithm and the Hyperopt algorithm to adjust the parameters of the machine learning model
  • Step 107 using the test data to test the parameter-adjusted machine learning model
  • Step 108 According to the test result, use the AUROC indicator to evaluate the performance of the machine learning model.
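The AUROC indicator used in the performance evaluation step can be computed directly from test labels and model scores. The sketch below uses the pairwise definition (the probability that a randomly chosen patient scores higher than a randomly chosen healthy subject, with ties counted as one half). The labels and scores are made-up illustrative values, not results from the patent.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive sample scores
    higher than a randomly chosen negative one (ties count as 0.5).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Illustrative test-set labels (1 = patient) and predicted risk scores.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.7, 0.1]
value = auroc(labels, scores)  # 7 of the 9 positive/negative pairs are ordered correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation. The pairwise loop is quadratic in sample count, which is acceptable for a sketch but would normally be replaced by a rank-based computation on large cohorts.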
  • the present invention also provides a method for assessing the risk of developing stable coronary heart disease, the method comprising:
  • Step 109 using the stable coronary heart disease risk assessment model qualified for the performance evaluation to assess the incidence risk of stable coronary heart disease.
  • the DNA data of stool samples of patients with stable coronary heart disease and healthy people are obtained; the DNA data of the stool samples are subjected to paired-end sequencing to obtain the metagenomic data of intestinal flora; species annotation analysis and functional annotation analysis are performed on the gut microbiota metagenomic data to obtain the relative abundance information of patients with stable coronary heart disease and healthy people; the characteristic data of intestinal flora is determined according to the relative abundance information and the pre-screened biomarkers of stable coronary heart disease, where the biomarkers of stable coronary heart disease are pre-screened according to the relative abundance history information of differential bacteria, and the relative abundance history information of the differential bacteria is obtained by differential analysis of the relative abundance history information of stable coronary heart disease patients and healthy people; the intestinal flora characteristic data is input into the pre-established machine learning model for training to obtain the stable coronary heart disease risk assessment model; the GridSearchCV algorithm and the Hyperopt algorithm are used to adjust the parameters of the machine learning model; the test data is used to test the machine learning model after parameter adjustment; the AUROC indicator is used to evaluate the performance of the machine learning model according to the test results; and the stable coronary heart disease risk assessment model that has passed performance evaluation is used to assess the risk of stable coronary heart disease.
  • the embodiments of the present invention fully consider the characteristics of the intestinal flora of patients with stable coronary heart disease, and use machine learning algorithms to screen, from complex and cumbersome biological big data, non-invasive biomarkers that can be used to assess and monitor the risk of stable coronary heart disease, which improves the accuracy of assessment and fills the gap in clinical early warning of stable coronary heart disease.
  • DNA data of stool samples of stable coronary heart disease patients and healthy people are obtained.
  • the agarose gel method is used to determine the total amount data and the total concentration data of the DNA data of the stool samples; the total amount data and the total concentration data are compared with preset thresholds; and the DNA data of the stool samples is screened according to the comparison result.
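The threshold comparison and screening can be sketched as a simple filter. The default thresholds follow the quality requirement stated later in this embodiment, read here as inclusive lower bounds of 1 μg total DNA and 20 ng/μL concentration; that reading, the `DnaExtract` record, and the sample values are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DnaExtract:
    sample_id: str
    total_ug: float        # total DNA amount, micrograms
    conc_ng_per_ul: float  # DNA concentration, nanograms per microlitre


def screen_extracts(extracts: List[DnaExtract],
                    min_total_ug: float = 1.0,
                    min_conc_ng_per_ul: float = 20.0) -> List[DnaExtract]:
    """Keep only extracts that meet both preset thresholds (inclusive)."""
    return [
        e for e in extracts
        if e.total_ug >= min_total_ug and e.conc_ng_per_ul >= min_conc_ng_per_ul
    ]


# Made-up measurements: S02 fails on total amount, S03 fails on concentration.
extracts = [
    DnaExtract("S01", total_ug=2.4, conc_ng_per_ul=55.0),
    DnaExtract("S02", total_ug=0.6, conc_ng_per_ul=30.0),
    DnaExtract("S03", total_ug=1.5, conc_ng_per_ul=12.0),
]
passed = screen_extracts(extracts)
```

Only samples that pass this filter would go on to library construction and paired-end sequencing.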
  • paired-end sequencing is performed on the DNA data of the stool sample to obtain the metagenome data of intestinal flora.
  • Trimmomatic software is used to remove the adapters from the intestinal flora metagenomic data, and the adapter-removed intestinal flora metagenomic data is trimmed according to the preset base quality value; FastQC software is used to assess the quality of the trimmed gut microbiota metagenomic data; and performing species annotation analysis and functional annotation analysis on the gut microbiota metagenomic data includes: performing species annotation analysis and functional annotation analysis on the gut microbiota metagenomic data that has passed quality assessment.
  • fecal samples are collected from patients after receiving the project test, placed on dry ice within 30 minutes, and transferred to a -80°C freezer as soon as possible to await testing.
  • DNA was extracted, and the quality of the extracted nucleic acid was checked by the agarose gel method. The total amount of DNA was required to be ≥ 1 μg and the total DNA concentration ≥ 20 ng/μL.
  • a library was constructed for the samples of qualified quality, and the stool sample DNA was then subjected to Illumina HiSeq 4000 paired-end sequencing; the paired-end sequencing data of each sample was obtained and stored as a FASTQ file.
  • FASTQ is a text format that stores biological sequences (usually nucleic acid sequences) together with their corresponding quality scores, all encoded in ASCII; it is the de facto standard format for high-throughput sequencing.
  • Trimmomatic software was used to perform quality control on the data, i.e. trimming and removing adapters and low-quality sequences from the original data.
  • Trimmomatic is a popular Illumina platform data filtering tool that supports multi-threading and fast data processing. It is mainly used to remove adapters from FASTQ sequences and to trim FASTQ reads based on base quality values. It includes paired-end and single-end sequencing modes, supports gzip and bzip2 compressed files, and also supports conversion between the phred-33 and phred-64 formats.
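A paired-end Trimmomatic run along these lines might be assembled as below. The sketch only builds the argument list; it does not run the tool. The `PE` mode, `-threads`, `-phred33`, and the `ILLUMINACLIP`, `SLIDINGWINDOW` and `MINLEN` steps are standard Trimmomatic options, but the specific values, the jar path, the adapter file, and the output file naming are assumptions, since the text does not give the parameters actually used.

```python
from typing import List


def trimmomatic_pe_command(r1: str, r2: str, out_prefix: str,
                           adapters_fasta: str,
                           jar: str = "trimmomatic.jar",
                           threads: int = 4,
                           phred: str = "-phred33") -> List[str]:
    """Build (but do not run) a paired-end Trimmomatic command line."""
    return [
        "java", "-jar", jar, "PE",
        "-threads", str(threads), phred,
        r1, r2,
        # Trimmomatic PE writes paired and unpaired survivors for each mate.
        f"{out_prefix}_R1_paired.fq.gz", f"{out_prefix}_R1_unpaired.fq.gz",
        f"{out_prefix}_R2_paired.fq.gz", f"{out_prefix}_R2_unpaired.fq.gz",
        # Adapter removal: fasta:seed mismatches:palindrome threshold:simple threshold.
        f"ILLUMINACLIP:{adapters_fasta}:2:30:10",
        # Cut once the mean quality in a 4-base window drops below 20 (assumed values).
        "SLIDINGWINDOW:4:20",
        # Drop reads shorter than 36 bases after trimming (assumed value).
        "MINLEN:36",
    ]


cmd = trimmomatic_pe_command("s1_R1.fq.gz", "s1_R2.fq.gz", "s1", "adapters.fa")
```

The list can be passed to `subprocess.run(cmd, check=True)` on a machine where Java and Trimmomatic are installed; keeping it as a list avoids shell-quoting problems with file names.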
  • FastQC is a Java-based software that enables rapid quality assessment of sequencing data. For the filtered data, FastQC software was used to evaluate the quality of the data after quality control.
  • the quality of the FASTQ sequencing files can be judged from the FastQC assessment. If the quality is acceptable, proceed to the subsequent data analysis; otherwise, adjust the parameters and trim the paired-end sequencing data again using Trimmomatic software. Note that each base of a sequenced read has a quality value (represented by a letter or symbol whose ASCII value minus 64 gives the quality value in the phred-64 encoding), and this quality value represents the accuracy of the base call. If the quality values of a read are generally low, its average quality value is less than 20, or it contains many N bases, the read is also considered a low-quality sequence.
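The quality-value decoding and the low-quality rule can be sketched as follows. The offset of 64 and the mean-quality cutoff of 20 come from the text; the offset is a parameter because phred-33 files use 33 instead. The text says only "many N", so the 10% limit on N bases is an assumed value, as are the function names.

```python
from typing import List


def phred_scores(quality: str, offset: int = 64) -> List[int]:
    """Decode a FASTQ quality string: ASCII code minus the encoding offset."""
    return [ord(ch) - offset for ch in quality]


def is_low_quality(sequence: str, quality: str, offset: int = 64,
                   min_mean_q: float = 20.0,
                   max_n_fraction: float = 0.1) -> bool:
    """Flag a read whose mean quality is below 20 or that has too many N bases.

    `max_n_fraction` is an assumed cutoff; the text does not quantify it.
    """
    if not sequence:
        return True  # an empty read carries no usable information
    scores = phred_scores(quality, offset)
    mean_q = sum(scores) / len(scores)
    n_fraction = sequence.upper().count("N") / len(sequence)
    return mean_q < min_mean_q or n_fraction > max_n_fraction


# In phred-64, 'h' (ASCII 104) decodes to 40 and 'B' (ASCII 66) decodes to 2.
good = is_low_quality("ACGT", "hhhh")
bad_quality = is_low_quality("ACGT", "BBBB")
too_many_n = is_low_quality("ANNT", "hhhh")
```

Using the wrong offset shifts every score by 31, so confirming the encoding of the input files matters before applying any cutoff.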
  • species annotation analysis and functional annotation analysis are performed on the gut microbiota metagenomic data to obtain relative abundance information of patients with stable coronary heart disease and healthy people.
  • performing species annotation analysis and functional annotation analysis on the gut flora metagenomic data includes: downloading a gut flora database, where the gut flora database includes a plurality of reference genomes, and the reference genomes include bacteria, archaea, viruses and eukaryotes; and, according to the gut flora database, using MetaPhlAn2 software to perform species annotation analysis on the gut flora metagenomic data and using HUMAnN2 software to perform functional annotation analysis on the gut flora metagenomic data.
  • For the data after quality control, metagenomic species annotation analysis was performed using MetaPhlAn2 software.
  • MetaPhlAn2 has compiled more than 17,000 reference genomes, including 13,500 bacteria and archaea, 3,500 viruses and 110 eukaryotes.
  • the software can be used to achieve accurate taxonomic assignment and accurate calculation of the relative abundance of species. It achieves species-level precision, as well as strain-level identification and tracking.
  • the species abundance information of the gut microbiota was obtained to establish a model for evaluation.
  • the R software package vegan is used to analyze species diversity
  • the input file is gut flora species abundance data.
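As a rough illustration of what a species-diversity measure computes from abundance data, the sketch below implements the Shannon index in plain Python. The patent names the R package vegan; this Python function is a stand-in written for illustration, and the choice of the Shannon index (rather than another diversity measure) is an assumption.

```python
import math


def shannon_index(abundances):
    """Shannon diversity H = -sum(p_i * ln p_i) over species with p_i > 0."""
    total = sum(abundances)
    proportions = [a / total for a in abundances if a > 0]
    return -sum(p * math.log(p) for p in proportions)
```

A sample dominated by one species gives a value near 0, while an even community of four species gives ln 4.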
  • LEfSe LDA Effect Size
  • the characteristic data of intestinal flora in coronary heart disease here is the abundance data of differential bacterial species obtained from LEfSe analysis.
  • the characteristic data of intestinal flora is determined according to the relative abundance information and the pre-screened biomarkers of stable coronary heart disease. The biomarkers of stable coronary heart disease are pre-screened according to the relative abundance history information of differential bacteria, and the relative abundance history information of the differential bacteria is obtained by differential analysis of the relative abundance history information of stable coronary heart disease patients and healthy people.
  • the biomarkers of stable coronary heart disease are pre-screened in the following manner: using the Boruta feature selection package to perform feature selection on the historical information of the relative abundance of differential bacteria to determine the biomarkers of stable coronary heart disease.
  • the Boruta feature selection package is used to perform feature selection on the relative abundance history information of the differential bacteria as follows: create a shadow feature matrix according to the relative abundance history information of the differential bacteria; determine the real feature data and the shadow feature data according to the shadow feature matrix; according to the real feature data and the shadow feature data, determine the importance label corresponding to the relative abundance history information of each differential bacterium; and, according to the importance labels, perform feature selection on the relative abundance history information of the differential bacteria.
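The shadow-feature idea can be sketched as follows. This is a deliberately simplified illustration and not the Boruta package itself: Boruta ranks features with random-forest importances and a statistical test, whereas this sketch uses absolute Pearson correlation as a stand-in importance score and simply counts how often a real feature beats the best shuffled (shadow) feature. All function names are hypothetical.

```python
import random


def pearson_abs(xs, ys):
    """Absolute Pearson correlation; returns 0.0 for a constant input."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov) / (vx * vy) ** 0.5


def shadow_feature_hits(columns, labels, n_rounds=50, seed=0):
    """Count, per feature, the rounds in which it beats the best shadow feature."""
    rng = random.Random(seed)
    hits = {name: 0 for name in columns}
    for _ in range(n_rounds):
        shadow_scores = []
        for values in columns.values():
            shadow = list(values)
            rng.shuffle(shadow)  # shuffling breaks any link to the labels
            shadow_scores.append(pearson_abs(shadow, labels))
        threshold = max(shadow_scores)
        for name, values in columns.items():
            if pearson_abs(values, labels) > threshold:
                hits[name] += 1
    return hits
```

Features with a high hit count would be labeled important and retained; features that rarely beat the shadows would be dropped.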
  • the pre-screened biomarkers for stable coronary heart disease include: Bacteroides massiliensis, Eggerthella unclassified, Klebsiella pneumoniae, Oscillibacter unclassified, Paraprevotella unclassified, Lachnospiraceae bacterium_5_1_63FAA, Anaerostipes hadrus, Bilophila unclassified, Eubacterium ventriosum, Prevotella copri, Roseburia hominis, Barnesiella intestinihominis, Bacteroides xylanisolvens, Eubacterium hallii, Bacteroides plebeius, Megamonas unclassified, Parabacteroides distasonis, and Escherichia coli.
  • obtaining the relative abundance history information of the differential bacteria by differential analysis of the relative abundance history information of stable coronary heart disease patients and healthy people includes: using LDA Effect Size (LEfSe) software to perform the differential analysis on the relative abundance history information of stable coronary heart disease patients and healthy people.
  • the boruta algorithm is used for feature selection.
  • the goal of Boruta is to select all feature sets related to the dependent variable, rather than selecting the feature set that can minimize the model cost function for a specific model.
  • the significance of the Boruta algorithm is that it can help the present invention to more comprehensively understand the influencing factors of dependent variables, so as to perform better and more efficient feature selection.
  • Boruta is a feature selection package in python. After installing the package, input the historical information of the relative abundance of differential bacteria to obtain important features suitable for modeling.
  • the intestinal flora characteristic data is input into a pre-established machine learning model for training to obtain a stable coronary heart disease risk assessment model.
  • the parameters of the machine learning model were adjusted using the GridSearchCV algorithm and the Hyperopt algorithm.
  • the performance of the machine learning model is evaluated using the AUROC indicator.
  • inputting the intestinal flora characteristic data into a pre-established machine learning model for training includes: inputting the intestinal flora characteristic data into a pre-established LightGBM machine learning model for training.
  • GridSearchCV grid search
  • LightGBM is a more powerful and faster model than XGBoost, and its performance has been greatly improved. Compared with traditional algorithms, it has the following advantages: faster training, lower memory usage, higher accuracy, support for parallelized learning, and the ability to handle large-scale data.
  • Hyperopt is used to further optimize the parameters of the new model. Hyperopt is a tool for adjusting parameters through Bayesian optimization. This method is faster and has better results. In addition, Hyperopt combined with MongoDB can perform distributed parameter adjustment and quickly find relatively optimal parameters.
  • the lightgbm package in python is used to construct a model for LightGBM machine learning.
  • the model mainly relies on two algorithms: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB).
  • GOSS works from the perspective of reducing samples: it excludes most of the samples with small gradients and uses only the remaining samples to calculate the information gain. Each data instance has a different gradient, and according to the definition of information gain, instances with large gradients have a greater impact on it. Therefore, when sampling, the samples with large gradients are retained (those above a pre-set threshold, or in the highest percentiles), and the samples with small gradients are randomly removed.
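The sampling step can be sketched as below. This is a minimal illustration of the GOSS idea rather than LightGBM's internal code: the function name is hypothetical, and the parameter names `top_rate` and `other_rate` are borrowed from LightGBM's options of the same names. Instances kept from the small-gradient group are up-weighted by (1 - top_rate) / other_rate so that the gradient statistics remain approximately unbiased.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (indices, weights) selected by gradient-based one-side sampling."""
    n = len(gradients)
    # Sort instance indices by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    top = order[:n_top]          # always kept
    rest = order[n_top:]         # randomly subsampled
    rng = random.Random(seed)
    sampled = rng.sample(rest, min(n_other, len(rest)))
    amplify = (1.0 - top_rate) / other_rate
    indices = top + sampled
    weights = [1.0] * len(top) + [amplify] * len(sampled)
    return indices, weights
```

With ten instances and the defaults, the two largest-gradient instances are kept with weight 1 and one further instance is drawn at random with weight 8.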
  • EFB from a feature reduction perspective: bundling mutually exclusive features, that is, replacing them with a synthetic feature, especially in sparse feature spaces, where many features are almost mutually exclusive (eg, many features will not be non-zero values at the same time).
  • mutually exclusive features can be bundled, the bundled problem can be reduced to a graph coloring problem, and an approximate solution can be obtained by a greedy algorithm. More specifically, the relevant parameters can be set as follows:
  • gbdt is the gradient boosting tree
  • learning_rate is the weight reduction coefficient of each weak learner
  • num_leaves is the maximum number of leaves in each base learner (tree)
  • max_depth is the maximum depth of the decision tree
  • the value range is (0,1]
  • colsample_bytree is used to control the proportion of the number of columns randomly sampled by each tree.
  • GridSearchCV and Hyperopt are packages provided in python, and the present invention performs parameter tuning after installing these packages in python.
  • the name of GridSearchCV can actually be split into two parts, GridSearch and CV, namely grid search and cross-validation.
  • Grid search: the search is over parameters. Within the specified parameter ranges, the parameters are adjusted in turn by a fixed step size, the learner is trained with each adjusted parameter setting, and the setting with the highest accuracy on the validation set is selected from all candidates. This is essentially a process of training and comparison.
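The two halves of the name can be sketched in plain Python. This is an illustration of the idea only, not scikit-learn's GridSearchCV: `grid_search`, `k_fold_indices` and the caller-supplied `score_fn` are hypothetical names, and in practice `score_fn` would train the learner on each training fold and return the mean validation score.

```python
from itertools import product


def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k interleaved folds."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def grid_search(param_grid, score_fn):
    """Try every parameter combination and keep the best-scoring one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

The cost grows with the product of the grid sizes, which is one reason a sequential optimizer such as Hyperopt is used alongside it.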
  • Hyperopt is a Python library for "distributed asynchronous algorithm configuration/hyperparameter optimization". Using it, the present invention can obtain the best hyperparameters automatically, without going through the tedious manual hyperparameter optimization process.
  • a model with hyperparameters can generally be regarded as a non-convex function of those hyperparameters, so Hyperopt can almost always obtain more reasonable tuning results than manual tuning. Especially for models whose tuning is more complex, it can also reach the final performance far faster than manual tuning.
  • the full name of AUROC is "area under the receiver operating characteristic curve", which is often used as an index for evaluating the predictive ability of the model.
  • a binary prediction may have 4 outcomes: the invention predicts 0, and the true class is 0: this is called True Negative (TN, True Negative); the invention predicts 0, and the true class is 1: this is called false Negative (FN, False Negative); the invention predicts 1, and the true class is 0: this is called a false positive (FP, False Positive); the invention predicts 1, and the true class is 1: this is called a true positive ( TP, True Positive).
  • TPR true positive rate
  • FPR (false positive rate): FP/(FP+TN)
  • This indicator corresponds to the proportion of negative data points that were mistaken for positive data points to all negative data points. In other words, the higher the FPR, the more negative data points the present invention misclassifies.
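The four outcomes and the two rates derived from them can be computed directly. The sketch below is illustrative and uses hypothetical function names; it relies on the standard definitions TPR = TP/(TP+FN) and FPR = FP/(FP+TN).

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels coded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def tpr_fpr(y_true, y_pred):
    """Return (true positive rate, false positive rate)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn), fp / (fp + tn)
```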
  • In order to combine FPR and TPR into one indicator, the present invention first calculates these two indicators at a series of different thresholds applied to the model's predicted score (for example: 0.00, 0.01, 0.02, ..., 1.00), and then plots them as a curve, where the FPR value is the horizontal axis and the TPR value is the vertical axis.
  • the obtained curve is the ROC curve, and the index considered in the present invention is the AUC of the curve, which is called AUROC.
  • the diagonal dashed line is the ROC curve of the random predictor: AUROC is 0.5. Random predictors are often used as a baseline to test whether the model is useful. The higher the AUROC, the better the predictive ability of the model.
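A threshold sweep of this kind can be sketched as follows. This is an illustrative approximation with hypothetical function names, not the routine used in the patent: it evaluates (FPR, TPR) on a fixed threshold grid and integrates with the trapezoid rule, so it matches the exact AUROC only when no two scores of different classes fall between adjacent thresholds. Scores are assumed to lie in [0, 1], and both classes must be present.

```python
def roc_points(y_true, scores, thresholds):
    """Return the sorted set of (FPR, TPR) pairs over the given thresholds."""
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    points = {(0.0, 0.0), (1.0, 1.0)}  # curve endpoints
    for th in thresholds:
        tp = sum(1 for t, s in zip(y_true, scores) if s >= th and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= th and t == 0)
        points.add((fp / n_neg, tp / n_pos))
    return sorted(points)


def auroc(y_true, scores, n_steps=100):
    """Approximate the area under the ROC curve with the trapezoid rule."""
    thresholds = [i / n_steps for i in range(n_steps + 1)]
    pts = roc_points(y_true, scores, thresholds)
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
```

A perfectly separating score yields 1.0, and a score unrelated to the labels tends toward the 0.5 baseline of the random predictor.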
  • stable CAD group stable plaque group
  • sCAD stable CAD group
  • NCA normal coronary artery group
  • Inclusion criteria for the study population: stable coronary heart disease (old myocardial infarction, history of PCI, stable angina pectoris, or "healthy people" without clinical ischemic symptoms in whom coronary CT/angiography found coronary stenosis >50%).
  • Acute coronary syndrome ACS
  • coronary revascularization including PCI and CABG
  • Chronic intestinal diseases such as Crohn's disease, ulcerative colitis, etc.
  • Qualified samples were constructed into libraries, and Illumina HiSeq 4000 paired-end sequencing was performed. After obtaining the raw metagenomic paired-end sequencing data, Trimmomatic software was used for quality control of the data to remove low-quality sequences and adapters, and FastQC software was used to evaluate the data after quality control. For the data after quality control, metagenomic species annotation analysis was performed using MetaPhlAn2 software. After obtaining the species abundance information of the intestinal flora of cancer patients and normal people, the species diversity was analyzed, LEfSe (LDA Effect Size) was used to analyze the differences in the flora between groups, the characteristics of the intestinal flora of coronary heart disease were obtained, and a model was built at the species level for evaluation.
  • the boruta algorithm is used for feature selection.
  • Use GridSearchCV (grid search) and Hyperopt to continuously adjust the parameters and select the optimal parameters.
  • Re-acquire a batch of external data that has never been involved in modeling, use the constructed model to predict this batch of data, and use AUROC to judge the quality of the prediction model.
  • the importance of a feature is represented by its contribution to the model. All analyses were performed using the scikit-learn package in Python.
  • Figure 2 is the AUROC curve graph in the training set
  • Figure 3 is the screened biomarkers of stable coronary heart disease that play an important role in the model.
  • an embodiment of the present invention also provides a risk assessment device for stable coronary heart disease, as described in the following embodiments. Since the device solves the problem on a principle similar to that of the risk assessment method for stable coronary heart disease, the implementation of the device can refer to the implementation of the method, and the overlapping details are not repeated here.
  • FIG. 4 is a structural diagram of a risk assessment device for stable coronary heart disease in an embodiment of the present invention. As shown in FIG. 4 , the device includes:
  • the DNA data obtaining module 401 is used for obtaining DNA data of stool samples of patients with stable coronary heart disease and healthy people;
  • the paired-end sequencing processing module 402 is configured to perform paired-end sequencing processing on the fecal sample DNA data to obtain intestinal flora metagenomic data;
  • An annotation analysis module 403 configured to perform species annotation analysis and functional annotation analysis on the intestinal flora metagenomic data, to obtain relative abundance information of patients with stable coronary heart disease and healthy people;
  • the characteristic data determination module 404 is used for determining the characteristic data of intestinal flora according to the relative abundance information and the pre-screened biomarkers of stable coronary heart disease, where the biomarkers of stable coronary heart disease are pre-screened according to the relative abundance history information of differential bacteria, and the relative abundance history information of the differential bacteria is obtained by performing differential analysis on the relative abundance history information of stable coronary heart disease patients and healthy people;
  • the model training module 405 is used to input the intestinal flora characteristic data into a pre-established machine learning model for training to obtain a stable coronary heart disease risk assessment model;
  • a model testing module 407 configured to use the test data to test the parameter-adjusted machine learning model
  • a performance evaluation module 408, configured to perform performance evaluation on the machine learning model by using the AUROC indicator according to the test result
  • the risk assessment module 409 is configured to perform risk assessment of stable coronary heart disease by using the stable coronary heart disease risk assessment model qualified for performance evaluation.
  • biomarkers of stable coronary heart disease are pre-screened as follows:
  • the Boruta feature selection package was used to perform feature selection on the historical information of relative abundance of differential bacteria to determine the biomarkers of stable coronary heart disease.
  • the Boruta feature selection package is used to perform feature selection on the relative abundance history information of the differential bacteria in the following manner:
  • feature selection is performed on the relative abundance history information of differential bacteria.
  • the biomarkers for stable coronary heart disease of the present invention include: Bacteroides massiliensis, Eggerthella unclassified, Klebsiella pneumoniae, Oscillibacter unclassified, Paraprevotella unclassified, Lachnospiraceae bacterium_5_1_63FAA, Anaerostipes hadrus, Bilophila unclassified, Eubacterium ventriosum, Prevotella copri, Roseburia hominis, Barnesiella intestinihominis, Bacteroides xylanisolvens, Eubacterium hallii, Bacteroides plebeius, Megamonas unclassified, Parabacteroides distasonis, and Escherichia coli.
  • Each biomarker is a risk factor for the onset of stable coronary heart disease, and the importance of the features used to assess the risk of stable coronary heart disease is shown in Figure 3.
  • Figure 5 shows the AUROC curve of the model for assessing the risk of acute coronary syndrome onset that is obtained when, on the basis of some of the intestinal flora characteristic factors of the present invention, total cholesterol level, diabetes mellitus and age (factors traditionally considered to be closely related to stable coronary heart disease) are further integrated. It can be seen that after integrating total cholesterol level, diabetes and age on the basis of some of the intestinal flora characteristic factors, the strength of the association with the risk of stable coronary heart disease does not increase significantly, which indicates that the intestinal flora characteristic factors of the present invention can be used to assess the risk of stable coronary heart disease independently of traditional clinical risk factors (total cholesterol level, diabetes, and age).
  • Embodiment 2 Risk Assessment of Acute Coronary Syndrome
  • the embodiment of the present invention provides a method for establishing a risk prediction model of acute coronary syndrome, and the method may include:
  • Species annotation analysis and functional annotation analysis were performed on the metagenomic data of gut microbiota qualified for quality assessment, and the relative abundance information of patients with acute coronary syndrome and healthy people was obtained;
  • the characteristic data of intestinal flora is determined, and the biomarkers of acute coronary syndrome are determined according to the historical information of relative abundance of differential bacteria.
  • the relative abundance history information of the differential bacteria is obtained by performing differential analysis on the relative abundance history information of patients with acute coronary syndrome and healthy people;
  • the intestinal flora characteristic data is input into a pre-established machine learning model for training, and an acute coronary syndrome risk prediction model is obtained.
  • a method for risk prediction of acute coronary syndrome comprising:
  • Risk prediction of acute coronary syndrome is performed using the acute coronary syndrome risk prediction model.
  • obtaining the relative abundance history information of the differential bacteria by differential analysis of the relative abundance history information of acute coronary syndrome patients and healthy people includes: using LDA Effect Size (LEfSe) software to perform the differential analysis on the relative abundance history information of patients with acute coronary syndrome and healthy people.
  • Qualified samples were constructed into libraries, and Illumina HiSeq 4000 paired-end sequencing was performed. After obtaining the raw metagenomic paired-end sequencing data, Trimmomatic software was used for quality control of the data to remove low-quality sequences and adapters, and FastQC software was used to evaluate the data after quality control. For the data after quality control, metagenomic species annotation analysis was performed using MetaPhlAn2 software. After obtaining the species abundance information of the intestinal flora of cancer patients and normal people, the species diversity was analyzed, LEfSe (LDA Effect Size) was used to analyze the differences in the flora between groups, the characteristics of the intestinal flora of patients with acute coronary syndrome were obtained, and models were built at the species level for evaluation.
  • the boruta algorithm is used for feature selection.
  • Use GridSearchCV (grid search) and Hyperopt to continuously adjust the parameters and select the optimal parameters. Re-acquire a batch of external data that has never been involved in modeling, use the constructed model to evaluate this batch of data, and use AUROC to judge the quality of the evaluation model. The importance of a feature is represented by its contribution to the model. All analyses were performed using the scikit-learn package in Python.
  • Figure 6 is the AUROC curve graph in the training set
  • Figure 7 is the screened biomarkers of acute coronary syndrome that play an important role in the model.
  • Each biomarker is a risk factor for the onset of acute coronary syndrome, and the importance of the features used to assess the risk of acute coronary syndrome is shown in Figure 7.
  • Figure 8 shows the AUROC curve of the model for assessing the risk of acute coronary syndrome onset that is obtained when, on the basis of the intestinal flora characteristic factors of the present invention, total cholesterol level, age and hypertension (factors traditionally considered to be closely related to the risk of acute coronary syndrome) are further integrated. Comparing it with Figure 6, it can be seen that after further integrating total cholesterol level, age and hypertension, the strength of the association with the risk of acute coronary syndrome does not increase significantly, which indicates that the intestinal flora characteristic factors of the present invention can be used to assess the risk of developing acute coronary syndrome independently of traditional clinical risk factors (total cholesterol level, age, and hypertension).
  • Embodiment 3 Risk assessment of acute coronary syndrome for stable coronary heart disease
  • the embodiment of the present invention provides a method for establishing an acute coronary syndrome risk prediction model for stable coronary heart disease, the method may include:
  • Paired-end sequencing is performed on the DNA data of the screened fecal samples to obtain the metagenome data of intestinal flora;
  • the characteristic data of intestinal flora is determined, and the biomarkers of acute coronary syndrome are determined according to the historical information of relative abundance of differential bacteria.
  • the relative abundance history information of the differential bacteria is obtained by differentially analyzing the relative abundance history information of patients with acute coronary syndrome and patients with stable coronary heart disease;
  • the intestinal flora characteristic data is input into a pre-established machine learning model for training, and an acute coronary syndrome risk prediction model is obtained.
  • a method for risk assessment of acute coronary syndrome for stable coronary heart disease comprising:
  • the acute coronary syndrome risk prediction for stable coronary heart disease is performed using the acute coronary syndrome risk prediction model.
  • obtaining the relative abundance history information of the differential bacteria by differential analysis of the relative abundance history information of patients with acute coronary syndrome and patients with stable coronary heart disease includes: using LDA Effect Size (LEfSe) software to perform the differential analysis on the relative abundance history information of patients with acute coronary syndrome and patients with stable coronary heart disease.
  • Stable CAD group stable plaque group
  • stable CAD group stable CAD group
  • Qualified samples were constructed into libraries, and Illumina HiSeq 4000 paired-end sequencing was performed. After obtaining the raw metagenomic paired-end sequencing data, Trimmomatic software was used for quality control of the data to remove low-quality sequences and adapters, and FastQC software was used to evaluate the data after quality control. For the data after quality control, metagenomic species annotation analysis was performed using MetaPhlAn2 software. After obtaining the species abundance information of the intestinal flora of cancer patients and normal people, the species diversity was analyzed, LEfSe (LDA Effect Size) was used to analyze the differences in the flora between groups, the characteristics of the intestinal flora of patients with acute coronary syndrome were obtained, and models were built at the species level to make predictions.
  • the boruta algorithm is used for feature selection.
  • Use GridSearchCV (grid search) and Hyperopt to continuously adjust the parameters and select the optimal parameters. Re-acquire a batch of external data that has never been involved in modeling, use the constructed model to predict this batch of data, and use AUROC to judge the quality of the prediction model. The importance of a feature is represented by its contribution to the model. All analyses were performed using the scikit-learn package in Python.
  • FIG. 9 is the AUROC curve graph in the training set
  • FIG. 10 is the screened biomarkers of acute coronary syndrome that play an important role in the model.
  • Each biomarker is a risk factor for the onset of acute coronary syndrome, and the importance of the features used to assess the risk of acute coronary syndrome is shown in Figure 10. The greater the difference in the expression abundance of one or more biomarkers compared to healthy individuals, the higher the risk of developing acute coronary syndrome in individuals.
  • Figure 11 shows the AUROC curve of the model for assessing the risk of acute coronary syndrome that is obtained when, on the basis of the intestinal flora characteristic factors of the present invention, total cholesterol level and age (factors traditionally considered to be closely related to the risk of acute coronary syndrome) are further integrated. Comparing it with Figure 9, it can be seen that after further integrating total cholesterol level and age, the strength of the association with the risk of acute coronary syndrome does not increase significantly, which indicates that the intestinal flora characteristic factors of the present invention can be used, independently of traditional clinical risk factors, to assess the risk of developing acute coronary syndrome in stable coronary heart disease.
  • embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising instruction means, and the instruction means implements the functions specified in the flow or flows of the flowcharts and/or the block or blocks of the block diagrams.

Abstract

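Gut microbiota markers for assessing the risk of cardiovascular disease onset, and their application. The markers comprise gut microbiota information; the gut microbiota comprises at least 10 gut bacterial species, and these species are differential species screened from the gut microbiota metagenomic data of cardiovascular disease patients and healthy people. Using gut microbiota information as a marker for assessing the risk of cardiovascular disease onset can improve the accuracy of the assessment.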

Description

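Gut microbiota markers for assessing the risk of cardiovascular disease onset, and their application
Technical Field
The present invention belongs to the field of biomedical technology. Specifically, it relates to disease detection technology that uses the gut microbiota as a marker for assessing the risk of cardiovascular disease onset, and to related applications.
Background Art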
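Cardiovascular disease here mainly refers to coronary atherosclerotic heart disease, abbreviated as coronary heart disease (coronary artery disease, CAD). The current mainstream view is that cardiovascular diseases, including coronary heart disease, are immunometabolic diseases and also systemic, progressive, inflammatory diseases. The main lesion is the formation and inflammatory progression of atherosclerotic plaques; its essential features include the non-bacterial inflammatory response produced by lipid deposition and the accumulation of inflammatory cells, known as metabolic inflammation. This is because various inflammatory cells and large amounts of inflammatory mediators are involved throughout the many stages of plaque formation and progression, from fatty streaks to atheromatous plaques and on to rupture and thrombosis. Owing to the dynamic and complex nature of coronary heart disease, the mechanisms by which inflammatory unstable plaques form, progress, and rupture remain unclear. Therefore, clarifying the initiating factors or causes of inflammatory instability in coronary plaques, and finding effective ways to intervene in the inflammatory process at its source, would help to prevent the occurrence, progression, and rupture of inflammatory unstable coronary plaques and sudden acute coronary syndrome events, and would greatly reduce the incidence and mortality of cardiovascular disease; it would also have great and far-reaching social significance and scientific value for safeguarding people's lives and health.
Traditionally, total cholesterol (TC), hypertension, diabetes, age, and the like are all considered risk factors associated with cardiovascular disease, but because these factors vary greatly between individuals, they are difficult to use accurately for assessing an individual's cardiovascular disease risk.
On the other hand, the intestinal mucosa is the body's largest immunologically active organ. The tens of billions of bacteria residing in the gut are called the "gut microbiota", and the host provides the gut microbiota with a suitable environment and the necessary nutrients. In turn, the gut microbiota participates in regulating various functions of the human body, such as providing metabolic nutrients to the host, promoting growth and immune regulation, eliminating pathogenic microorganisms, and maintaining the integrity of the intestinal barrier and normal homeostasis. Recent studies have found that the gut microbiota plays an upstream regulatory role in human immune-inflammatory and metabolic diseases and is closely related to metabolic inflammation and insulin resistance, atherosclerosis, obesity, diabetes, and other diseases; its role as an upstream regulator of the onset and progression of coronary heart disease is also beginning to emerge. Studies have indicated that patients with coronary heart disease have gut dysbiosis, manifested as increased proportions of Escherichia coli, Streptococcus, and Helicobacter pylori. The gut microbiota can promote atherosclerosis through multiple pathways, including metabolic pathways and inflammatory responses.
However, the prior art contains no reports of assessing the risk of cardiovascular disease onset through the characteristics of the gut microbiota. In addition, with the rapid development of metagenomics and other sequencing technologies, massive amounts of data have been generated. It is therefore very important to mine, from this large and redundant biological data, biomarkers capable of assessing cardiovascular disease risk and to achieve accurate cardiovascular disease risk assessment.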
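Summary of the Invention
One object of the present invention is to provide a set of markers associated with the risk of cardiovascular disease onset.
Another object of the present invention is to provide a method for establishing a model for assessing the risk of cardiovascular disease onset.
Another object of the present invention is to provide a model for assessing the risk of cardiovascular disease onset.
Another object of the present invention is to provide a device for assessing the risk of cardiovascular disease onset.
Another object of the present invention is to provide a method for assessing the risk of cardiovascular disease onset.
Through extensive research and practical detection and analysis experiments, the inventors have identified a set of biomarkers associated with the risk of cardiovascular disease onset, comprising multiple gut bacteria. By detecting information on these gut bacteria in a sample from an individual, the individual's risk of cardiovascular disease onset can be assessed well.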
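Specifically, in one aspect, the present invention provides the use of a reagent for detecting individual information in the preparation of a device (assessment system) for assessing the risk of cardiovascular disease onset, wherein the individual information comprises gut microbiota information, the gut microbiota comprises at least 10 gut bacterial species, and the species are differential species screened from the gut microbiota metagenomic data of cardiovascular disease patients and healthy people.
According to specific embodiments of the present invention, in the use of the present invention, the cardiovascular disease is stable coronary heart disease, acute coronary syndrome, or acute coronary syndrome in patients with stable coronary heart disease.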
According to specific embodiments of the present invention, in the use of the present invention, (1) when the cardiovascular disease is stable coronary heart disease, the gut microbiota comprises: Bacteroides massiliensis, Eggerthella unclassified, Klebsiella pneumoniae, Oscillibacter unclassified, Paraprevotella unclassified, Lachnospiraceae bacterium_5_1_63FAA, Anaerostipes hadrus, Bilophila unclassified, Roseburia hominis, Eubacterium ventriosum, Prevotella copri, Barnesiella intestinihominis, Bacteroides xylanisolvens, Eubacterium hallii, Megamonas unclassified, Bacteroides plebeius, Parabacteroides distasonis, and Escherichia coli.
According to specific embodiments of the present invention, in the use of the present invention, (2) when the cardiovascular disease is acute coronary syndrome, the gut microbiota comprises: Bifidobacterium longum, Lachnospiraceae bacterium_5_1_63FAA, Alistipes onderdonkii, Collinsella aerofaciens, Eubacterium eligens, Faecalibacterium prausnitzii, Bacteroides vulgatus, Oscillibacter unclassified, Bacteroides ovatus, and Eubacterium ventriosum.
According to specific embodiments of the present invention, in the use of the present invention, (3) when the cardiovascular disease is acute coronary syndrome in patients with stable coronary heart disease, the gut microbiota comprises: Bifidobacterium longum, Streptococcus anginosus, Coprococcus comes, Collinsella aerofaciens, Faecalibacterium prausnitzii, Bacteroides ovatus, Anaerotruncus colihominis, Bacteroides fragilis, Holdemania filiformis, Eubacterium rectale, and Streptococcus salivarius.
根据本发明的具体实施方案,本发明的应用中,所述肠道菌群中各菌在评估稳定型冠心病发病风险时的特征重要度,马赛拟杆菌﹥未分类伊格尔兹氏菌﹥肺炎克雷伯菌﹥未分类梭状杆菌=未分类副雷沃菌﹥毛螺旋菌科_5_1_63FAA﹥粪厌氧棒状菌﹥未分类嗜胆汁菌﹥人罗斯拜瑞氏菌=腹真杆菌=人体普氏菌﹥肠巴氏杆菌﹥木茴香类杆菌=真杆菌﹥巨单胞菌未分类=胸膜类杆菌=副杆菌﹥大肠杆菌。
根据本发明的具体实施方案,本发明的应用中,所述肠道菌群中各菌在评估稳定型冠心病发病风险时,所述肠道菌群中各菌按照以下特征重要度数值确定权重,或者,所述肠道菌群中各菌的权重比值为:马赛拟杆菌,23;未分类伊格尔兹氏菌,19;肺炎克雷伯菌,16;未分类梭状杆菌,15;未分类副雷沃菌,15;毛螺旋菌科_5_1_63FAA,13;粪厌氧棒状菌,11;未分类嗜胆汁菌,10;人罗斯拜瑞氏菌,8;腹真杆菌,8;人体普氏菌,8;肠巴氏杆菌,6;木茴香类杆菌,5;真杆菌,5;巨单胞菌未分类,4;胸膜类杆菌,4;副杆菌,4;大肠杆菌,1。
根据本发明的具体实施方案,本发明的应用中,所述肠道菌群中各菌在评估急性冠脉综合征发病风险时的特征重要度,长双歧杆菌﹥毛螺旋菌科_5_1_63FAA﹥另枝菌属﹥产气柯林斯菌﹥真杆菌﹥普氏栖粪杆菌﹥普通拟杆菌=未分类颤杆菌﹥卵形拟杆菌﹥凸腹真桿菌。
根据本发明的具体实施方案,本发明的应用中,所述肠道菌群中各菌在评估急性冠脉综合征发病风险时,按照以下特征重要度确定权重:长双歧杆菌,47;毛螺旋菌科_5_1_63FAA,44;另枝菌属,43;产气柯林斯菌,32;真杆菌,31;普氏栖粪杆菌,30;普通拟杆菌,28;未分类颤杆菌,28;卵形拟杆菌,20;凸腹真桿菌,14。
根据本发明的具体实施方案,本发明的应用中,所述肠道菌群中各菌在评估针对稳定型冠心病的急性冠脉综合征风险时的特征重要度,长双歧杆菌﹥咽峡炎链球菌﹥陪伴粪球菌=产气柯林斯菌﹥普氏栖粪杆菌﹥卵形拟杆菌=厌氧棍状菌属﹥脆弱拟杆菌﹥霍尔德曼氏菌﹥直肠真杆菌=唾液链球菌。
根据本发明的具体实施方案,本发明的应用中,所述肠道菌群中各菌在评估针对稳定型冠心病的急性冠脉综合征风险时,所述肠道菌群中各菌按照以下特征重要度数值确定权重,或者,所述肠道菌群中各菌的权重比值为:长双歧杆菌,13;咽峡炎链球菌,11;陪伴粪球菌,10;产气柯林斯菌,10;普氏栖粪杆菌,9;卵形拟杆菌,8;厌氧棍状菌属,8;脆弱拟杆菌,7;霍尔德曼氏菌,6;直肠真杆菌,4;唾液链球菌,4。
根据本发明的具体实施方案,本发明的应用中,本发明作为标志物的所述各肠道菌群均为针对心血管病的发病风险因素。各风险因素的异常程度越高(各肠道菌菌相比于健康人的表达丰度差异越大),个体心血管病发病风险越高。
根据本发明的一些优选具体实施方案,本发明的应用中,所述个体信息还可进一步包括总胆固醇水平、高血压、糖尿病、年龄中的一项或多项。
根据本发明的具体实施方案,本发明的技术特别适用于对来自东亚人群的个体进行稳定型冠心病发病风险评估。
另一方面,本发明提供了一种心血管病发病风险评估装置,其包括检测单元和数据分析单元,其中:
所述检测单元用于检测个体信息,获得检测结果;其中,所述个体信息同权利要求1-5任一项中所述个体信息;
所述数据分析单元用于对检测单元的检测结果进行分析处理。
本发明的针对稳定型冠心病的急性冠脉综合征风险评估装置中,所述检测单元包括可获得待测个体肠道菌群中各特征菌信息的任何试剂材料,可以采用现有技术中任何可行的方法检测待测个体肠道菌群中各特征菌的信息。
具体地,本发明所述的心血管病发病风险评估装置中,所述检测单元包括检测粪便样本DNA数据的试剂材料。
具体地,本发明所述的心血管病发病风险评估装置中,所述数据分析单元用于对检测单元的检测结果进行分析处理的过程包括:
对所述粪便样本DNA数据进行双端测序处理,得到肠道菌群宏基因组数据;
对所述肠道菌群宏基因组数据进行物种注释分析和功能注释分析,得到肠道菌群中各菌的相对丰度信息;
根据所述相对丰度信息,确定肠道菌群特征数据。
具体地,本发明所述的心血管病发病风险评估装置中,所述数据分析单元对检测单元的检测结果进行分析处理时,包括:将个体信息的检测结果配以权重系数,以计算所述待测个体的风险评估得分。
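The weighting step described above can be illustrated with a minimal sketch. The weights are the feature-importance values this document lists for the stable coronary heart disease panel. The scoring rule itself (a weighted mean of per-species deviation values scaled to [0, 1]) and the names `SCAD_WEIGHTS` and `weighted_risk_score` are assumptions made for illustration; the document does not specify how the score is computed.

```python
# Feature-importance values for the stable coronary heart disease panel,
# as listed in this document. Using them as linear weights is an assumption.
SCAD_WEIGHTS = {
    "Bacteroides massiliensis": 23,
    "Eggerthella unclassified": 19,
    "Klebsiella pneumoniae": 16,
    "Oscillibacter unclassified": 15,
    "Paraprevotella unclassified": 15,
    "Lachnospiraceae bacterium_5_1_63FAA": 13,
    "Anaerostipes hadrus": 11,
    "Bilophila unclassified": 10,
    "Roseburia hominis": 8,
    "Eubacterium ventriosum": 8,
    "Prevotella copri": 8,
    "Barnesiella intestinihominis": 6,
    "Bacteroides xylanisolvens": 5,
    "Eubacterium hallii": 5,
    "Megamonas unclassified": 4,
    "Bacteroides plebeius": 4,
    "Parabacteroides distasonis": 4,
    "Escherichia coli": 1,
}


def weighted_risk_score(deviation, weights=SCAD_WEIGHTS):
    """Weighted mean of per-species deviation values.

    `deviation` maps a species name to a hypothetical value in [0, 1]
    describing how far its abundance departs from the healthy reference.
    Species missing from `deviation` contribute 0.
    """
    total = sum(weights.values())
    return sum(w * deviation.get(name, 0.0) for name, w in weights.items()) / total
```

In the embodiments the actual assessment is produced by a trained LightGBM model, so this linear score is only a simplified picture of what "assigning weight coefficients" could mean.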
根据本发明的一些具体实施方案,本发明实施例提供一种建立稳定型冠心病的发病风险评估(预测)模型的方法,以将所建立的模型用以对稳定型冠心病进行发病风险评估,提高评估准确率,该方法包括:
获得稳定型冠心病患者和健康人群的粪便样本DNA数据;
对所述粪便样本DNA数据进行双端测序处理,得到肠道菌群宏基因组数据;
对所述肠道菌群宏基因组数据进行物种注释分析和功能注释分析,得到稳定型冠心病患者和健康人群的相对丰度信息;
根据所述相对丰度信息和预先筛选的稳定型冠心病的生物标志物,确定肠道菌群特征数据,所述稳定型冠心病的生物标志物是根据差异菌相对丰度历史信息进行预先筛选的,所述差异菌相对丰度历史信息是对稳定型冠心病患者和健康人群的相对丰度历史信息进行差异分析得到的;
将所述肠道菌群特征数据输入预先建立的机器学习模型中进行训练,得到稳定型冠心病风险评估模型。
根据本发明的具体实施方案,本发明实施例提供的建立稳定型冠心病的发病风险评估模型的方法还包括:
利用GridSearchCV算法和Hyperopt算法对所述机器学习模型进行参数调整;
利用测试数据对参数调整后的机器学习模型进行测试;
根据测试的结果,利用AUROC指标对机器学习模型进行性能评价。
根据本发明的一些具体实施方案,本发明还提供了利用性能评价合格的稳定型冠心病发病风险评估模型进行稳定型冠心病的发病风险评估的方法。
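An embodiment of the present invention provides a device for building a model for assessing the onset risk of stable coronary heart disease, so that the model can be used for risk assessment of stable coronary heart disease and improve assessment accuracy. The device includes: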
DNA数据获得模块,用于获得稳定型冠心病患者和健康人群的粪便样本DNA数据;
双端测序处理模块,用于对所述粪便样本DNA数据进行双端测序处理,得到肠道菌群宏基因组数据;
注释分析模块,用于对所述肠道菌群宏基因组数据进行物种注释分析和功能注释分析,得到稳定型冠心病患者和健康人群的相对丰度信息;
特征数据确定模块,用于根据所述相对丰度信息和预先筛选的稳定型冠心病的生物标志物,确定肠道菌群特征数据,所述稳定型冠心病的生物标志物是根据差异菌相对丰度历史信息进行预先筛选的,所述差异菌相对丰度历史信息是对稳定型冠心病患者和健康人群的相对丰度历史信息进行差异分析得到的;
模型训练模块,用于将所述肠道菌群特征数据输入预先建立的机器学习模型中进行训练,得到稳定型冠心病风险评估模型。
根据本发明的具体实施方案,本发明提供的用于建立稳定型冠心病的发病风险评估模型的装置还包括:
参数调整模块,用于利用GridSearchCV算法和Hyperopt算法对所述机器学习模型进行参数调整;
模型测试模块,用于利用测试数据对参数调整后的机器学习模型进行测试;
性能评价模块,用于根据测试的结果,利用AUROC指标对机器学习模型进行性能评价。
根据本发明的一些具体实施方案,本发明实施例提供一种建立急性冠脉综合征的风险预测模型的方法,以将所建立的模型用以对急性冠脉综合征进行风险预测,提高预测准确率,该方法包括:
获得急性冠脉综合征患者和健康人群的粪便样本DNA数据;
对所述粪便样本DNA数据进行双端测序处理,得到肠道菌群宏基因组数据;
利用Trimmomatic软件去除肠道菌群宏基因组数据中的接头,并根据预先设定的碱基质量值,对去除接头的肠道菌群宏基因组数据进行修剪;
利用FastQC软件对修剪后的肠道菌群宏基因组数据进行质量评估;
对质量评估合格的肠道菌群宏基因组数据进行物种注释分析和功能注释分析,得到急性冠脉综合征患者和健康人群的相对丰度信息;
根据所述相对丰度信息和预先筛选的急性冠脉综合征的生物标志物,确定肠道菌群特征数据,所述急性冠脉综合征的生物标志物是根据差异菌相对丰度历史信息进行预先筛选的,所述差异菌相对丰度历史信息是对急性冠脉综合征患者和健康人群的相对丰度历史信息进行差异分析得到的;
将所述肠道菌群特征数据输入预先建立的机器学习模型中进行训练,得到急性冠脉综合征风险预测模型。
根据本发明的一些具体实施方案,本发明还提供了利用所述急性冠脉综合征风险预测模型进行急性冠脉综合征的风险预测的方法。
本发明一些实施方案中,还提供一种用于建立急性冠脉综合征的风险预测模型的装置,以将所建立的模型用以对急性冠脉综合征进行风险预测,提高预测准确率,该装置包括:
DNA数据获得模块,用于获得急性冠脉综合征患者和健康人群的粪便样本DNA数据;
双端测序处理模块,用于对所述粪便样本DNA数据进行双端测序处理,得到肠道菌群宏基因组数据;
数据修剪模块,用于利用Trimmomatic软件去除肠道菌群宏基因组数据中的接头,并根据预先设定的碱基质量值,对去除接头的肠道菌群宏基因组数据进行修剪;
质量评估模块,用于利用FastQC软件对修剪后的肠道菌群宏基因组数据进行质量评估;
注释分析模块,用于对质量评估合格的肠道菌群宏基因组数据进行物种注释分析和功能注释分析,得到急性冠脉综合征患者和健康人群的相对丰度信息;
特征数据确定模块,用于根据所述相对丰度信息和预先筛选的急性冠脉综合征的生物标志物,确定肠道菌群特征数据,所述急性冠脉综合征的生物标志物是根据差异菌相对丰度历史信息进行预先筛选的,所述差异菌相对丰度历史信息是对急性冠脉综合征患者和健康人群的相对丰度历史信息进行差异分析得到的;
模型训练模块,用于将所述肠道菌群特征数据输入预先建立的机器学习模型中进行训练,得到急性冠脉综合征风险预测模型。
根据本发明的一些具体实施方案,本发明实施例提供一种建立针对稳定型冠心病的急性冠脉综合征风险预测(评估)模型的方法,以将所建立的模型用以对急性冠脉综合征进行风险预测,提高预测准确率,该方法包括:
获得急性冠脉综合征患者和稳定型冠心病患者的粪便样本DNA数据;
利用琼脂糖凝胶方法确定所述粪便样本DNA数据的总量数据和总浓度数据;
将所述总量数据与总浓度数据与预设阈值进行比较,根据比较的结果对所述粪便样本DNA数据进行筛选;
对筛选出的粪便样本DNA数据进行双端测序处理,得到肠道菌群宏基因组数据;
对所述肠道菌群宏基因组数据进行物种注释分析和功能注释分析,得到急性冠脉综合征患者和稳定型冠心病患者的相对丰度信息;
根据所述相对丰度信息和预先筛选的急性冠脉综合征的生物标志物,确定肠道菌群特征数据,所述急性冠脉综合征的生物标志物是根据差异菌相对丰度历史信息进行预先筛选的,所述差异菌相对丰度历史信息是对急性冠脉综合征患者和稳定型冠心病患者的相对丰度历史信息进行差异分析得到的;
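inputting the gut microbiota feature data into a pre-established machine learning model for training, to obtain an acute coronary syndrome risk prediction model.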
根据本发明的一些具体实施方案,本发明还提供了利用所述急性冠脉综合征风险预测模型进行针对稳定型冠心病的急性冠脉综合征风险预测的方法。
本发明实施例提供一种用于建立针对稳定型冠心病的急性冠脉综合征风险预测模型的装置,以将所建立的模型用以对急性冠脉综合征进行风险预测,提高预测准确率,该装置包括:
DNA数据获得模块,用于获得急性冠脉综合征患者和稳定型冠心病患者的粪便样本DNA数据;
浓度数据确定模块,用于利用琼脂糖凝胶方法确定所述粪便样本DNA数据的总量数据和总浓度数据;
DNA数据筛选模块,用于将所述总量数据与总浓度数据与预设阈值进行比较,根据比较的结果对所述粪便样本DNA数据进行筛选;
双端测序处理模块,用于对筛选出的粪便样本DNA数据进行双端测序处理,得到肠道菌群宏基因组数据;
注释分析模块,用于对所述肠道菌群宏基因组数据进行物种注释分析和功能注释分析,得到急性冠脉综合征患者和稳定型冠心病患者的相对丰度信息;
特征数据确定模块,用于根据所述相对丰度信息和预先筛选的急性冠脉综合征的生物标志物,确定肠道菌群特征数据,所述急性冠脉综合征的生物标志物是根据差异菌相对丰度历史信息进行预先筛选的,所述差异菌相对丰度历史信息是对急性冠脉综合征患者和稳定型冠心病患者的相对丰度历史信息进行差异分析得到的;
模型训练模块,用于将所述肠道菌群特征数据输入预先建立的机器学习模型中进行训练,得到急性冠脉综合征风险预测模型。
本发明的另一些实施方案中,还提供了一种心血管病发病风险评估装置,其包括:风险评估模块,用于利用性能评价合格的心血管病发病风险评估模型进行心血管病发病风险评估。
另一方面,本发明还提供了另一种计算机设备,其包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,其中,所述处理器执行所述计算机程序时实现:基于待测个体信息获得个体心血管病发病风险评估结果;
其中,所述个体信息同本发明前述个体信息。
另一方面,本发明还提供了另一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序指令,所述计算机程序指令被执行时实现:基于待测个体信息获得个体心血管病发病风险评估结果;
其中,所述个体信息同本发明前述个体信息。
本发明实施例充分考虑到心血管病患者的肠道菌群特征,利用机器学习算法从复杂、繁冗的生物大数据中筛选可用于评估及监测心血管病发病风险的、无创的生物标志物,提高评估准确率,弥补了心血管病临床预警的空白。
附图说明
为了更清楚地说明本发明实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本发明的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。在附图中:
图1为本发明实施例中稳定型冠心病的风险评估方法示意图;
图2为本发明实施例中训练集中的AUROC曲线图;
图3为本发明实施例中筛到的对模型起重要作用的稳定型冠心病的生物标志物示意图;
图4为本发明实施例中稳定型冠心病的风险评估装置结构图。
图5为另一实施方案中的稳定型冠心病发病风险评估模型的AUROC曲线图。
图6为本发明一具体实施例中急性冠脉综合征的风险预测训练集中的AUROC曲线图。
图7为本发明一具体实施例中筛到的对模型起重要作用的急性冠脉综合征的生物标志物示意图。
图8为本发明另一具体实施例的急性冠脉综合征风险评估模型的AUROC曲线图。
图9为本发明实施例中针对稳定型冠心病的急性冠脉综合征风险预测训练集中的AUROC曲线图。
图10为本发明实施例中筛到的对模型起重要作用的针对稳定型冠心病的急性冠脉综合征的生物标志物示意图。
图11为本发明另一具体实施例的针对稳定型冠心病的急性冠脉综合征风险评估模型的AUROC曲线图。
具体实施方式
为使本发明实施例的目的、技术方案和优点更加清楚明白,下面结合附图对本发明实施例做进一步详细说明。在此,本发明的示意性实施例及其说明用于解释本发明,但并不作为对本发明的限定。
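As noted above, the rapid development of metagenomics and other sequencing technologies has produced massive amounts of data. Mining useful information from this large and redundant biological data, for use in disease assessment and as diagnostic indicators, has long been a challenging task. With the arrival of the big-data era, researchers have developed many algorithms for mining life-science data, and for biomarker-based diagnostic models machine learning algorithms are essential. Machine learning covers many methods, such as linear regression and random forests. Different algorithms suit different situations and conditions, and their results are easily affected by individual differences between biological samples, by experimental methods, and by similar factors.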
实施方案一:稳定型冠心病发病风险评估
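To assess the risk of stable coronary heart disease and improve assessment accuracy, an embodiment of the present invention provides a method for building a risk assessment model for stable coronary heart disease. As shown in Figure 1, the method may include: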
步骤101、获得稳定型冠心病患者和健康人群的粪便样本DNA数据;
步骤102、对所述粪便样本DNA数据进行双端测序处理,得到肠道菌群宏基因组数据;
步骤103、对所述肠道菌群宏基因组数据进行物种注释分析和功能注释分析,得到稳定型冠心病患者和健康人群的相对丰度信息;
步骤104、根据所述相对丰度信息和预先筛选的稳定型冠心病的生物标志物,确定肠道菌群特征数据,所述稳定型冠心病的生物标志物是根据差异菌相对丰度历史信息进行预先筛选的,所述差异菌相对丰度历史信息是对稳定型冠心病患者和健康人群的相对丰度历史信息进行差异分析得到的;
步骤105、将所述肠道菌群特征数据输入预先建立的机器学习模型中进行训练,得到稳定型冠心病风险评估模型;
步骤106、利用GridSearchCV算法和Hyperopt算法对所述机器学习模型进行参数调整;
步骤107、利用测试数据对参数调整后的机器学习模型进行测试;
步骤108、根据测试的结果,利用AUROC指标对机器学习模型进行性能评价。
进一步,本发明还提供了一种稳定型冠心病发病风险评估方法,该方法包括:
步骤109、利用性能评价合格的稳定型冠心病风险评估模型进行稳定型冠心病的发病风险评估。
由图1所示可以得知,本发明实施例通过获得稳定型冠心病患者和健康人群的粪便样本DNA数据;对所述粪便样本DNA数据进行双端测序处理,得到肠道菌群宏基因组数据;对所述肠道菌群宏基因组数据进行物种注释分析和功能注释分析,得到稳定型冠心病患者和健康人群的相对丰度信息;根据所述相对丰度信息和预先筛选的稳定型冠心病的生物标志物,确定肠道菌群特征数据,所述稳定型冠心病的生物标志物是根据差异菌相对丰度历史信息进行预先筛选的,所述差异菌相对丰度历史信息是对稳定型冠心病患者和健康人群的相对丰度历史信息进行差异分析得到的;将所述肠道菌群特征数据输入预先建立的机器学习模型中进 行训练,得到稳定型冠心病风险评估模型;利用GridSearchCV算法和Hyperopt算法对所述机器学习模型进行参数调整;利用测试数据对参数调整后的机器学习模型进行测试;根据测试的结果,利用AUROC指标对机器学习模型进行性能评价;利用性能评价合格的稳定型冠心病风险评估模型进行稳定型冠心病的风险评估。本发明实施例充分考虑到稳定型冠心病患者的肠道菌群特征,利用机器学习算法从复杂、繁冗的生物大数据中筛选可用于评估及监测稳定型冠心病风险的、无创的生物标志物,提高评估准确率,弥补了稳定型冠心病临床预警的空白。
实施例中,获得稳定型冠心病患者和健康人群的粪便样本DNA数据。
本实施例中,获得稳定型冠心病患者和健康人群的粪便样本DNA数据之后,利用琼脂糖凝胶方法确定所述粪便样本DNA数据的总量数据和总浓度数据;将所述总量数据与总浓度数据与预设阈值进行比较;根据比较的结果对所述粪便样本DNA数据进行筛选。
实施例中,对所述粪便样本DNA数据进行双端测序处理,得到肠道菌群宏基因组数据。
本实施例中,得到肠道菌群宏基因组数据之后,利用Trimmomatic软件去除肠道菌群宏基因组数据中的接头,并根据预先设定的碱基质量值,对去除接头的肠道菌群宏基因组数据进行修剪;利用FastQC软件对修剪后的肠道菌群宏基因组数据进行质量评估;对所述肠道菌群宏基因组数据进行物种注释分析和功能注释分析,包括:对质量评估合格的肠道菌群宏基因组数据进行物种注释分析和功能注释分析。
具体实施时,在患者接受项目检测后收集其粪便样本,并在30分钟内放入干冰保存,并尽快储存在-80℃冰箱中待测。提取DNA,对提取的核酸物质利用琼脂糖凝胶方法进行质量控制,要求DNA总量≥1μg,DNA总浓度≥20ng/μL,对质量合格的样本进行建库,然后对粪便样本DNA数据进行illumina hiseq4000双端测序,得到每一个样本的双端测序数据,以FASTQ文件存储。FASTQ是一种存储了生物序列(通常是核酸序列)以及相应的质量评价的文本格式,它们都是以ASCII编码的,几乎是高通量测序的标准格式。
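The DNA quality-control rule above (total DNA ≥ 1 μg and concentration ≥ 20 ng/μL) amounts to a simple threshold filter. A minimal sketch follows; the function names and the record keys `id`, `total_ug`, and `conc_ng_per_ul` are hypothetical and chosen only for illustration.

```python
MIN_TOTAL_UG = 1.0          # total DNA threshold stated in the text (>= 1 µg)
MIN_CONC_NG_PER_UL = 20.0   # concentration threshold stated in the text (>= 20 ng/µL)


def passes_dna_qc(total_ug, conc_ng_per_ul):
    """Return True when a sample meets both thresholds."""
    return total_ug >= MIN_TOTAL_UG and conc_ng_per_ul >= MIN_CONC_NG_PER_UL


def filter_samples(samples):
    """Keep the ids of samples that pass QC.

    `samples` is an iterable of dicts with the hypothetical keys
    'id', 'total_ug' and 'conc_ng_per_ul'.
    """
    return [s["id"] for s in samples
            if passes_dna_qc(s["total_ug"], s["conc_ng_per_ul"])]
```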
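In a specific implementation, Trimmomatic is used for quality control of the data, that is, to trim and remove adapters and low-quality sequences from the raw data. Trimmomatic is a widely used read-filtering tool for Illumina data. It supports multithreading and processes data quickly, and it is mainly used to remove adapters from FASTQ sequences and to trim FASTQ reads according to base quality values. It offers paired-end and single-end modes, supports gzip- and bzip2-compressed files, and can convert between the phred-33 and phred-64 quality encodings. FastQC is a Java-based program that quickly assesses the quality of sequencing data. After filtering, FastQC is used to evaluate the quality-controlled data, and its report indicates whether the FASTQ files are of acceptable quality. If they are, downstream analysis proceeds; otherwise the parameters are adjusted and the paired-end data are trimmed again with Trimmomatic. Note that every base in a read has an associated quality value, stored as a letter or symbol whose ASCII code minus the encoding offset (33 for phred-33, 64 for phred-64) gives the score. This value reflects how accurately the base was called. A read is treated as low quality if its quality values are generally low, if its mean quality value is below 20, or if it contains many N bases.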
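The low-quality criterion described above can be sketched as follows. The quality score of a base is its ASCII code minus the encoding offset, which is 33 for phred-33 and 64 for the older phred-64 encoding. The mean-quality cutoff of 20 comes from the text; the `max_n_fraction` cutoff of 0.1 is an assumed value, since the text only says "many N".

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


def is_low_quality(sequence, quality, min_mean_q=20, max_n_fraction=0.1, offset=33):
    """Flag a read whose mean quality is below `min_mean_q`
    or whose fraction of N bases exceeds `max_n_fraction` (assumed cutoff)."""
    scores = phred_scores(quality, offset)
    mean_q = sum(scores) / len(scores)
    n_fraction = sequence.upper().count("N") / len(sequence)
    return mean_q < min_mean_q or n_fraction > max_n_fraction
```

This only illustrates the criterion; in the described pipeline the trimming and filtering are done by Trimmomatic and the assessment by FastQC.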
实施例中,对所述肠道菌群宏基因组数据进行物种注释分析和功能注释分析,得到稳定型冠心病患者和健康人群的相对丰度信息。
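In this embodiment, performing species annotation analysis and functional annotation analysis on the gut microbiota metagenomic data includes: downloading a gut microbiota database containing multiple reference genomes, which cover bacteria, archaea, viruses, and eukaryotes; and, based on that database, performing species annotation of the metagenomic data with MetaPhlAn2 and functional annotation with HUMAnN2.
In this embodiment, MetaPhlAn2 is used for metagenomic species annotation of the quality-controlled data. MetaPhlAn2 draws on more than 17,000 reference genomes, including about 13,500 bacterial and archaeal genomes, 3,500 viral genomes, and 110 eukaryotic genomes. Once the corresponding database has been downloaded, the software provides precise taxonomic assignment and accurate estimates of the relative abundance of each species. It reaches species-level resolution and also supports strain-level identification and tracking. After species annotation and functional annotation of the metagenomic data, the resulting species abundance information is used to build the assessment model.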
本实施例中,采用R软件包vegan分析物种多样性,输入文件为肠道菌群物种丰度数据。LEfSe(LDA Effect Size)有网页运行版本(http://huttenhower.sph.harvard.edu/galaxy/),准备好肠菌物种丰度数据,输入到网页运行版本中,按照默认流程运行,可得到结果,即组间的差异菌群。这里的冠心病肠道菌群特征数据,即从LEfSe分析得到的差异菌物种丰度数据。
实施例中,根据所述相对丰度信息和预先筛选的稳定型冠心病的生物标志物,确定肠道菌群特征数据,所述稳定型冠心病的生物标志物是根据差异菌相对丰度历史信息进行预先筛选的,所述差异菌相对丰度历史信息是对稳定型冠心病患者和健康人群的相对丰度历史信息进行差异分析得到的。
本实施例中,按如下方式对所述稳定型冠心病的生物标志物进行预先筛选:利用Boruta特征选择包对差异菌相对丰度历史信息进行特征选择,确定稳定型冠心病的生物标志物。
本实施例中,按如下方式利用Boruta特征选择包对所述差异菌相对丰度历史信息进行特征选择:根据差异菌相对丰度历史信息,创建阴影特征矩阵;根据所述阴影特征矩阵确定真实特征数据和阴影特征数据;根据所述真实特征数据和阴影特征数据,确定每个差异菌相对丰度历史信息对应的重要度标签;根据所述重要度标签,对差异菌相对丰度历史信息进行特征选择。
本实施例中,所述预先筛选的稳定型冠心病的生物标志物包括:马赛拟杆菌Bacteroides massiliensis,未分类伊格尔兹氏菌Eggerthella unclassified,肺炎克雷伯菌Klebsiella pneumoniae,未分类梭状杆菌Oscillibacter unclassified,未分类副雷沃菌Paraprevotella unclassified,毛螺旋菌科_5_1_63FAA Lachnospiraceae bacterium_5_1_63FAA,粪厌氧棒状菌Anaerostipes hadrus,未分类嗜胆汁菌Bilophila unclassified,腹真杆菌Eubacterium ventriosum,人体普氏菌Prevotella copri,人罗斯拜瑞氏菌Roseburia hominis,肠巴氏杆菌Barnesiella intestinihominis,木茴香类杆菌Bacteroides xylanisolvens,真杆菌Eubacterium hallii,胸膜类杆菌Bacteroides plebeius,巨单胞菌未分类Megamonas unclassified,副杆菌Parabacteroides distasonis,大肠杆菌Escherichia coli。
本实施例中,所述差异菌相对丰度历史信息是对稳定型冠心病患者和健康人群的相对丰度历史信息进行差异分析得到的,包括:所述差异菌相对丰度历史信息是利用LDA Effect Size软件对稳定型冠心病患者和健康人群的相对丰度历史信息进行差异分析得到的。
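In a specific implementation, the Boruta algorithm is used for feature selection. The goal of Boruta is to select the full set of features that are relevant to the dependent variable, rather than the set that minimizes the cost function of one particular model. Its value lies in giving a more complete picture of the factors that influence the dependent variable, which makes feature selection better and more efficient. Boruta is available as a Python feature-selection package; after installing it and supplying the historical relative-abundance data of the differential bacteria, the important features suitable for modeling are obtained. The algorithm proceeds as follows: (1) create shadow features: for each real feature in R, shuffle its values at random to obtain the shadow feature matrix S, and append S to the real features to form the new feature matrix N=[R,S]; (2) train a model with N as input and obtain the feature importances of the real and shadow features; (3) take the maximum importance among the shadow features, and record one hit for each real feature whose importance exceeds it; (4) use the cumulative hits recorded in (3) to mark each feature as important or unimportant; (5) remove the unimportant features and repeat steps (1) to (4) until every feature has been marked.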
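The shadow-feature comparison at the heart of the steps above can be sketched in plain Python. This is a simplified illustration, not the Boruta package: it uses the absolute Pearson correlation with the label as a stand-in importance measure (Boruta itself uses tree-ensemble importances and a statistical test on the hit counts), and the names `pearson_abs` and `shadow_hits` are hypothetical.

```python
import random


def pearson_abs(x, y):
    """Absolute Pearson correlation, used here as a stand-in importance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def shadow_hits(columns, y, n_iter=20, seed=0):
    """Count, per real feature, how often it beats the best shadow feature.

    `columns` is a list of feature columns (lists of numbers), `y` the labels.
    """
    rng = random.Random(seed)
    hits = [0] * len(columns)
    for _ in range(n_iter):
        # Step 1: shadow features are shuffled copies of the real features.
        shadows = []
        for col in columns:
            shadow = list(col)
            rng.shuffle(shadow)
            shadows.append(shadow)
        # Steps 2-3: compare each real importance with the best shadow importance.
        shadow_max = max(pearson_abs(s, y) for s in shadows)
        for j, col in enumerate(columns):
            if pearson_abs(col, y) > shadow_max:
                hits[j] += 1
    return hits
```

Features with many hits would be marked important and features with few hits unimportant; the threshold for that decision is left out of this sketch.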
实施例中,将所述肠道菌群特征数据输入预先建立的机器学习模型中进行训练,得到稳定型冠心病风险评估模型。利用GridSearchCV算法和Hyperopt算法对所述机器学习模型进行参数调整。利用测试数据对参数调整后的机器学习模型进行测试。根据测试的结果,利用AUROC指标对机器学习模型进行性能评价。利用性能评价合格的稳定型冠心病风险评估模型进行稳定型冠心病的风险评估。
本实施例中,将所述肠道菌群特征数据输入预先建立的机器学习模型中进行训练,包括:将所述肠道菌群特征数据输入预先建立的LightGBM机器学习模型进行训练。利用GridSearchCV算法和Hyperopt算法对所述LightGBM机器学习模型进行参数调整;利用测试数据对参数调整后的LightGBM机器学习模型进行测试;根据测试的结果,利用AUROC 指标对LightGBM机器学习模型进行性能评价。
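In this embodiment, GridSearchCV (grid search) tunes parameters by stepping through a specified parameter range, training the learner with each candidate setting, and choosing the setting with the highest accuracy on the validation set; it is essentially a loop-and-compare procedure. LightGBM is a gradient-boosting framework that is typically faster than XGBoost. Compared with traditional algorithms, its advantages include faster training, lower memory use, higher accuracy, support for parallel learning, and the ability to handle large-scale data. Hyperopt is then used to tune the model's parameters further. Hyperopt is a tool that tunes parameters through Bayesian optimization; it is fast and gives good results. In addition, Hyperopt can be combined with MongoDB for distributed tuning, so that relatively good parameters are found quickly.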
本实施例中,采用的是python中的lightgbm包进行LightGBM机器学习构建模型。该模型主要包含两个算法:单边梯度采样(GOSS)和互斥特征绑定(EFB)。GOSS(从减少样本角度):排除大部分小梯度的样本,仅用剩下的样本计算信息增益。每个数据实例有不同的梯度,根据计算信息增益的定义,梯度大的实例对信息增益有更大的影响,因此在采样时,尽量保留梯度大的样本(预先设定阈值,或者最高百分位间),随机去掉梯度小的样本。此措施在相同的采样率下比随机采样获得更准确的结果,尤其是在信息增益范围较大时。EFB(从减少特征角度):捆绑互斥特征,也就是用一个合成特征代替,特别在稀疏特征空间上,许多特征几乎是互斥的(例如许多特征不会同时为非零值)。可以捆绑互斥的特征,将捆绑问题归约到图着色问题,通过贪心算法求得近似解。更具体地,相关参数可以设置如下:
params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
    'nthread': 4,
    'learning_rate': 0.1,
    'num_leaves': 30,
    'max_depth': 5,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
}
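Here, gbdt denotes gradient-boosted decision trees; nthread is the number of threads used for training; learning_rate is the shrinkage coefficient applied to each weak learner; num_leaves is the maximum number of leaves in each base learner (tree); max_depth is the maximum depth of each decision tree; subsample is the row subsampling ratio, in the range (0, 1]; and colsample_bytree controls the fraction of columns (features) randomly sampled for each tree.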
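The gradient-based one-side sampling (GOSS) idea described earlier can be sketched as below: keep the samples with the largest absolute gradients, draw a random subset of the rest, and up-weight that subset so the gradient sums stay roughly unbiased. The function name `goss_sample` and the default rates are assumptions for illustration; LightGBM performs this step internally.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return {sample_index: weight} for one GOSS sampling round.

    The top `top_rate` fraction by |gradient| is kept with weight 1.
    A random `other_rate` fraction of all samples is drawn from the rest
    and amplified by (1 - top_rate) / other_rate.
    """
    n = len(gradients)
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(n * top_rate)
    n_other = int(n * other_rate)
    top, rest = order[:n_top], order[n_top:]
    sampled = random.Random(seed).sample(rest, min(n_other, len(rest)))
    amplify = (1.0 - top_rate) / other_rate
    weights = {i: 1.0 for i in top}
    weights.update({i: amplify for i in sampled})
    return weights
```

With ten samples and the default rates, two high-gradient samples are kept and one low-gradient sample is drawn with weight 8, so the total weight still adds up to about ten.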
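In this embodiment, GridSearchCV and Hyperopt are Python packages; they are installed in Python and then used for parameter tuning. The name GridSearchCV has two parts, GridSearch and CV, that is, grid search and cross-validation. Grid search searches over parameters: within a specified parameter range, it steps through the candidate values, trains the learner with each setting, and selects the setting with the highest accuracy on the validation set. It is essentially a train-and-compare procedure. Hyperopt is a Python library for "distributed asynchronous algorithm configuration / hyperparameter optimization". It removes the need for tedious manual hyperparameter tuning and searches for good hyperparameters automatically. Broadly speaking, a model with hyperparameters can be viewed as a function that is generally non-convex, so Hyperopt can often find more reasonable settings than manual tuning. For models that are hard to tune in particular, it can reach performance comparable to or better than manual tuning in much less time.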
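The "train and compare" loop of grid search can be sketched without any library. The helper below enumerates every combination in a parameter grid and keeps the best-scoring one. `grid_search`, `toy_score`, and the grid values are hypothetical; in practice the scoring function would be a cross-validated metric such as AUROC, and scikit-learn's `GridSearchCV` handles the cross-validation itself.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Exhaustively evaluate every parameter combination.

    `param_grid` maps parameter names to lists of candidate values.
    `score_fn` takes a parameter dict and returns a score (higher is better).
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Stand-in for a validation score; peaks at learning_rate=0.1, num_leaves=30."""
    return -(params["learning_rate"] - 0.1) ** 2 - (params["num_leaves"] - 30) ** 2 / 1000
```

The cost grows with the product of the grid sizes, which is one reason the text pairs grid search with Hyperopt's guided search.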
本实施例中,AUROC的全称是“接受者操作特征曲线下面积”,往往作为一个评价模型预测能力的指标。在讨论AUROC曲线之前,本发明需要理解混淆矩阵(confusion matrix)的概念。一个二元预测可能有4个结果:本发明预测0,而真实类别是0:这被称为真阴性(TN,True Negative);本发明预测0,而真实类别是1:这被称为假阴性(FN,False Negative);本发明预测1,而真实类别是0:这被称为假阳性(FP,False Positive);本发明预测1,而真实类别是1:这被称为真阳性(TP,True Positive)。当比较两个不同模型的时候,使用单一指标常常比使用多个指标更方便,下面本发明基于混淆矩阵计算两个指标,之后本发明会将这两个指标组合成一个:
真阳性率(TPR),即,灵敏度、命中率、召回,定义为TP/(TP+FN)。这一指标对应被正确识别为阳性的阳性数据点占所有阳性数据点的比例。换句话说,TPR越高,本发明遗漏的阳性数据点就越少。
假阳性率(FPR),即,误检率,定义为FP/(FP+TN)。这一指标对应被误认为阳性的阴性数据点占所有阴性数据点的比例。换句话说,FPR越高,本发明错误分类的阴性数据点就越多。
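To combine FPR and TPR into a single metric, the two metrics are first computed at a series of thresholds applied to the model's predicted probability (for example 0.00, 0.01, 0.02, …, 1.00) and then plotted, with FPR on the horizontal axis and TPR on the vertical axis. The resulting curve is the ROC curve, and the metric considered here is the area under that curve (AUC), called AUROC. The diagonal dashed line is the ROC curve of a random predictor, whose AUROC is 0.5. The random predictor usually serves as a baseline for checking whether a model is useful. The higher the AUROC, the better the model's predictive ability.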
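The construction above can be written out directly: sweep a threshold over the predicted scores, compute (FPR, TPR) at each threshold, and integrate the curve with the trapezoidal rule. This is a minimal sketch with hypothetical function names; it assumes binary labels coded 0/1 with at least one example of each class.

```python
def roc_points(y_true, scores):
    """Return ROC points (FPR, TPR), starting at (0, 0).

    A sample is predicted positive when its score is >= the threshold.
    """
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auroc(y_true, scores):
    """Area under the ROC curve by the trapezoidal rule."""
    pts = roc_points(y_true, scores)
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
```

For labels [0, 0, 1, 1] and scores [0.1, 0.4, 0.35, 0.8], the curve passes through (0, 0.5), (0.5, 0.5), (0.5, 1) and (1, 1), giving an area of 0.75.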
下面给出一个具体实施例,说明本发明稳定型冠心病的风险评估方法的具体应用。
1、临床入组标准:
依据冠状动脉粥样硬化性心脏病的临床特点,将病人分为2组,包括:(1)稳定性CAD组(斑块稳定组),即stable CAD组,sCAD,N=213;(2)无动脉粥样硬化斑块的正常对照组,即 normal coronary artery组,NCA,N=175。在临床信息收集的基础上,采集各组人群的新鲜或妥善冷冻的粪便,进行肠道宏基因组测序。
研究人群入选标准:稳定性冠心病(陈旧心梗、PCI史、稳定性心绞痛或无临床缺血症状的“健康人”,同时冠脉CT/造影发现有冠脉狭窄病变>50%)。
排除标准:
1)根据国际通用心肌梗死定义诊断为2-5型心肌梗死;
2)严重心力衰竭/心源性休克(Killip>2级或NYHA>2级);
3)存在机械并发症(室间隔穿孔、游离壁破裂、乳头肌断裂等);
4)发病后曾发生心脏骤停和/或心肺复苏;
5)3月内口服或使用静脉任何抗生素≥1周;
6)3月内急性冠状动脉综合征(ACS)或冠状动脉血管重建(包括PCI和CABG);
7)3月内创伤或手术;
8)3月内脑血管病史(包括脑梗死或脑出血);
9)3月内上消化道或下消化道出血;
10)3月内明确感染(包括消化道、呼吸道、体表感染等);
11)慢性肠道疾病(如克劳恩病、溃疡性结肠炎等等);
12)任何肿瘤;
13)风湿免疫性疾病;
14)慢性肾脏疾病,包括肾脏移植术后。
研究对象入选及病例信息收集过程:
(1)知情同意书;
(2)入选/排除标准;
(3)患者生活方式问卷临床资料;
(4)在临床信息收集的基础上,采集各组人群的血液、新鲜或妥善冷冻的粪便,进行组学分析。
本临床研究遵守《世界医学大会赫尔辛基宣言》和国家相关法规的要求实施。本临床研究方案已获阜外医院的医学伦理委员会批准,所有参与实验的临床患者均已签署本项目《知情同意书》。
2、实施方法:
共有388名参与者在国家心血管病中心、中国医学科学院阜外医院参加了本次研究。根据诊断指南和排除标准将其分为以下两组:NCA组(N=175),sCAD组(N=213)。
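On the morning of the second day after admission, blood samples were collected from patients who had fasted for more than 10 hours, and Fuwai Hospital performed the routine clinical biochemical tests, all following international standard methods. Stool samples were collected at the same time, placed on dry ice within 30 minutes, and transferred as soon as possible to a -80°C freezer until analysis. DNA was extracted and the extracted nucleic acid was quality-controlled with the agarose gel method, requiring total DNA ≥ 1 μg and DNA concentration ≥ 20 ng/μL. Libraries were constructed from the qualifying samples and sequenced paired-end on an Illumina HiSeq 4000. After the raw paired-end metagenomic data were obtained, Trimmomatic was used for quality control to remove low-quality sequences and adapters, and FastQC was used to evaluate the quality-controlled data. MetaPhlAn2 was then used for metagenomic species annotation. After species abundance information was obtained for the gut microbiota of the stable coronary heart disease patients and the normal controls, species diversity was analyzed and LEfSe (LDA Effect Size) was used to analyze between-group differences, yielding the gut microbiota characteristics of coronary heart disease; models were then built at the species level for assessment. Modeling used the LightGBM machine-learning method with ten-by-ten cross-validation, and the data were randomly divided into a training set and a test set. Feature selection was performed first with the Boruta algorithm. GridSearchCV (grid search) and Hyperopt were used to adjust the parameters iteratively and select the best ones. A new batch of external data that had never been used in modeling was then obtained, the finished model was used to predict it, and AUROC was used to judge the quality of the prediction model. The importance of a feature is expressed as its contribution to the model. All analyses used the Python scikit-learn package. Figure 2 shows the AUROC curve on the training set, and Figure 3 shows the stable coronary heart disease biomarkers found to play an important role in the model.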
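One common reading of "ten-by-ten cross-validation" is ten repetitions of 10-fold cross-validation, each with a fresh random shuffle. That reading is an assumption, as the text does not define the term. A minimal index-splitting sketch under that assumption, with the hypothetical name `repeated_kfold`:

```python
import random


def repeated_kfold(n_samples, n_splits=10, n_repeats=10, seed=0):
    """Yield (train_indices, test_indices) for repeated k-fold cross-validation.

    Each repeat reshuffles the sample indices and cuts them into
    `n_splits` folds; every fold serves once as the test set.
    """
    rng = random.Random(seed)
    for _ in range(n_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        folds = [idx[i::n_splits] for i in range(n_splits)]
        for k in range(n_splits):
            test = folds[k]
            train = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield train, test
```

For the 388 participants in this embodiment, this gives 100 train/test splits with 38 or 39 samples in each test fold. Stratifying the folds by group (NCA versus sCAD) would be a natural refinement but is omitted here.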
基于同一发明构思,本发明实施例还提供了一种稳定型冠心病的风险评估装置,如下面的实施例所述。由于这些解决问题的原理与稳定型冠心病的风险评估方法相似,因此装置的实施可以参见方法的实施,重复之处不再赘述。
图4为本发明实施例中稳定型冠心病的风险评估装置的结构图,如图4所示,该装置包括:
DNA数据获得模块401,用于获得稳定型冠心病患者和健康人群的粪便样本DNA数据;
双端测序处理模块402,用于对所述粪便样本DNA数据进行双端测序处理,得到肠道菌群宏基因组数据;
注释分析模块403,用于对所述肠道菌群宏基因组数据进行物种注释分析和功能注释分析,得到稳定型冠心病患者和健康人群的相对丰度信息;
特征数据确定模块404,用于根据所述相对丰度信息和预先筛选的稳定型冠心病的生物标志物,确定肠道菌群特征数据,所述稳定型冠心病的生物标志物是根据差异菌相对丰度历史信息进行预先筛选的,所述差异菌相对丰度历史信息是对稳定型冠心病患者和健康人群的相对丰度历史信息进行差异分析得到的;
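a model training module 405, configured to input the gut microbiota feature data into a pre-established machine learning model for training, to obtain a stable coronary heart disease risk assessment model;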
参数调整模块406,用于利用GridSearchCV算法和Hyperopt算法对所述机器学习模型进行参数调整;
模型测试模块407,用于利用测试数据对参数调整后的机器学习模型进行测试;
性能评价模块408,用于根据测试的结果,利用AUROC指标对机器学习模型进行性能评价;
风险评估模块409,用于利用性能评价合格的稳定型冠心病风险评估模型进行稳定型冠心病的风险评估。
一个实施例中,按如下方式对所述稳定型冠心病的生物标志物进行预先筛选:
利用Boruta特征选择包对差异菌相对丰度历史信息进行特征选择,确定稳定型冠心病的生物标志物。
一个实施例中,按如下方式利用Boruta特征选择包对所述差异菌相对丰度历史信息进行特征选择:
根据差异菌相对丰度历史信息,创建阴影特征矩阵;
根据所述阴影特征矩阵确定真实特征数据和阴影特征数据;
根据所述真实特征数据和阴影特征数据,确定每个差异菌相对丰度历史信息对应的重要度标签;
根据所述重要度标签,对差异菌相对丰度历史信息进行特征选择。
一个实施例中,本发明的稳定型冠心病的生物标志物包括:马赛拟杆菌Bacteroides massiliensis,未分类伊格尔兹氏菌Eggerthella unclassified,肺炎克雷伯菌Klebsiella pneumoniae,未分类梭状杆菌Oscillibacter unclassified,未分类副雷沃菌Paraprevotella unclassified,毛螺旋菌科_5_1_63FAA Lachnospiraceae bacterium_5_1_63FAA,粪厌氧棒状菌Anaerostipes hadrus,未分类嗜胆汁菌Bilophila unclassified,腹真杆菌Eubacterium ventriosum,人体普氏菌Prevotella copri,人罗斯拜瑞氏菌Roseburia hominis,肠巴氏杆菌Barnesiella intestinihominis,木茴香类杆菌Bacteroides xylanisolvens,真杆菌Eubacterium hallii,胸膜类杆菌Bacteroides plebeius,巨单胞菌未分类Megamonas unclassified,副杆菌Parabacteroides distasonis,大肠杆菌Escherichia coli。各生物标志物均为稳定型冠心病发病风险因素,用于评估稳定型冠心病发病风险时的特征重要度参见图3。如果某一项或多项生物标志物相比于健康人的表达丰度差异越大,则个体稳定型冠心病发病风险越高。
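Figure 5 shows the AUROC curve of a model for assessing the onset risk of stable coronary heart disease that was obtained by taking some of the gut microbiota features of the present invention and further integrating total cholesterol level, diabetes, and age, factors traditionally considered closely related to stable coronary heart disease. As can be seen, integrating total cholesterol level, diabetes, and age on top of these gut microbiota features does not significantly strengthen the association with the onset risk of stable coronary heart disease. This indicates that the gut microbiota features of the present invention can be used to assess the onset risk of stable coronary heart disease independently of the traditional clinical risk factors (total cholesterol level, diabetes, and age).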
实施方案二:急性冠脉综合征发病风险评估
为了对急性冠脉综合征进行风险预测,提高预测准确率,本发明实施例提供一种建立急性冠脉综合征的风险预测模型的方法,该方法可以包括:
获得急性冠脉综合征患者和健康人群的粪便样本DNA数据;
对所述粪便样本DNA数据进行双端测序处理,得到肠道菌群宏基因组数据;
利用Trimmomatic软件去除肠道菌群宏基因组数据中的接头,并根据预先设定的碱基质量值,对去除接头的肠道菌群宏基因组数据进行修剪;
利用FastQC软件对修剪后的肠道菌群宏基因组数据进行质量评估;
对质量评估合格的肠道菌群宏基因组数据进行物种注释分析和功能注释分析,得到急性冠脉综合征患者和健康人群的相对丰度信息;
根据所述相对丰度信息和预先筛选的急性冠脉综合征的生物标志物,确定肠道菌群特征数据,所述急性冠脉综合征的生物标志物是根据差异菌相对丰度历史信息进行预先筛选的,所述差异菌相对丰度历史信息是对急性冠脉综合征患者和健康人群的相对丰度历史信息进行差异分析得到的;
将所述肠道菌群特征数据输入预先建立的机器学习模型中进行训练,得到急性冠脉综合征风险预测模型。
进一步,还提供了一种急性冠脉综合征的风险预测的方法,该方法包括:
利用所述急性冠脉综合征风险预测模型进行急性冠脉综合征的风险预测。
具体实施时,差异菌相对丰度历史信息是对急性冠脉综合征患者和健康人群的相对丰度历史信息进行差异分析得到的,包括:所述差异菌相对丰度历史信息是利用LDA Effect Size软件对急性冠脉综合征患者和健康人群的相对丰度历史信息进行差异分析得到的。
其他未详细注明的具体操作可参照实施方案一。
下面给出一个具体实施例,说明本发明急性冠脉综合征的风险评估方法的具体应用。
1、临床入组标准:
依据冠状动脉粥样硬化性心脏病的临床特点,将病人分为2组,包括:(1)ST段抬高急性心肌梗死(STEMI,不稳定斑块破裂组,心肌坏死);非ST段抬高急性心肌梗死(NSTEMI,不稳定斑块部分破裂组,心肌少量坏死)和不稳定心绞痛(UAP,斑块濒临破裂或破裂前不稳 定组,心肌微量坏死),即ACS组,N=212;(2)无动脉粥样硬化斑块的正常对照组,即normal coronary artery组,NCA,N=175。在临床信息收集的基础上,采集各组人群新鲜或妥善冷冻的粪便,进行肠道宏基因组测序。
2、实施方法:
共有387名参与者在国家心血管病中心、中国医学科学院阜外医院参加了本次研究。根据诊断指南和排除标准将其分为以下两组:NCA组(N=175),ACS组(N=212)。
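On the morning of the second day after admission, blood samples were collected from patients who had fasted for more than 10 hours, and Fuwai Hospital performed the routine clinical biochemical tests, all following international standard methods. Stool samples were collected at the same time, placed on dry ice within 30 minutes, and transferred as soon as possible to a -80°C freezer until analysis. DNA was extracted and the extracted nucleic acid was quality-controlled with the agarose gel method, requiring total DNA ≥ 1 μg and DNA concentration ≥ 20 ng/μL. Libraries were constructed from the qualifying samples and sequenced paired-end on an Illumina HiSeq 4000. After the raw paired-end metagenomic data were obtained, Trimmomatic was used for quality control to remove low-quality sequences and adapters, and FastQC was used to evaluate the quality-controlled data. MetaPhlAn2 was then used for metagenomic species annotation. After species abundance information was obtained for the gut microbiota of the acute coronary syndrome patients and the normal controls, species diversity was analyzed and LEfSe (LDA Effect Size) was used to analyze between-group differences, yielding the gut microbiota characteristics of acute coronary syndrome patients; models were then built at the species level for assessment. Modeling used the LightGBM machine-learning method with ten-by-ten cross-validation, and the data were randomly divided into a training set and a test set. Feature selection was performed first with the Boruta algorithm. GridSearchCV (grid search) and Hyperopt were used to adjust the parameters iteratively and select the best ones. A new batch of external data that had never been used in modeling was then obtained, the finished model was used to assess it, and AUROC was used to judge the quality of the assessment model. The importance of a feature is expressed as its contribution to the model. All analyses used the Python scikit-learn package.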
图6为训练集中的AUROC曲线图,图7为筛到的对模型起重要作用的急性冠脉综合征的生物标志物。各生物标志物均为急性冠脉综合征发病风险因素,用于评估急性冠脉综合征发病风险时的特征重要度参见图7。如果某一项或多项生物标志物相比于健康人的表达丰度差异越大,则个体急性冠脉综合征发病风险越高。
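Figure 8 shows the AUROC curve of a model for assessing the onset risk of acute coronary syndrome that was obtained by taking the gut microbiota features of the present invention and further integrating total cholesterol level, age, and hypertension, factors traditionally considered closely related to acute coronary syndrome risk. Comparing it with Figure 6 shows that further integrating total cholesterol level, age, and hypertension does not markedly strengthen the association with the onset risk of acute coronary syndrome. This indicates that the gut microbiota features of the present invention can be used to assess the onset risk of acute coronary syndrome independently of the traditional clinical risk factors (total cholesterol level, age, and hypertension).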
实施方案三:针对稳定型冠心病的急性冠脉综合征发病风险评估
为了对急性冠脉综合征进行风险评估预测,提高预测准确率,本发明实施例提供一种建立针对稳定型冠心病的急性冠脉综合征风险预测模型的方法,该方法可以包括:
获得急性冠脉综合征患者和稳定型冠心病患者的粪便样本DNA数据;
利用琼脂糖凝胶方法确定所述粪便样本DNA数据的总量数据和总浓度数据;
将所述总量数据与总浓度数据与预设阈值进行比较,根据比较的结果对所述粪便样本DNA数据进行筛选;
对筛选出的粪便样本DNA数据进行双端测序处理,得到肠道菌群宏基因组数据;
对所述肠道菌群宏基因组数据进行物种注释分析和功能注释分析,得到急性冠脉综合征患者和稳定型冠心病患者的相对丰度信息;
根据所述相对丰度信息和预先筛选的急性冠脉综合征的生物标志物,确定肠道菌群特征数据,所述急性冠脉综合征的生物标志物是根据差异菌相对丰度历史信息进行预先筛选的,所述差异菌相对丰度历史信息是对急性冠脉综合征患者和稳定型冠心病患者的相对丰度历史信息进行差异分析得到的;
将所述肠道菌群特征数据输入预先建立的机器学习模型中进行训练,得到急性冠脉综合征风险预测模型。
进一步,还提供了一种针对稳定型冠心病的急性冠脉综合征的风险评估的方法,该方法包括:
利用所述急性冠脉综合征风险预测模型进行针对稳定型冠心病的急性冠脉综合征风险预测。
具体实施时,所述差异菌相对丰度历史信息是对急性冠脉综合征患者和稳定型冠心病患者的相对丰度历史信息进行差异分析得到的,包括:所述差异菌相对丰度历史信息是利用LDA Effect Size软件对急性冠脉综合征患者和稳定型冠心病患者的相对丰度历史信息进行差异分析得到的。
其他未详细注明的具体操作可参照实施方案一。
下面给出一个具体实施例,说明本发明针对稳定型冠心病的急性冠脉综合征风险预测方法的具体应用。
1、临床入组标准:
依据冠状动脉粥样硬化性心脏病的临床特点,将病人分为2组,包括:(1)ST段抬高急性心肌梗死(STEMI,不稳定斑块破裂组,心肌坏死);非ST段抬高急性心肌梗死(NSTEMI,不稳定斑块部分破裂组,心肌少量坏死)和不稳定心绞痛(UAP,斑块濒临破裂或破裂前不稳 定组,心肌微量坏死),即ACS组,N=212;(2)稳定性CAD组(斑块稳定组),即stable CAD组,N=213。在临床信息收集的基础上,采集各组人群新鲜或妥善冷冻的粪便,进行肠道宏基因组测序。
2、实施方法:
共有425名参与者在国家心血管病中心、中国医学科学院阜外医院参加了本次研究。根据诊断指南和排除标准将其分为以下两组:sCAD组(N=213),ACS组(N=212)。
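On the morning of the second day after admission, blood samples were collected from patients who had fasted for more than 10 hours, and Fuwai Hospital performed the routine clinical biochemical tests, all following international standard methods. Stool samples were collected at the same time, placed on dry ice within 30 minutes, and transferred as soon as possible to a -80°C freezer until analysis. DNA was extracted and the extracted nucleic acid was quality-controlled with the agarose gel method, requiring total DNA ≥ 1 μg and DNA concentration ≥ 20 ng/μL. Libraries were constructed from the qualifying samples and sequenced paired-end on an Illumina HiSeq 4000. After the raw paired-end metagenomic data were obtained, Trimmomatic was used for quality control to remove low-quality sequences and adapters, and FastQC was used to evaluate the quality-controlled data. MetaPhlAn2 was then used for metagenomic species annotation. After species abundance information was obtained for the gut microbiota of the acute coronary syndrome patients and the stable coronary heart disease patients, species diversity was analyzed and LEfSe (LDA Effect Size) was used to analyze between-group differences, yielding the gut microbiota characteristics of acute coronary syndrome patients; models were then built at the species level for prediction. Modeling used the LightGBM machine-learning method with ten-by-ten cross-validation, and the data were randomly divided into a training set and a test set. Feature selection was performed first with the Boruta algorithm. GridSearchCV (grid search) and Hyperopt were used to adjust the parameters iteratively and select the best ones. A new batch of external data that had never been used in modeling was then obtained, the finished model was used to predict it, and AUROC was used to judge the quality of the prediction model. The importance of a feature is expressed as its contribution to the model. All analyses used the Python scikit-learn package.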
图9为训练集中的AUROC曲线图,图10为筛到的对模型起重要作用的急性冠脉综合征的生物标志物。各生物标志物均为急性冠脉综合征发病风险因素,用于评估急性冠脉综合征发病风险时的特征重要度参见图10。如果某一项或多项生物标志物相比于健康人的表达丰度差异越大,则个体急性冠脉综合征发病风险越高。
图11显示了在本发明的肠道菌群特征因素的基础上,进一步整合传统认为与急性冠脉综合征风险密切相关的总胆固醇水平和年龄因素,所获得的用于对急性冠脉综合征发病风险进行评估的模型的AUROC曲线。将其与图9相比,可以看出,进一步整合总胆固醇水平和年龄因素后,与急性冠脉综合征发病风险的关联强度并没有特别显著地提升,可表明本发明的肠道菌群特征因素可独立于传统临床危险因素(总胆固醇水平和年龄)之外用于评估针对稳定型冠心病的急性冠脉综合征发病风险。
本领域内的技术人员应明白,本发明的实施例可提供为方法、系统、或计算机程序产品。 因此,本发明可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且,本发明可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。
本发明是参照根据本发明实施例的方法、设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器,使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。
这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作的计算机可读存储器中,使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品,该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。
这些计算机程序指令也可装载到计算机或其他可编程数据处理设备上,使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机实现的处理,从而在计算机或其他可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。
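Finally, it should be noted that the embodiments described above are only specific implementations of the present invention. They illustrate its technical solutions and do not limit them, and the scope of protection of the present invention is not restricted to them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that anyone familiar with the technical field may, within the technical scope disclosed by the present invention, still modify the technical solutions recorded in those embodiments, readily conceive of variations, or make equivalent substitutions for some of their technical features. Such modifications, variations, or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments, and all of them fall within the scope of protection of the present invention. The scope of protection of the present invention shall therefore be determined by the scope of protection of the claims.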

Claims (13)

  1. 检测个体信息的试剂在制备心血管病发病风险评估装置中的应用,其中,所述个体信息包括肠道菌群信息,所述肠道菌群包括至少10种肠道菌种,且所述肠道菌种为基于心血管病患者与健康人的肠道菌群宏基因组数据筛选的差异菌种。
  2. 根据权利要求1所述的应用,其中,所述心血管病为稳定型冠心病、急性冠脉综合征或针对稳定型冠心病的急性冠脉综合征。
  3. 根据权利要求2所述的应用,其中:
    (1)当所述心血管病为稳定型冠心病时,所述肠道菌群包括:
    马赛拟杆菌(Bacteroides massiliensis),未分类伊格尔兹氏菌(Eggerthella unclassified),肺炎克雷伯菌(Klebsiella pneumoniae),未分类梭状杆菌(Oscillibacter unclassified),未分类副雷沃菌(Paraprevotella unclassified),毛螺旋菌科_5_1_63FAA(Lachnospiraceae bacterium_5_1_63FAA),粪厌氧棒状菌(Anaerostipes hadrus),未分类嗜胆汁菌(Bilophila unclassified),人罗斯拜瑞氏菌(Roseburia hominis),腹真杆菌(Eubacterium ventriosum),人体普氏菌(Prevotella copri),肠巴氏杆菌(Barnesiella intestinihominis),木茴香类杆菌(Bacteroides xylanisolvens),真杆菌(Eubacterium hallii),巨单胞菌未分类(Megamonas unclassified),胸膜类杆菌(Bacteroides plebeius),副杆菌(Parabacteroides distasonis)以及大肠杆菌(Escherichia coli);
    (2)当所述心血管病为急性冠脉综合征时,所述肠道菌群包括:
    长双歧杆菌(Bifidobacterium longum),毛螺旋菌科_5_1_63FAA(Lachnospiraceae bacterium_5_1_63FAA),另枝菌属(Alistipes onderdonkii),产气柯林斯菌(Collinsella aerofaciens),真杆菌(Eubacterium eligens),普氏栖粪杆菌(Faecalibacterium prausnitzii),普通拟杆菌(Bacteroides vulgatus),未分类颤杆菌(Oscillibacter unclassified),卵形拟杆菌(Bacteroides ovatus)以及凸腹真桿菌(Eubacterium ventriosum);
    (3)当所述心血管病为针对稳定型冠心病的急性冠脉综合征时,所述肠道菌群包括:
    长双歧杆菌(Bifidobacterium longum),咽峡炎链球菌(Streptococcus anginosus),陪伴粪球菌(Coprococcus comes),产气柯林斯菌(Collinsella aerofaciens),普氏栖粪杆菌(Faecalibacterium prausnitzii),卵形拟杆菌(Bacteroides ovatus),厌氧棍状菌属(Anaerotruncus colihominis),脆弱拟杆菌(Bacteroides fragilis),霍尔德曼氏菌(Holdemania filiformis),直肠真杆菌(Eubacterium rectale)以及唾液链球菌(Streptococcus salivarius)。
  4. 根据权利要求3所述的应用,其中,所述肠道菌群中各菌在评估心血管病发病风险时的特征重要度按照以下顺序:
    马赛拟杆菌﹥未分类伊格尔兹氏菌﹥肺炎克雷伯菌﹥未分类梭状杆菌=未分类副雷沃菌﹥毛螺旋菌科_5_1_63FAA﹥粪厌氧棒状菌﹥未分类嗜胆汁菌﹥人罗斯拜瑞氏菌=腹真杆菌=人体普氏菌﹥肠巴氏杆菌﹥木茴香类杆菌=真杆菌﹥巨单胞菌未分类=胸膜类杆菌=副杆菌﹥大肠杆菌;或者
    长双歧杆菌﹥毛螺旋菌科_5_1_63FAA﹥另枝菌属﹥产气柯林斯菌﹥真杆菌﹥普氏栖粪杆菌﹥普通拟杆菌=未分类颤杆菌﹥卵形拟杆菌﹥凸腹真桿菌;或者
    长双歧杆菌﹥咽峡炎链球菌﹥陪伴粪球菌=产气柯林斯菌﹥普氏栖粪杆菌﹥卵形拟杆菌=厌氧棍状菌属﹥脆弱拟杆菌﹥霍尔德曼氏菌﹥直肠真杆菌=唾液链球菌。
  5. 根据权利要求3所述的应用,其中,所述肠道菌群中各菌在评估心血管病发病风险时,所述肠道菌群中各菌按照以下特征重要度数值确定权重,或者,所述肠道菌群中各菌的权重比值为以下第一组、第二组或第三组:
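    Group 1: Bacteroides massiliensis, 23; Eggerthella unclassified, 19; Klebsiella pneumoniae, 16; Oscillibacter unclassified, 15; Paraprevotella unclassified, 15; Lachnospiraceae bacterium_5_1_63FAA, 13; Anaerostipes hadrus, 11; Bilophila unclassified, 10; Roseburia hominis, 8; Eubacterium ventriosum, 8; Prevotella copri, 8; Barnesiella intestinihominis, 6; Bacteroides xylanisolvens, 5; Eubacterium hallii, 5; Megamonas unclassified, 4; Bacteroides plebeius, 4; Parabacteroides distasonis, 4; Escherichia coli, 1;
    Group 2: Bifidobacterium longum, 47; Lachnospiraceae bacterium_5_1_63FAA, 44; Alistipes onderdonkii, 43; Collinsella aerofaciens, 32; Eubacterium eligens, 31; Faecalibacterium prausnitzii, 30; Bacteroides vulgatus, 28; Oscillibacter unclassified, 28; Bacteroides ovatus, 20; Eubacterium ventriosum, 14;
    Group 3: Bifidobacterium longum, 13; Streptococcus anginosus, 11; Coprococcus comes, 10; Collinsella aerofaciens, 10; Faecalibacterium prausnitzii, 9; Bacteroides ovatus, 8; Anaerotruncus colihominis, 8; Bacteroides fragilis, 7; Holdemania filiformis, 6; Eubacterium rectale, 4; Streptococcus salivarius, 4.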
  6. 根据权利要求1所述的应用,其中,所述个体信息还包括总胆固醇水平、高血压、糖尿病、年龄中的一项或多项。
  7. 根据权利要求1-6任一项所述的应用,其中,所述个体来自东亚人群。
  8. 一种心血管病发病风险评估装置,其包括检测单元和数据分析单元,其中:
    所述检测单元用于检测个体信息,获得检测结果;其中,所述个体信息同权利要求1-5任一项中所述个体信息;
    所述数据分析单元用于对检测单元的检测结果进行分析处理。
  9. 根据权利要求8所述的心血管病发病风险评估装置,其中,所述检测单元包括检测粪便样本DNA数据的试剂材料。
  10. 根据权利要求9所述的心血管病发病风险评估装置,其中,所述数据分析单元用于对检测单元的检测结果进行分析处理的过程包括:
    对所述粪便样本DNA数据进行双端测序处理,得到肠道菌群宏基因组数据;
    对所述肠道菌群宏基因组数据进行物种注释分析和功能注释分析,得到肠道菌群中各菌的相对丰度信息;
    根据所述相对丰度信息,确定肠道菌群特征数据。
  11. 根据权利要求8所述的心血管病发病风险评估装置,其中:
    所述数据分析单元对检测单元的检测结果进行分析处理时,包括:将个体信息的检测结果配以权重系数,以计算所述待测个体的风险评估得分。
  12. 一种计算机设备,其包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,其中,所述处理器执行所述计算机程序时实现:基于待测个体信息获得个体心血管病发病风险评估结果;
    其中,所述个体信息同权利要求1至5中任一所述个体信息。
  13. 一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序指令,所述计算机程序指令被执行时实现:基于待测个体信息获得个体心血管病发病风险评估结果;
    其中,所述个体信息同权利要求1-5中任一项所述个体信息。
PCT/CN2022/075241 2021-02-05 2022-01-30 心血管病发病风险评估肠道菌群标志物及其应用 WO2022166934A1 (zh)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
CN202110157645.6 2021-02-05
CN202110157644.1 2021-02-05
CN202110157590.9A CN112509635A (zh) 2021-02-05 2021-02-05 针对稳定型冠心病的急性冠脉综合征风险预测方法及装置
CN202110157590.9 2021-02-05
CN202110157645.6A CN112509701A (zh) 2021-02-05 2021-02-05 急性冠脉综合征的风险预测方法及装置
CN202110157644.1A CN112509700A (zh) 2021-02-05 2021-02-05 稳定型冠心病的风险预测方法及装置

Publications (1)

Publication Number Publication Date
WO2022166934A1 true WO2022166934A1 (zh) 2022-08-11

Family

ID=82740859

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/075241 WO2022166934A1 (zh) 2021-02-05 2022-01-30 心血管病发病风险评估肠道菌群标志物及其应用

Country Status (1)

Country Link
WO (1) WO2022166934A1 (zh)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017028312A1 (en) * 2015-08-20 2017-02-23 Bgi Shenzhen Biomarkers for coronary heart disease
CN107075563A (zh) * 2014-09-30 2017-08-18 深圳华大基因科技有限公司 用于冠状动脉疾病的生物标记物
CN110392741A (zh) * 2016-12-16 2019-10-29 Md保健株式会社 通过细菌宏基因组分析来诊断心脏病的方法
CN112509701A (zh) * 2021-02-05 2021-03-16 中国医学科学院阜外医院 急性冠脉综合征的风险预测方法及装置
CN112509635A (zh) * 2021-02-05 2021-03-16 中国医学科学院阜外医院 针对稳定型冠心病的急性冠脉综合征风险预测方法及装置
CN112509700A (zh) * 2021-02-05 2021-03-16 中国医学科学院阜外医院 稳定型冠心病的风险预测方法及装置

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107075563A (zh) * 2014-09-30 2017-08-18 深圳华大基因科技有限公司 用于冠状动脉疾病的生物标记物
WO2017028312A1 (en) * 2015-08-20 2017-02-23 Bgi Shenzhen Biomarkers for coronary heart disease
CN110392741A (zh) * 2016-12-16 2019-10-29 Md保健株式会社 通过细菌宏基因组分析来诊断心脏病的方法
CN112509701A (zh) * 2021-02-05 2021-03-16 中国医学科学院阜外医院 急性冠脉综合征的风险预测方法及装置
CN112509635A (zh) * 2021-02-05 2021-03-16 中国医学科学院阜外医院 针对稳定型冠心病的急性冠脉综合征风险预测方法及装置
CN112509700A (zh) * 2021-02-05 2021-03-16 中国医学科学院阜外医院 稳定型冠心病的风险预测方法及装置


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22749213

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22749213

Country of ref document: EP

Kind code of ref document: A1