CN112852916A

CN112852916A - Marker combination for intestinal microecology, auxiliary diagnosis model and application of marker combination

Info

Publication number: CN112852916A
Application number: CN202110192133.3A
Authority: CN
Inventors: 王普清; 毛良伟
Original assignee: Individual
Current assignee: Individual
Priority date: 2021-02-19
Filing date: 2021-02-19
Publication date: 2021-05-28

Abstract

The invention discloses a marker combination for intestinal microecology, an auxiliary diagnosis model and applications thereof. The marker combination includes the following microorganisms: Scardovia, Ruminococcus (unnamed; ruminococcus noname), Bilophila, Bacteroides, Gemella, Alistipes, Oxalobacter, Solobacterium, Bifidobacterium and Clostridium (unnamed; clostridium noname). The model of the invention can provide high-accuracy, noninvasive auxiliary diagnosis of Parkinson's disease, with an accuracy of up to 80.3%.

Description

Marker combination for intestinal microecology, auxiliary diagnosis model and application of marker combination

Technical Field

The invention relates to the fields of medicine, biology and bioinformatics, in particular to a marker combination for intestinal microecology, an auxiliary diagnosis model and application thereof.

Background

Parkinson's disease (PD) is a multifocal, progressive neurodegenerative disease and the second most common neurodegenerative disease after Alzheimer's disease. By 2015, about 6.2 million people worldwide had PD, of whom about 117,000 died from it. Pathologically, PD is characterized primarily by degeneration of nigral dopaminergic neurons, striatal dopamine depletion, and the formation of Lewy bodies (abnormal protein aggregates) within neurons. Clinically, the main features of PD are resting tremor, rigidity, bradykinesia and gait abnormalities, which are also regarded as the four cardinal signs of PD. Other characteristics include freezing of gait, postural instability, dysphasia, autonomic dysfunction, paresthesia, mood disorders, sleep disorders, cognitive decline and dementia. Many PD patients show gastrointestinal dysfunction before motor symptoms develop. A range of PD-associated gastrointestinal dysfunctions has been identified clinically, including weight loss, gastroparesis, constipation and dyschezia. In recent years, metagenomic studies have further explored the association between Parkinson's disease and abnormalities of the intestinal flora, which can be regarded as an extension of the gastrointestinal hypothesis to the intestinal flora.

The intestinal flora consists of bacterial communities in the gastrointestinal tract that are symbiotic with the human host. The development of the intestinal flora is influenced by many factors, such as diet, antibiotic treatment, type of delivery and breast feeding. A healthy and stable intestinal flora plays a crucial role in maintaining intestinal barrier integrity, a homeostatic balance of function, metabolism and immunity, and regulating the gut-brain axis. Recent studies have highlighted the effects of the gut flora on the gut-brain axis and its potential role in central nervous system related disorders and neuropsychiatric disorders such as multiple sclerosis, autism, depression and schizophrenia. Intestinal flora and microbial metabolites are known to significantly interfere with metabolism, cognition, behavior and immunity of the host, and thus the role of intestinal flora and microbial metabolites in the pathogenesis of PD is of increasing interest and has recently shown some phenotypic correlations. For example, changes in the number and composition of gut microbiota and microbial metabolites are found in PD patients. Therefore, understanding the early interactions between the gut flora and the development of PD will open new avenues for intervention, especially for early diagnosis and treatment of PD.

At present, disease diagnostic models based on the intestinal flora have been reported, such as diagnostic models for colorectal cancer and ulcerative colitis and predictive models for coronary artery disease. However, apart from a diagnostic model for Alzheimer's disease based on the intestinal flora, drugs that treat neurodegenerative diseases by targeting the intestinal flora (such as GV-971 for Alzheimer's disease) and intestinal-flora diagnostic models for diseases of the central nervous system are mostly still under development. Because early diagnostic markers for Parkinson's disease are lacking, most patients are diagnosed at an advanced stage and the prognosis is poor. Since the diagnosis of Parkinson's disease relies on complicated rating scales and on the physician's experience, a novel diagnostic marker and an efficient diagnostic model for Parkinson's disease are urgently needed to improve prognosis.

From Rehman A et al., Geographical patterns of the standing and active human gut microbiome in health and IBD. Gut 65, 238-248 (2016); Kushugulova A et al., Metagenomic analysis of gut microbial communities from a Central Asian population. BMJ Open 8, e021682 (2018); and Deschasaux M et al., Depicting the composition of gut microbiota in a population with varied ethnic origins but shared geography. Nat Med 24, 1526-1531 (2018), it is known that the intestinal flora correlates strongly with diet and ethnicity. Because the dietary structure in the West is very different, a more precise diagnostic method for the local population is necessary (S16).

In mainland China, gut microbiome diversity has been analyzed by organizations in five cities (Beijing, Shanghai, Guangzhou, vinblastic and jin) using 16S rRNA amplicon sequencing technology. However, large-scale population research shows that disease diagnosis models based on the intestinal flora are strongly region-dependent, and that regional factors affect different diseases differently. Therefore, it is necessary to find intestinal flora diagnostic markers for Parkinson's disease and to construct an auxiliary diagnosis model for the auxiliary diagnosis of Parkinson's disease in the population of the selected region.

Disclosure of Invention

The prior art lacks a high-accuracy, noninvasive intestinal flora diagnostic marker and auxiliary diagnosis model for Parkinson's disease. To solve this technical problem, the invention provides a marker combination for intestinal microecology, an auxiliary diagnosis model and applications thereof. The diagnostic markers are selected from the intestinal flora to detect Parkinson's disease; markers targeting the intestinal microecology can serve as a potential noninvasive diagnostic tool for Parkinson's disease in a given region, and the diagnostic accuracy for Parkinson's disease can reach 80.3%.

In the metagenomic analysis of the collected samples, the inventors found that most of the intestinal microbial reads of Parkinson's disease patients and healthy people mapped to the bacterial kingdom. Further diversity analysis showed that α diversity was higher in Parkinson's disease patients than in healthy people at both the genus and species levels, and that disease state was associated with changes in intestinal microorganisms. The β diversity assessment showed that differences at higher taxonomic levels were more pronounced than differences at lower taxonomic levels. Therefore, the inventors selected genera whose abundance differed significantly between groups, used a random forest model to verify the role of these genera in predicting the class of the samples to be tested, and constructed a Parkinson's disease auxiliary diagnosis model based on the random forest model.

A first aspect of the invention provides a marker combination for intestinal microecology, the marker combination comprising the following microorganisms: Scardovia, Ruminococcus (unnamed; ruminococcus noname), Bilophila, Bacteroides, Gemella, Alistipes, Oxalobacter, Solobacterium, Bifidobacterium and Clostridium (unnamed; clostridium noname).

In a preferred embodiment of the present invention, the marker combination further comprises: Roseburia, Anaerostipes, Parasutterella, Megamonas, Klebsiella, Butyricimonas, Collinsella, Shigella, Subdoligranulum and Flavonifractor.

The marker combination is suitable for the Xiangyang area of Hubei Province.

A second aspect of the invention provides a combination of reagents comprising reagents capable of detecting a combination of markers as described in the first aspect.

In a preferred embodiment of the present invention, the reagent combination comprises a reagent for PCR or a reagent for sequencing.

Preferably, the reagent combination comprises a marker combination as described in the first aspect.

A third aspect of the invention provides the use of a marker combination as described in the first aspect or a combination of reagents as described in the second aspect for the preparation of a diagnostic agent for the diagnosis of parkinson's disease.

A fourth aspect of the invention provides a diagnostic aid model comprising:

(1) an input module, used for inputting the microbial taxonomic characterization and relative abundance information of the sample to be tested, to obtain the top-10 or top-20 genera ranked by mean decrease accuracy (MDA);

(2) the processing module calls a prediction function by adopting a random forest classifier, and predicts the source of the sample to be detected based on the flora of the genus of the top 10 or the top 20 in the step (1);

the random forest classifier obtains characteristic flora of the known sample based on the microbial taxonomy characterization and the relative abundance information of the known sample;

the random forest classifier is defined as follows: randomForest(class ~ ., data = data_df, ntree = 1000, nPerm = 50, mtry = floor(sqrt(ncol(data_df) - 1)), proximity = T, importance = T); wherein class is a dataset of information on taxonomic characterization and relative abundance of microorganisms for the known sample;

the prediction function is defined as follows: predict(rf, newdata = test_df, type = "response"); wherein test_df is the information in (1).
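The mtry expression in the classifier definition above, floor(sqrt(ncol(data_df) - 1)), can be worked through with a small sketch. It is written in Python rather than R purely for illustration; the function name and the example column counts are hypothetical, and the sketch assumes that data_df holds one class column plus one column per genus.

```python
import math


def mtry_for(n_columns: int) -> int:
    """Return floor(sqrt(ncol - 1)), where ncol counts the class column
    plus the feature (genus) columns, as in the definition above."""
    if n_columns < 2:
        raise ValueError("need a class column and at least one feature column")
    return math.floor(math.sqrt(n_columns - 1))


# Hypothetical table: 1 class column + 100 genus columns.
print(mtry_for(101))  # 10
```

Under this reading, each split considers roughly the square root of the number of genera as candidate variables.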

In a preferred embodiment of the present invention, the characteristic flora consists of the top-10 genera ranked by mean decrease accuracy; and/or the microbial taxonomic characterization and relative abundance information of the known sample and the sample to be tested is obtained by microbiome metagenomic analysis, for example with MetaPhlAn2.

In a more preferred embodiment of the present invention, the characteristic flora consists of the top-20 genera ranked by mean decrease accuracy.

The sample to be tested may be intestinal secretions conventional in the art, preferably faeces, e.g. from a subject in the region of Xiangyang, Hubei.

In a preferred embodiment of the present invention, the auxiliary diagnostic model further comprises (0) a pre-processing module, and/or (3) an output module, wherein the pre-processing module performs extraction, library construction and sequencing on the DNA of the sample, obtains a metagenome original reading of the DNA of the sample, removes noise, and transmits the noise-removed information to the input module; the output module is used for outputting the prediction result of the processing module;

wherein noise removal means: performing quality inspection on the raw metagenomic reads and trimming low-quality sequences to obtain the metagenomic reads of the microbial DNA of the sample to be tested.

Preferably, the quality inspection is realized by second-generation sequencing quality control software such as FastQC, SolexaQA or PRINSEQ; and/or, the pruning of low quality sequences is achieved by metagenomic sequencing quality control software such as KneadData;

more preferably, the KneadData parameters are set as follows: "SLIDINGWINDOW:4:20 MINLEN:50"; and/or the noise removal further comprises: deleting unwanted human DNA reads after trimming low-quality sequences, the unwanted human DNA reads being deleted with the parameters "--very-sensitive --dovetail".

A fifth aspect of the invention provides a method of obtaining a characteristic population of a known sample, comprising: obtaining a characteristic flora by adopting a random forest classifier based on the microbial taxonomy characterization and relative abundance information of a known sample;

wherein the random forest classifier is defined as follows: randomForest(class ~ ., data = data_df, ntree = 1000, nPerm = 50, mtry = floor(sqrt(ncol(data_df) - 1)), proximity = T, importance = T); wherein class is a dataset of information on the taxonomic characterization and relative abundance of microorganisms for the known sample.

In a preferred embodiment of the present invention, the characteristic flora consists of the top-10 genera ranked by mean decrease accuracy.

And/or, the microbial taxonomic characterization and relative abundance information of the known sample is obtained by microbiome metagenomic analysis, for example with MetaPhlAn2.

In an embodiment of the invention, the method further comprises evaluating the accuracy of the random forest classifier.

In one embodiment of the invention, the accuracy of the random forest classifier is assessed by cross-validation; the cross-validation is preferably selected from simple cross-validation, k-fold cross-validation and leave-one-out cross-validation, for example leave-one-out cross-validation.

The advantage of leave-one-out cross-validation is that the maximum possible number of samples is used for training in each iteration, and the procedure is deterministic because no random partitioning of the samples is involved. With this maximum number of training samples, a more accurate classifier may be obtained.
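The leave-one-out procedure described above can be sketched as follows. This is a minimal Python illustration, not the R workflow of the invention: the function names, the toy data and the nearest-mean stand-in classifier are all assumptions, and a random forest would take the place of the stand-in in practice.

```python
from typing import Callable, Dict, List, Sequence

Predictor = Callable[[Sequence[float]], str]


def leave_one_out_accuracy(
    features: Sequence[Sequence[float]],
    labels: Sequence[str],
    fit: Callable[[List[Sequence[float]], List[str]], Predictor],
) -> float:
    """Hold out each sample once, train on all the others, and return the
    fraction of held-out samples that are predicted correctly."""
    n = len(labels)
    correct = 0
    for held_out in range(n):
        train_x = [x for i, x in enumerate(features) if i != held_out]
        train_y = [y for i, y in enumerate(labels) if i != held_out]
        model = fit(train_x, train_y)
        if model(features[held_out]) == labels[held_out]:
            correct += 1
    return correct / n


def fit_nearest_mean(train_x: List[Sequence[float]], train_y: List[str]) -> Predictor:
    """Toy stand-in classifier (NOT a random forest): assign the class whose
    mean value of feature 0 is closest to the sample's feature 0."""
    means: Dict[str, float] = {}
    for label in set(train_y):
        values = [x[0] for x, y in zip(train_x, train_y) if y == label]
        means[label] = sum(values) / len(values)
    return lambda x: min(means, key=lambda label: abs(x[0] - means[label]))


# Made-up abundances for six samples, for illustration only.
toy_features = [[0.1], [0.2], [0.3], [0.8], [0.9], [1.0]]
toy_labels = ["SP", "SP", "SP", "PD", "PD", "PD"]
loo_accuracy = leave_one_out_accuracy(toy_features, toy_labels, fit_nearest_mean)
```

Because every sample is held out exactly once, repeating the run on the same data gives the same split sequence, which is the determinism referred to above.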

In a preferred embodiment of the present invention, the number of decision trees of the random forest classifier is 1000 (ntree = 1000), the number of candidate feature variables at each node of each tree is the square root of the number of columns of the matrix minus one (rounded down), and the seed is set to 2019613.

The random forest classifier and the cross validation are completed through an R language.

In a preferred embodiment of the invention, the information on relative abundance is the difference in abundance of the bacterial population at different taxonomic levels assessed based on α diversity and β diversity.

Preferably, the method for assessing alpha diversity is a t test, preferably a Student's t test; the beta diversity assessment method comprises the following steps: genus abundance nonparametric permutation multivariate analysis of variance (PERMANOVA) and principal coordinates analysis (PCoA) based on the Bray-Curtis distance.

The nonparametric permutational multivariate analysis of variance preferably evaluates how the samples cluster under predictors such as disease status, sex and age; for example, using the vegan package (version 2.5-4).

The principal coordinate analysis (PCoA) visualizes the clustering of the samples.

The α diversity is calculated as the Shannon index and/or species richness.

In a more preferred embodiment of the present invention, the method further comprises a pretreatment step of: extracting, constructing a library and sequencing the DNA of the known sample to obtain the metagenome original reading of the DNA of the known sample and removing noise;

wherein noise removal means: performing quality inspection on the raw metagenomic reads and trimming low-quality sequences to obtain the metagenomic reads of the microbial DNA of the known sample.

The quality inspection is realized by second-generation sequencing quality control software; preferably FastQC, SolexaQA or PRINSEQ; for example FastQC.

The pruning of the low-quality sequence is realized by metagenome sequencing quality control software; preferably KneadData.

The KneadData parameters are set as follows: "SLIDINGWINDOW:4:20 MINLEN:50".

The noise removal further includes: deleting unwanted human DNA reads after trimming low-quality sequences; the parameters for deleting the unwanted human DNA reads are "--very-sensitive --dovetail".

A sixth aspect of the invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the functions of the auxiliary diagnostic model according to the fourth aspect.

On the basis of the common knowledge in the field, the above preferred conditions can be combined randomly to obtain the preferred embodiments of the invention.

The positive progress effects of the invention are as follows:

Based on the selected intestinal flora diagnostic markers, the model of the invention can provide high-accuracy, noninvasive auxiliary diagnosis of Parkinson's disease, with an accuracy of up to 80.3%. If patients also learn the approximate composition of their own intestinal flora at the time of diagnosis, subsequent treatment can be more effective.

Drawings

FIG. 1 is an analysis of α and β diversity for example 1;

wherein: (a) richness of the PD group and the SP group at the genus level, (b) Shannon index of the PD group and the SP group at the genus level, (c) richness of the PD group and the SP group at the species level, (d) Shannon index of the PD group and the SP group at the species level, (e) PCoA analysis of the Bray-Curtis distance between samples;

FIG. 2 shows the genus-level abundance differences between the PD group and the SP group.

FIG. 3 shows the species-level abundance differences between the PD group and the SP group.

FIG. 4 is a differential bacterial group clade plot of the gut microbiome of the PD group and the SP group.

Fig. 5 is the top 10 most important genera of characteristic bacteria for the diagnostic model, determined by the random forest classifier MDA.

Fig. 6 is the top 20 most important genera of characteristic bacteria for the diagnostic model, determined by the random forest classifier MDA.

FIG. 7 is a ROC curve predicting the occurrence of PD in a patient cohort.

Detailed Description

The invention is further illustrated by the following examples, which are not intended to limit the scope of the invention. The experimental methods without specifying specific conditions in the following examples were selected according to the conventional methods and conditions, or according to the commercial instructions.

Example 1

The embodiment comprises the following steps:

First, selection of the patient cohort

This example is a cross-sectional study with 78 subjects, all enrolled from the city of Xiangyang, Hubei Province. To reduce the potential influence of factors such as diet and daily routine, the subjects were enrolled as couples: in each couple, one spouse was a Parkinson's disease patient (PD group) and the other served as the control (SP group).

The diagnostic criteria for PD followed the 2015 diagnostic criteria of the International Parkinson and Movement Disorder Society (MDS). The first core criterion is to determine whether the patient has parkinsonism: a patient who exhibits bradykinesia combined with resting tremor and/or muscular rigidity is considered to have parkinsonism. Once parkinsonism is established, the patient is evaluated against the supportive criteria, the exclusion criteria and the warning signs (red flags), and is classified as a clinically probable PD patient.

Exclusion criteria for the PD group were:

(1) use of oral or intravenous antibiotics or probiotics within the previous three months;

(2) severe gastrointestinal disorders;

(3) obvious mental illness;

(4) platelet count below 80 × 10⁹/L;

(5) prothrombin time (PT) > 15 s;

(6) a history of bleeding in any organ.

No specific diet was required before fecal collection, and each sample was taken from the first bowel movement of the day.

Second, fecal DNA extraction, DNA library construction and sequencing

Fecal DNA was extracted according to the protocol provided by MetaHIT, DNA concentration was determined with Qubit (Invitrogen), and a DNA library was constructed according to the manufacturer's (MGI, China) instructions. Specifically, a paired-end library with an insert size of 350 bp was constructed from 500 ng of DNA and sequenced on the BGISEQ-500 sequencer in PE100 mode. A total of 1761.8 GB of raw sequencing data was obtained for the 78 stool samples.

Third, denoising and taxonomic analysis of metagenomic reads

The shotgun metagenomic data were processed according to the SOP of Microbiome Helper (https://github.com/LangilleLab/Microbiome_Helper/wiki/Metagenomics-Tutorial-Humann2). FastQC was used to check the quality of the raw metagenomic reads, and KneadData was used to trim low-quality sequences (parameters: "SLIDINGWINDOW:4:20 MINLEN:50") and to delete unwanted human genome (HG19) reads (parameters: "--very-sensitive --dovetail").
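The trimming and host-read removal step above can be sketched as a command line assembled in Python. This is only a sketch under stated assumptions: the file names, database path and output directory are hypothetical; the two parameter strings are the ones given in the text; and passing `--input` once per mate for paired-end data is an assumption that should be checked against the installed KneadData version, since some versions use separate flags for the two mates.

```python
import shlex
from typing import List


def kneaddata_command(read1: str, read2: str, human_db: str, out_dir: str) -> List[str]:
    """Build (but do not run) a KneadData call with the trimming and
    host-removal parameters stated in the text."""
    return [
        "kneaddata",
        "--input", read1,            # forward reads (hypothetical file name)
        "--input", read2,            # reverse reads (hypothetical file name)
        "--reference-db", human_db,  # human reference index (hypothetical path)
        "--output", out_dir,
        "--trimmomatic-options", "SLIDINGWINDOW:4:20 MINLEN:50",
        "--bowtie2-options", "--very-sensitive --dovetail",
    ]


cmd = kneaddata_command("sample_R1.fastq", "sample_R2.fastq", "hg19_db", "kneaddata_out")
print(shlex.join(cmd))  # inspect the command before passing it to subprocess.run
```

Building the argument list separately from execution makes it easy to confirm that the quoted parameter strings reach the tool as single arguments.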

After trimming and filtering with KneadData, more than 4.05 × 10⁹ high-quality 100 bp paired-end reads were obtained in total, of which 4.52 × 10⁷ (1.12%) were human reads. After removal of host contamination, the mean number of reads per sample was 5.31 × 10⁷ ± 1.58 × 10⁷ in the PD group and 4.95 × 10⁷ ± 2.26 × 10⁷ in the SP group (Student's t-test, P = 0.41). The mean number of host reads was 6.07 × 10⁵ ± 1.01 × 10⁶ in the PD group and 5.53 × 10⁵ ± 1.71 × 10⁶ in the SP group (Student's t-test, P = 0.87).

MetaPhlAn2 uses unique clade-specific markers to detect the taxonomic clades present in microbiome samples and to estimate their relative abundance. The processed reads were taxonomically characterized and their abundance was estimated using the default parameters of MetaPhlAn2.
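A genus-level table such as the one used later for modelling could be pulled out of a MetaPhlAn2 profile roughly as follows. This is a minimal sketch: it assumes the usual tab-separated profile layout in which clade names are joined with "|" and genus entries end in a "g__" token; the function name and the abundance values in the example are made up for illustration.

```python
from typing import Dict, Iterable


def genus_abundances(profile_lines: Iterable[str]) -> Dict[str, float]:
    """Keep only the rows whose deepest clade is a genus (g__...)."""
    genera: Dict[str, float] = {}
    for line in profile_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header
        clade, value = line.split("\t")[:2]
        deepest = clade.split("|")[-1]
        if deepest.startswith("g__"):
            genera[deepest[3:]] = float(value)
    return genera


# Made-up profile fragment, for illustration only.
example_profile = [
    "#SampleID\tMetaphlan2_Analysis",
    "k__Bacteria\t100.0",
    "k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides\t60.0",
    "k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides_vulgatus\t60.0",
    "k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Bifidobacteriales|f__Bifidobacteriaceae|g__Scardovia\t0.5",
]
genus_table = genus_abundances(example_profile)
```

Species rows are skipped because their deepest token starts with "s__", so each genus is counted once.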

Most reads of the examined samples mapped to the bacterial kingdom: 98.61 ± 5.45% in the PD group and 99.87 ± 0.41% in the SP group (Mann-Whitney U test, P = 0.67). A smaller proportion mapped to viruses: 1.36 ± 5.42% in the PD group and 0.13 ± 0.41% in the SP group (Mann-Whitney U test, P = 0.89).

Alpha diversity was estimated by the Shannon index and species richness, and Student's t-test was used to compare alpha diversity between groups.
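The two alpha diversity measures named above can be illustrated with a short sketch. The function names are hypothetical, and the use of the natural logarithm for the Shannon index is an assumption, since the text does not state the log base.

```python
import math
from typing import Sequence


def richness(abundances: Sequence[float]) -> int:
    """Number of taxa with non-zero abundance in one sample."""
    return sum(1 for a in abundances if a > 0)


def shannon_index(abundances: Sequence[float]) -> float:
    """Shannon index H = -sum(p_i * ln p_i) over taxa with p_i > 0
    (natural log assumed)."""
    total = sum(abundances)
    if total <= 0:
        raise ValueError("sample has no counts")
    proportions = [a / total for a in abundances if a > 0]
    return -sum(p * math.log(p) for p in proportions)


# Four equally abundant taxa give the maximum H for richness 4, i.e. ln(4).
even_sample = [25.0, 25.0, 25.0, 25.0]
h_even = shannon_index(even_sample)
```

Richness only counts which taxa are present, whereas the Shannon index also reflects how evenly the abundance is spread across them.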

This example analyzes the richness and the Shannon index of the microbiome at the genus and species levels, respectively. The genus richness of the PD group was significantly higher than that of the SP group (53.15 ± 7.69 vs. 48.56 ± 7.29, Student's t-test, P = 0.004) (fig. 1a). The genus-level Shannon index of the PD group was also significantly higher than that of the SP group (2.08 ± 0.38 vs. 1.76 ± 0.42, Student's t-test, P = 0.0002) (fig. 1b). Similar trends were observed at the lower taxonomic level: the species richness (115.69 ± 21.07 vs. 106.26 ± 17.43, Student's t-test, P = 0.017) (fig. 1c) and the Shannon index (2.77 ± 0.53 vs. 2.54 ± 0.51, Student's t-test, P = 0.028) (fig. 1d) of the PD group were significantly higher than those of the SP group. The results show that the diversity of the gut microbiome is significantly higher in PD patients than in healthy people. Thus, in this example, higher gut microbiome richness and a higher Shannon index may not indicate a healthy gut microbiome.

For the beta diversity assessment, non-parametric permutational multivariate analysis of variance (PERMANOVA) based on the Bray-Curtis distance matrix was performed on the genus abundance of all samples, to assess how the samples clustered under predictors such as disease status, gender and age, and how these predictors related to the composition of intestinal microorganisms. Principal coordinate analysis (PCoA) plots were then used to visualize the overall difference in microbial communities between the two groups.

PERMANOVA was performed with the vegan package (version 2.5-4).
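The Bray-Curtis distance underlying the PERMANOVA and PCoA steps above can be sketched in a few lines. This only illustrates the pairwise distance itself, not the vegan implementation; the function name and the example vectors are hypothetical.

```python
from typing import Sequence


def bray_curtis(sample_a: Sequence[float], sample_b: Sequence[float]) -> float:
    """Bray-Curtis dissimilarity: sum|a_i - b_i| / sum(a_i + b_i).
    0 means identical profiles; 1 means no shared taxa."""
    if len(sample_a) != len(sample_b):
        raise ValueError("samples must list the same taxa in the same order")
    numerator = sum(abs(a - b) for a, b in zip(sample_a, sample_b))
    denominator = sum(a + b for a, b in zip(sample_a, sample_b))
    if denominator == 0:
        raise ValueError("both samples are empty")
    return numerator / denominator


# Made-up genus abundance vectors, for illustration only.
d_partial = bray_curtis([10.0, 0.0, 5.0], [0.0, 10.0, 5.0])   # 20 / 30
d_same = bray_curtis([3.0, 4.0], [3.0, 4.0])
d_disjoint = bray_curtis([1.0, 0.0], [0.0, 1.0])
```

Computing this value for every pair of samples gives the distance matrix on which the permutation test and the ordination operate.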

In this example, disease status was associated with differences in intestinal microorganisms between the groups, and the effects of age and sex were relatively independent. The PCoA plot revealed a certain degree of separation between the healthy controls and the PD patients. The first two principal coordinates explained 41.63% and 13.81% of the variance, respectively (fig. 1e).

Differences in the abundance of taxa between the PD and SP groups were identified by the linear discriminant analysis (LDA) effect size method (LEfSe).

Only bacterial taxa with P < 0.05 (Kruskal-Wallis test) and an LDA score > 2 were considered significantly enriched.

According to the analysis, the gut microbiome in the samples consisted mainly of 3 phyla: Bacteroidetes (PD 54.79 ± 16.42%, SP 61.49 ± 12.88%, Mann-Whitney U-test, P = 0.09), Firmicutes (PD 28.90 ± 14.76%, SP 30.34 ± 13.17%, Mann-Whitney U-test, P = 0.47) and Proteobacteria (PD 12.34 ± 17.36%, SP 7.04 ± 6.82%, Mann-Whitney U-test, P = 0.43). Notably, Actinobacteria (PD 1.54 ± 2.11%, SP 0.56 ± 0.77%, Mann-Whitney U-test, P = 0.01) and Synergistetes (PD 2.52 ± 7.26%, SP 0.33 ± 1.12%, Mann-Whitney U-test, P = 0.01) differed significantly between the groups, with abundance significantly increased in the PD group. These results indicate that, at high taxonomic levels, the gut microbiome differs significantly between the PD and SP groups. This also suggests that corresponding changes may occur at lower taxonomic levels.

As shown in fig. 2-4, a total of 71 bacterial taxa were identified in this example as differing in abundance between the two groups. The LEfSe algorithm revealed differences in 1 phylum, 2 classes, 3 orders, 7 families, 14 genera and 44 species. Enrichment at the genus and species levels is shown in fig. 2 and fig. 3, respectively. As shown in fig. 4, in the PD group, p_Actinobacteria, c_Actinobacteria, o_Bifidobacteriales, f_Bifidobacteriaceae and g_Scardovia were enriched at different taxonomic levels of the same clade. In addition, the taxa c_Deltaproteobacteria, o_Desulfovibrionales, f_Desulfovibrionaceae, g_Desulfovibrio and g_Bilophila also showed consistent enrichment at different taxonomic levels. In the SP group, f_Bacteroidaceae and g_Bacteroides belong to the same clade and show a similar enrichment trend.

Fourth, construction of the disease auxiliary diagnosis model

To determine fecal bacterial features for the disease classification of metagenomic samples, the study used a random forest (RF) classifier and evaluated its accuracy by leave-one-out cross-validation: in each iteration, one sample was held out as the validation set and the remaining samples were used as the training set to fit the random forest, and the probability of correctly predicting the held-out samples was calculated.

A prediction model was constructed based on the relative abundance of the gut microflora of the 78 subjects. The number of decision trees in the RF was set to 1000 (ntree = 1000), the number of candidate feature variables at each node of each tree was the square root of the number of columns of the matrix minus one, and the seed was set to 2019613. The variables with the greatest classification power were determined by analyzing mean decrease accuracy (MDA), and the random forest classifier was finally established.

An ROC curve is established and the area under the ROC curve (AUC) is calculated for evaluating the accuracy of the new standard on disease prediction.

In this example, the random forest algorithm was used to classify the samples by disease status and to establish a diagnostic model. One advantage of the random forest algorithm is that it can estimate the importance of each feature and identify the most important features in the classification process. As shown in fig. 5 and 6, the 10 most important genera in the random forest classifier based on MDA are Scardovia, ruminococcus noname, Bilophila, Bacteroides, Gemella, Alistipes, Oxalobacter, Solobacterium, Bifidobacterium and Clostridiales noname; the top 20 most important genera also include Roseburia, Anaerostipes, Parasutterella, Megamonas, Klebsiella, Butyricimonas, Collinsella, Shigella, Subdoligranulum and Flavonifractor. These were verified as characteristic bacterial groups. To improve the results of the random forest classifier, models were constructed using the top 10 MDA features and the top 20 MDA features.
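The idea behind mean decrease accuracy can be illustrated with a simplified permutation sketch: shuffle one feature, and see how much the accuracy drops. This is not the out-of-bag, per-tree computation performed by a random forest implementation; the function names, the threshold classifier and the toy data are hypothetical, and the seed and the 50 permutations are simply borrowed from the values quoted in this document.

```python
import random
from typing import Callable, List, Sequence


def mean_decrease_accuracy(
    predict: Callable[[Sequence[float]], str],
    rows: List[List[float]],
    labels: List[str],
    column: int,
    n_perm: int = 50,
    seed: int = 2019613,
) -> float:
    """Average drop in accuracy after shuffling one feature column."""
    rng = random.Random(seed)

    def accuracy(data: List[List[float]]) -> float:
        hits = sum(1 for row, label in zip(data, labels) if predict(row) == label)
        return hits / len(labels)

    baseline = accuracy(rows)
    total_drop = 0.0
    for _ in range(n_perm):
        shuffled = [row[column] for row in rows]
        rng.shuffle(shuffled)
        permuted = [row[:column] + [value] + row[column + 1:]
                    for row, value in zip(rows, shuffled)]
        total_drop += baseline - accuracy(permuted)
    return total_drop / n_perm


def toy_predict(row: Sequence[float]) -> str:
    """Stand-in classifier that only looks at feature 0."""
    return "PD" if row[0] > 0.5 else "SP"


# Made-up data: feature 0 separates the classes, feature 1 is ignored.
toy_rows = [[0.1, 3.0], [0.2, 1.0], [0.3, 2.0], [0.2, 5.0], [0.1, 4.0],
            [0.9, 2.0], [0.8, 4.0], [0.7, 1.0], [0.9, 5.0], [0.8, 3.0]]
toy_labels = ["SP"] * 5 + ["PD"] * 5
informative_mda = mean_decrease_accuracy(toy_predict, toy_rows, toy_labels, column=0)
ignored_mda = mean_decrease_accuracy(toy_predict, toy_rows, toy_labels, column=1)
```

A feature the classifier relies on loses accuracy when shuffled, while an ignored feature scores zero; ranking genera by this drop is what yields a top-10 or top-20 list.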

This example uses the ROC curve and the area under the curve (AUC) to evaluate the performance of the RF binary classifier. As shown in fig. 7, the ordinate is sensitivity and the abscissa is specificity. Using all genera, PD could be distinguished from SP with an AUC of 0.663; using the variables selected by the LEfSe method, the AUC was 0.760. The AUC was 0.795 using the top 10 MDA features and 0.803 using the top 20 MDA features, so the diagnostic accuracy was further improved.
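The AUC values reported above can be understood through the rank-based definition of the area under the ROC curve: the probability that a randomly chosen PD sample receives a higher score than a randomly chosen SP sample. The sketch below illustrates that definition only; the function name and the scores are made up and are not data from this example.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[str], positive: str = "PD") -> float:
    """AUC as the fraction of (positive, negative) pairs in which the
    positive sample has the higher score; ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = [s for s, y in zip(scores, labels) if y != positive]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up predicted PD probabilities for four samples.
example_auc = roc_auc([0.9, 0.4, 0.6, 0.1], ["PD", "PD", "SP", "SP"])
print(example_auc)  # 0.75
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which gives a scale for reading the values 0.663 to 0.803.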

The above workflow was completed in R using the randomForest package (version 4.6-14).

Claims

1. A marker combination for intestinal microecology, wherein the marker combination comprises the following microorganisms: Scardovia, Ruminococcus (unnamed; ruminococcus noname), Bilophila, Bacteroides, Gemella, Alistipes, Oxalobacter, Solobacterium, Bifidobacterium and Clostridium (unnamed; clostridium noname).

2. The marker combination of claim 1, wherein said marker combination further comprises: Roseburia, Anaerostipes, Parasutterella, Megamonas, Klebsiella, Butyricimonas, Collinsella, Shigella, Subdoligranulum and Flavonifractor.

3. A reagent combination comprising reagents capable of detecting a marker combination according to claim 1 or 2, such as reagents for PCR or sequencing; preferably, the reagent combination further comprises a marker combination according to claim 1 or 2.

4. Use of a marker combination according to claim 1 or 2 or a reagent combination according to claim 3 for the preparation of a diagnostic agent for the diagnosis of parkinson's disease.

5. An aided diagnosis model, comprising:

the prediction function is defined as follows: predict(rf, newdata = test_df, type = "response"); wherein test_df is the information in (1);

preferably, the characteristic flora consists of the top-10 genera ranked by mean decrease accuracy; and/or the microbial taxonomic characterization and relative abundance information of the known sample and the sample to be tested is obtained by microbiome metagenomic analysis, for example with MetaPhlAn2;

more preferably, the characteristic flora consists of the top-20 genera ranked by mean decrease accuracy;

even more preferably, the sample to be tested is feces, such as feces from a subject in the Hubei Xiangyang region.

6. The aided diagnosis model of claim 5, further comprising (0) a pre-processing module, and/or (3) an output module, wherein the pre-processing module performs extraction, library construction and sequencing on the DNA of the sample, obtains metagenome original reading of the DNA of the sample, removes noise, and transmits the noise-removed information to the input module; the output module is used for outputting the prediction result of the processing module;

wherein, the noise removal means: performing quality inspection on the metagenome original reading, and trimming a low-quality sequence to obtain the metagenome reading of the microbial DNA of the sample to be detected;

7. A method for obtaining a population characteristic of a known sample, comprising: based on the microbial taxonomy characterization and relative abundance information of the known sample, obtaining the characteristic flora of the known sample by adopting a random forest classifier;

wherein the random forest classifier is defined as follows: randomForest(class ~ ., data = data_df, ntree = 1000, nPerm = 50, mtry = floor(sqrt(ncol(data_df) - 1)), proximity = T, importance = T); wherein class is a dataset of information on taxonomic characterization and relative abundance of microorganisms for the known sample;

preferably, the characteristic flora consists of the top-10 genera ranked by mean decrease accuracy;

more preferably, the characteristic flora consists of the top-20 genera ranked by mean decrease accuracy; and/or the microbial taxonomic characterization and relative abundance information of the known sample is obtained by microbiome metagenomic analysis, for example with MetaPhlAn2;

further preferably, the method further comprises the step of evaluating the accuracy of the random forest classifier; for example, the accuracy of the random forest classifier is assessed by cross-validation; the cross-validation is preferably selected from simple cross-validation, k-fold cross-validation and leave-one-out cross-validation, for example leave-one-out cross-validation.

8. The method of claim 7, wherein the information on relative abundance is the difference in abundance of the bacterial population at different taxonomic levels assessed based on α diversity and β diversity;

preferably, the α diversity is calculated as the Shannon index and/or species richness and is assessed by a t test, preferably Student's t test; the β diversity is assessed by nonparametric permutational multivariate analysis of variance of genus abundance based on the Bray-Curtis distance, for example using the vegan package (version 2.5-4), and by principal coordinate analysis.

9. The method of claim 7, further comprising the step of preprocessing: extracting, constructing a library and sequencing the DNA of the known sample to obtain the metagenome original reading of the DNA of the known sample and removing noise;

wherein, the noise removal means: performing quality inspection on the metagenome original reading, and trimming a low-quality sequence to obtain the metagenome reading of the microbial DNA of the known sample;

10. A computer-readable storage medium, characterized in that the computer-readable medium stores a computer program which, when being executed by a processor, carries out the functions of an auxiliary diagnostic model as claimed in claim 5 or 6.