CN114566224B - Model for identifying or distinguishing people at different altitudes and application thereof - Google Patents

Model for identifying or distinguishing people at different altitudes and application thereof

Publication number
CN114566224B
Authority
CN
China
Prior art keywords
cag
enterobacter
bifidobacterium
citrobacter
microorganism
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210221736.6A
Other languages
Chinese (zh)
Other versions
CN114566224A (en)
Inventor
韩洋
何昆仑
姚咏明
田亚平
赵晓静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chinese PLA General Hospital
Original Assignee
Chinese PLA General Hospital
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chinese PLA General Hospital
Priority to CN202210221736.6A
Publication of CN114566224A
Application granted
Publication of CN114566224B

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6888 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms
    • C12Q1/689 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms for bacteria
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/30 Unsupervised data analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The application discloses a model for identifying or distinguishing populations at different altitudes and its application, and in particular provides the use of microorganisms in constructing classification models for populations at different altitudes. The application discloses 26 microorganisms in total and, for the first time, their use in identifying or distinguishing populations at different altitudes.

Description

Model for identifying or distinguishing people at different altitudes and application thereof
Technical Field
The application belongs to the field of biomedicine, and particularly relates to a model for identifying or distinguishing people at different altitudes and application thereof.
Background
Chronic mountain sickness (CMS), also called chronic altitude sickness, is a common altitude disease. It is a syndrome of maladaptation to the high-altitude environment that arises when the body, under long-term stimulation by external hypoxia, keeps regulating itself but fails to reach a new physiological equilibrium. "Plateau" usually refers to a region at an altitude of 2500 m or more that produces obvious biological effects (reactions of the organism). The plateau environment strongly affects human health, and chronic altitude diseases occur in such regions. Hypoxia of the external environment is their main cause, and the diseases are seen in both native residents and migrants, especially migrants. The Qinghai-Tibet Plateau in China has pronounced plateau environmental characteristics, and in recent years more and more cadres from inland China have been stationed in Tibet to take part in the construction of the region. Because of the great environmental differences between the plateau and the inland plain, the health problems of inland personnel stationed in Tibet have become increasingly prominent. Strengthening research on chronic altitude stress is therefore of great guiding significance for accelerating the development of the plateau.
At present, acute altitude stress has been studied extensively, whereas chronic altitude stress still lacks systematic and in-depth study. As a result, many personnel stationed in Tibet for long periods develop high-altitude heart disease of varying severity after completing their assignments, such as myocardial hypertrophy, cardiac enlargement, ischemia, hypoxia and myocardial overload. They also show high-altitude polycythemia of varying severity, with manifestations including increased blood viscosity, slowed blood flow, thrombosis, myocardial infarction, cerebral thrombosis and retinal vein thrombosis. Such individuals may also develop disorders of neural energy metabolism, leading to cerebral edema, intracranial hypertension, and spot-like lesions or necrosis of brain tissue.
Studying the gut microbiota structure of personnel in high-altitude areas and analyzing the risk factors for chronic altitude stress is therefore important for preventing chronic altitude stress.
Disclosure of Invention
The application aims to provide a classification model for plain and plateau populations. To achieve this purpose, the application adopts the following technical solutions:
In one aspect, the application provides a system for identifying or distinguishing populations at different altitudes, comprising the following units:
1) A detection unit: comprising a microorganism detection module;
2) An analysis unit: taking the abundance levels of the microorganisms detected by the detection unit as input variables and inputting them into a classification model for populations at different altitudes for analysis;
3) An evaluation unit: outputting the probability that the individual corresponding to the sample belongs to the plain population or the plateau population;
the microorganisms are selected from one or more of s_Bacteroides intestinalis CAG_564, s_Bifidobacterium bifidum CAG_234, s_Bifidobacterium longum, s_Bifidobacterium merycicum, s_Bifidobacterium subtile, s_Citrobacter sp. 36-4CPA, s_Citrobacter sp. BIDMC107, s_Coriobacterium glomerans, s_Eggerthella lenta, s_Eggerthella sp. Y7918, s_Enterobacter cloacae complex Hoffmann cluster IV, s_Enterobacter hormaechei, s_Enterobacter sp. BWH52, s_Enterobacter sp. MGH 25, s_Eubacterium hallii CAG_12, s_Eubacterium sp. 14-2, s_Klebsiella sp. 1_55, s_Kluyvera ascorbata, s_Kluyvera cryocrescens, s_Lactobacillus sanfranciscensis, s_Ruminococcus sp. SR1_5, s_Succinatimonas sp. CAG_777, s_Sutterella sp. 54_7, s_394, s_466, s_Clostridiales bacterium VE-15.
In another aspect, the application provides the use of microorganisms in constructing a classification model for populations at different altitudes, the microorganisms being selected from one or more of the 26 microorganisms listed above for the system.
As one embodiment, the classification model is built using one or more algorithms selected from: XGBoost (XGB), random forest (RF), glmnet, cforest, classification and regression tree (CART), treebag, k-nearest neighbors (kNN), neural network (nnet), support vector machine with radial kernel (SVM-radial), support vector machine with linear kernel (SVM-linear), naive Bayes (NB), or multilayer perceptron (mlp).
In another aspect, the application provides a composition for identifying or distinguishing populations at different altitudes, the composition comprising reagents for measuring the abundance levels of microorganisms selected from one or more of the 26 microorganisms listed above for the system.
As one embodiment, the reagent comprises a reagent for measuring the abundance level of a microorganism by 16S rRNA sequencing, whole genome sequencing, quantitative polymerase chain reaction, PCR-pyrosequencing, fluorescent in situ hybridization, microarray, or PCR-ELISA.
As one embodiment, the reagent comprises a primer, a probe, an antisense oligonucleotide, an aptamer, or an antibody.
In another aspect, the application provides the use of a composition as hereinbefore described for the manufacture of a means for identifying or distinguishing populations at different altitudes.
As one embodiment, the means comprises a chip, a kit, a test strip or a high throughput sequencing platform.
In another aspect, the application provides a method of identifying or distinguishing populations at different altitudes, the method comprising detecting the abundance of microorganisms selected from one or more of the 26 microorganisms listed above for the system.
As one embodiment, the different altitude areas include a plateau area and a plain area.
Drawings
FIG. 1 is a graph of the contribution of each feature;
FIG. 2 is a graph showing the relationship between the number of features and the AUC value;
FIG. 3 is the ROC curve of the optimal model.
Detailed Description
Hereinafter, the present application will be described in detail by way of examples of the present application with reference to the accompanying drawings. However, the following examples are presented as illustrations of the present application, and when it is determined that a detailed description of a technology or structure known to those skilled in the art to which the present application pertains is likely to unnecessarily obscure the gist of the present application, a detailed description thereof may be omitted, and the present application is not limited thereto. The present application can be variously modified and applied within the description of the scope of the following claims and the equivalents explained thereby.
Also, the terms used in the present specification are terms used for properly expressing the preferred embodiments of the present application, which may vary according to the intention of a user or operator or a convention in the art to which the present application pertains, etc. Accordingly, these terms should be defined below based on the contents throughout the specification. In the entire specification, when a certain portion "includes" a certain structural element, unless specifically stated to the contrary, this does not mean that other structural elements are excluded, but means that other structural elements may also be included.
Unless defined otherwise, all technical and scientific terms used in this specification have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. Any methods and materials similar or equivalent to those described in this specification can be used in the practice or testing of the present application, although the preferred materials and methods are described herein.
To construct a classification model for populations at different altitudes, the application collected samples from Han Chinese living on the plain and Han Chinese living on the plateau, performed sequencing and bioinformatics analysis, and screened out an optimal classification model comprising 26 microorganisms. This is the first time these 26 microorganisms have been used as classification factors for populations at different altitudes.
In the present application, the term "difference in abundance" means that a higher or lower level of microorganisms is obtained in a population at different altitudes than in the control group.
In the present application, any method known in the art may be used to detect or determine the level of a microbial marker. These methods include, but are not limited to, sequence amplification methods using primers and immunological methods using antigen-antibody reactions. The sequence amplification method using primers may be, for example, polymerase chain reaction (PCR), reverse transcription-polymerase chain reaction (RT-PCR), multiplex PCR, touchdown PCR, hot start PCR, nested PCR, synergistic PCR, real-time PCR, differential PCR, rapid amplification of cDNA ends, inverse polymerase chain reaction, vector-mediated PCR, thermal asymmetric interlaced PCR, ligase chain reaction, repair chain reaction, transcription-mediated amplification, self-sustained sequence replication, or selective amplification of a target base sequence. The immunological method using the antigen-antibody reaction may be, for example, western blotting, enzyme-linked immunosorbent assay, radioimmunoassay, Ouchterlony immunodiffusion, rocket immunoelectrophoresis, tissue immunostaining, immunoprecipitation assay, complement fixation assay, fluorescence-activated cell sorting, protein chip, or the like, but the scope of the present application is not limited thereto.
In the present application, the reagent for measuring the abundance level of the microorganism may be a primer, a probe, an antisense oligonucleotide, an aptamer, or an antibody.
The term "primer" refers to a nucleic acid sequence of 7 to 50 nucleotides that can form base pairs complementary to the template strand and serves as the starting point for replication of the template strand. Primers are usually synthesized, but naturally occurring nucleic acids may also be used. The sequence of the primer need not be exactly the same as the sequence of the template, but must be sufficiently complementary to hybridize with the template. Additional features that do not alter the basic properties of the primer may be incorporated. Examples of such features include, but are not limited to, methylation, capping, substitution of one or more nucleic acids with homologs, and modification between nucleic acids.
The term "hybridization" refers to the annealing of two complementary nucleic acid strands to each other under suitably stringent conditions. Hybridization is typically performed using nucleic acid molecules of probe length. Nucleic acid hybridization techniques are well known in the art. Those skilled in the art know how to estimate and adjust the stringency of hybridization conditions such that sequences with at least a desired degree of complementarity will hybridize stably, while sequences with lower complementarity will not hybridize stably.
The term "probe" refers to a molecule that binds to a particular sequence or subsequence or other portion of another molecule. Unless otherwise indicated, the term "probe" generally refers to a polynucleotide probe that is capable of binding to another polynucleotide (often referred to as a "target polynucleotide") by complementary base pairing. Depending on the stringency of the hybridization conditions, the probe is able to bind to a target polynucleotide that lacks complete sequence complementarity with the probe. Probes may be labeled directly or indirectly, and include primers. Hybridization means include, but are not limited to: solution phase, solid phase, mixed phase or in situ hybridization assays.
The term "oligonucleotide" refers to a short polymer composed of deoxyribonucleotides, ribonucleotides, or any combination thereof. The length of an oligonucleotide is typically between about 10 nucleotides and about 100 nucleotides. The oligonucleotides are preferably 15 nucleotides to 70 nucleotides in length, most typically 20 nucleotides to 26 nucleotides in length. Oligonucleotides may be used as primers or probes.
The term "aptamer" is ribonucleic acid and single-stranded deoxyribonucleic acid that fold through hydrogen bonding between intra-strand bases to form stable hairpin, stem-loop, pseudoknot, pocket, bulge loop, G-quadruplex and other secondary or tertiary structures, and that produce high affinity and specific binding that is spatially matched to the target.
In the present application, the term "antibody" is used in the broadest sense and specifically covers, for example, monoclonal antibodies, polyclonal antibodies, antibodies with multi-epitope specificity, single chain antibodies, multi-specific antibodies and antibody fragments. Such antibodies may be chimeric, humanized, human and synthetic.
The area under the receiver operating characteristic curve (AUC) is an indicator of the performance or accuracy of a diagnostic procedure. The accuracy of a diagnostic method is best described by its receiver operating characteristic (ROC). A ROC plot is a line graph of all the sensitivity/specificity pairs obtained by continuously varying the decision threshold over the entire range of observed data.
The clinical performance of a laboratory test depends on its diagnostic accuracy, that is, its ability to correctly classify subjects into clinically relevant subgroups. Diagnostic accuracy measures the ability of a test to correctly distinguish two different conditions of the subjects under investigation.
In each case, the ROC plot depicts the overlap between the two distributions by plotting sensitivity against 1-specificity over the entire range of decision thresholds. The y-axis shows sensitivity, or the true positive fraction [defined as (number of true positive test results)/(number of true positive + number of false negative test results)]. This is also referred to as positivity in the presence of a disease or condition, and it is calculated only from the affected subgroup. The x-axis shows the false positive fraction, or 1-specificity [defined as (number of false positive results)/(number of true negative + number of false positive results)]. It is an index of specificity and is calculated entirely from the unaffected subgroup. Because the true and false positive fractions are calculated completely separately, using test results from two different subgroups, the ROC plot is independent of the prevalence of disease in the sample. Each point on the ROC plot represents a sensitivity/1-specificity pair corresponding to a particular decision threshold. A test with perfect discrimination (no overlap between the two distributions of results) has a ROC plot that passes through the upper left corner, where the true positive fraction is 1.0, or 100% (perfect sensitivity), and the false positive fraction is 0 (perfect specificity). The theoretical plot for a test with no discrimination (identical distributions of results for the two groups) is a 45° diagonal from the lower left corner to the upper right corner. Most plots fall between these two extremes. (If the ROC plot falls completely below the 45° diagonal, this is easily remedied by reversing the criterion for "positive" from "greater than" to "less than" or vice versa.) Qualitatively, the closer the plot is to the upper left corner, the higher the overall accuracy of the test.
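The two axis quantities defined in the preceding paragraph can be written compactly as follows. The symbols TP, FN, FP and TN (counts of true positive, false negative, false positive and true negative results) are shorthand introduced here for illustration and are not notation from the original text.

```latex
\[
\text{sensitivity} = \frac{TP}{TP + FN},
\qquad
1 - \text{specificity} = \frac{FP}{FP + TN}
\]
```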
One convenient way to quantify the diagnostic accuracy of a laboratory test is to express its performance by a single number. The most common global measure is the area under the ROC curve (AUC). By convention, this area is always ≥ 0.5 (if it is not, the decision rule can be reversed to make it so). Values range between 1.0 (the test values perfectly separate the two groups) and 0.5 (no apparent distributional difference between the test values of the two groups). The area depends not only on a particular part of the plot, such as the point closest to the diagonal or the sensitivity at 90% specificity, but on the entire plot. It is a quantitative, descriptive expression of how close the ROC plot is to the perfect one (area = 1.0).
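As an illustration of the ROC and AUC definitions above, the following minimal Python sketch computes the sensitivity/1-specificity pairs over all decision thresholds and integrates them with the trapezoidal rule. The function names, the toy scores and the toy labels are assumptions made for this example only; they are not data or code from the application.

```python
def roc_points(scores, labels):
    """Return (false positive fraction, sensitivity) pairs for every threshold.

    labels: 1 = affected subgroup, 0 = unaffected subgroup.
    A sample is called positive when its score is >= the threshold.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) points ordered by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical toy data: six samples, three per subgroup.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]

auc = area_under_curve(roc_points(scores, labels))
# Reversing the decision rule (negating the scores) mirrors the curve,
# so the two areas sum to 1.
flipped = area_under_curve(roc_points([-s for s in scores], labels))
print(auc, flipped)
```

On this toy data eight of the nine (affected, unaffected) pairs are ranked correctly, so the area is 8/9, and reversing the rule gives 1/9, which shows why an area below 0.5 can always be converted to one above 0.5.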
The application will now be described in further detail with reference to the drawings and examples. The following examples are only illustrative of the present application and are not intended to limit the scope of the application. The experimental procedure, in which specific conditions are not noted in the examples, is generally followed by conventional conditions.
Example 1 Classification model for plain and plateau populations
1. Population information
Plain Han population: Han1k_ht (Han Chinese living on the plain, from the Xinjiang region at an altitude of about 1 km; 182 people); han1k_YC (Han Chinese living on the plain, from the Xinjiang region at an altitude of about 1 km; 143 people)
Plateau Han population: Han4k_6m (Han Chinese who had lived on the plateau at an altitude of about 4 km for half a year; 63 people); Han4k (Han Chinese who had lived on the plateau at an altitude of about 4 km for more than one year; 30 people)
2. Experimental methods
1. Fecal sample collection and DNA extraction
Fecal samples were collected from the above populations, and DNA was extracted with a kit to obtain DNA samples.
2. Metagenome high throughput sequencing and analysis
Sequencing was performed on the Illumina HiSeq platform, yielding 5,933,464.13 Mbp of raw data (Raw Data; average data size 7,756.16 Mbp). After quality control, 5,885,567.3 Mbp of effective data (Clean Data; average data size 7,693.55 Mbp) were obtained, and single-sample assembly and mixed assembly yielded 97,165,177,458 bp of Scaftigs. Gene prediction was performed on each sample and on the mixed assembly with MetaGeneMark, giving 123,459,411 open reading frames (ORFs) (161,385 on average). After redundancy removal, 6,727,989 ORFs with a total length of 4,584.45 Mbp remained, of which 3,686,582 (54.79%) were complete genes. The non-redundant gene set was aligned against the MicroNR library with blastp, and species were annotated with the LCA algorithm; the proportions annotated to the genus and phylum levels were 65.11% and 86.00%, respectively.
(1) Sequencing data pretreatment
Summary of quality control results: the total amount of sequencing data was 5,933,464.13 Mbp, with an average of 7,756.16 Mbp; after quality control, the total and average amounts were 5,885,567.3 Mbp and 7,693.55 Mbp, respectively, giving an effective data rate of 99.19%.
The specific steps of data preprocessing are as follows:
1) Removing reads in which the number of low-quality bases (quality value <= 38) exceeds a set threshold (40 bp by default);
2) Removing reads in which the number of N bases reaches a set threshold (10 bp by default);
3) Removing reads whose overlap with the adapter exceeds a set threshold (15 bp by default);
4) If the sample is contaminated by host DNA, aligning the reads against a host database and filtering out reads that may derive from the host.
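The first three preprocessing filters can be sketched as a single per-read predicate. This is a simplified illustration, not the pipeline actually used: the function names, the representation of qualities as a list of integers, the way adapter overlap is measured (adapter found inside the read, or an adapter prefix at the 3' end of the read), and the boundary handling ("exceeds" read as >, "reaches" read as >=) are all assumptions. Host filtering (step 4) needs an aligner and a host database and is not shown.

```python
def adapter_overlap(seq, adapter):
    """Overlap length between a read and an adapter (simplified assumption).

    Returns the full adapter length if the adapter occurs inside the read,
    otherwise the longest adapter prefix that ends the read.
    """
    if not adapter:
        return 0
    if adapter in seq:
        return len(adapter)
    for k in range(min(len(seq), len(adapter)), 0, -1):
        if seq.endswith(adapter[:k]):
            return k
    return 0


def keep_read(seq, quals, adapter="",
              low_qual_cutoff=38, max_low_qual=40,
              n_limit=10, max_adapter_overlap=15):
    """Return True if the read passes filters 1) to 3) with the default thresholds."""
    low_qual_bases = sum(1 for q in quals if q <= low_qual_cutoff)
    if low_qual_bases > max_low_qual:          # filter 1: too many low-quality bases
        return False
    if seq.upper().count("N") >= n_limit:      # filter 2: too many N bases
        return False
    if adapter_overlap(seq, adapter) > max_adapter_overlap:  # filter 3: adapter
        return False
    return True


# Hypothetical example reads (100 bp) and a hypothetical adapter sequence.
ADAPTER = "AGATCGGAAGAGCACACGTC"
good = keep_read("ACGT" * 25, [40] * 100, ADAPTER)
low_quality = keep_read("ACGT" * 25, [30] * 41 + [40] * 59, ADAPTER)
many_n = keep_read("N" * 10 + "A" * 90, [40] * 100, ADAPTER)
with_adapter = keep_read("ACGT" * 20 + ADAPTER[:16], [40] * 96, ADAPTER)
print(good, low_quality, many_n, with_adapter)
```

Keeping the thresholds as keyword arguments mirrors the "by default" wording of the steps: each limit can be changed without touching the filter logic.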
(2) Metagenome assembly
Summary of assembly results: assembly yielded Scaffolds totaling 105,500,331,957 bp, with an average length of 1,934.98 bp, a maximum length of 1,733,071 bp, an N50 of 4,517.84 bp and an N90 of 692.50 bp. Breaking the Scaffolds at N positions produced Scaftigs totaling 97,165,177,458 bp, with an average length of 1,868 bp, an N50 of 4,139 bp and an N90 of 678 bp. The specific steps of metagenome assembly are as follows:
1) After preprocessing, the Clean Data is assembled with the SOAPdenovo assembly software;
2) For each single sample, one K-mer (55 by default) is first selected for assembly to obtain the assembly result of that sample;
3) The assembled Scaffolds are broken at N junctions to obtain N-free sequence fragments called Scaftigs (i.e., continuous sequences within Scaffolds);
4) The quality-controlled Clean Data of each sample is aligned to the Scaftigs assembled from that sample with Bowtie2 to obtain the PE reads that were not used;
5) The unused reads of all samples are pooled for mixed assembly; considering the computational cost and time of assembly, only one K-mer is used (default -K 55), and the other assembly parameters are the same as for single samples;
6) The Scaffolds from the mixed assembly are broken at N junctions to obtain N-free Scaftigs;
7) For the Scaftigs from both single-sample and mixed assembly, fragments shorter than 500 bp are filtered out before statistical analysis and subsequent gene prediction.
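Two of the operations in the assembly steps lend themselves to a short illustration: breaking a Scaffold at runs of N to obtain Scaftigs (with the 500 bp length filter), and computing an N50-style statistic from a list of lengths. The sketch below is illustrative only; the function names and the toy sequences are assumptions, and it does not reproduce SOAPdenovo or Bowtie2.

```python
import re


def scaffold_to_scaftigs(scaffold, min_len=500):
    """Break a scaffold at runs of N and keep N-free pieces of at least min_len bp."""
    pieces = re.split(r"[Nn]+", scaffold)
    return [p for p in pieces if len(p) >= min_len]


def n_stat(lengths, fraction=0.5):
    """N50-style statistic (fraction=0.5 gives N50, 0.9 gives N90).

    Sequences are sorted from longest to shortest; the result is the length
    of the sequence at which the running total first reaches `fraction`
    of the total assembled bases.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length
    return 0


# Hypothetical scaffold: three N-free stretches separated by N runs.
scaffold = "A" * 600 + "N" * 10 + "C" * 300 + "N" + "G" * 500
scaftigs = scaffold_to_scaftigs(scaffold)
scaftig_lengths = [len(s) for s in scaftigs]   # the 300 bp piece is dropped
n50 = n_stat([100, 200, 300, 400])
print(scaftig_lengths, n50)
```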
(3) Gene prediction and abundance analysis
Summary of gene prediction results: a total of 123,459,411 ORFs were predicted, with an average of 161,385 ORFs per sample; after redundancy elimination, 6,727,989 ORFs with total length of 4,584.45Mbp, average length of 681.4bp and GC content of 45.77% are obtained, wherein 3,686,582 complete genes account for 54.79% of the total number of all non-redundant genes.
Basic steps of gene prediction:
1) Starting from the Scaftigs (>= 500 bp) of each sample and of the mixed assembly, ORF (Open Reading Frame) prediction and filtering are performed with MetaGeneMark;
2) The ORF predictions from each sample and from the mixed assembly are de-replicated with the CD-HIT software;
3) The Clean Data of each sample is aligned to the de-replicated representative genes, and the number of reads mapped to each gene in each sample is counted;
4) Genes supported by no more than 2 reads in every sample are filtered out, giving the gene catalog (Unigenes) finally used for subsequent analysis;
5) The abundance of each gene in each sample is calculated from the number of mapped reads and the gene length;
6) Based on the abundance of each gene of the gene catalog in each sample, basic statistics, core-pan gene analysis, between-sample correlation analysis and Venn diagram analysis of gene numbers are carried out.
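The text states only that gene abundance is calculated from the number of mapped reads and the gene length, without giving a formula. The sketch below assumes one commonly used normalization, in which each gene's read count is divided by its length and the result is scaled so that the values of a sample sum to 1; this formula, the function names and the toy numbers are assumptions for illustration. The read-support filter follows the reading that a gene is kept when more than 2 reads support it in at least one sample.

```python
def filter_low_support(counts_by_sample, min_reads=3):
    """Return the genes with at least `min_reads` mapped reads in at least one sample."""
    kept = set()
    for counts in counts_by_sample.values():
        kept.update(gene for gene, n in counts.items() if n >= min_reads)
    return kept


def gene_relative_abundance(read_counts, gene_lengths):
    """Length-normalized relative abundance of each gene in one sample.

    Assumed formula: (reads / length) for the gene, divided by the sum of
    (reads / length) over all genes in the sample.
    """
    coverage = {g: read_counts[g] / gene_lengths[g] for g in read_counts}
    total = sum(coverage.values())
    if total == 0:
        return {g: 0.0 for g in coverage}
    return {g: c / total for g, c in coverage.items()}


# Hypothetical toy data: two genes, two samples.
counts_by_sample = {
    "sample1": {"gene1": 3, "gene2": 1},
    "sample2": {"gene1": 0, "gene2": 2},
}
kept_genes = filter_low_support(counts_by_sample)

abundance = gene_relative_abundance(
    {"gene1": 10, "gene2": 10}, {"gene1": 1000, "gene2": 500}
)
print(kept_genes, abundance)
```

In the toy data both genes receive the same number of reads, but the shorter gene ends up with twice the abundance, which is the reason for normalizing by gene length.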
(4) Species annotation
Species annotation results overview: of the 6,727,989 predicted genes remaining after redundancy removal, 5,317,849 ORFs (79.04%) could be annotated to the NR database. Among the ORFs annotated to the NR database, the proportions annotated at the kingdom, phylum, class, order, family, genus and species levels were 88.82%, 86.00%, 81.43%, 80.77%, 69.52%, 65.11% and 49.00%, respectively. The dominant phyla mainly include Firmicutes, Proteobacteria, Bacteroidetes, etc. The phyla with significant differences between groups are mainly k__Bacteria;p__Actinobacteria, k__Bacteria;p__Chlamydiae, k__Archaea;p__Euryarchaeota, etc.
The basic steps of annotation:
1) Unigenes were aligned with DIAMOND software (blastp, e-value ≤ 1e-5) against the sequences of Bacteria, Fungi, Archaea and Viruses extracted from NCBI's NR database (Version: 2018.01);
2) Filtering the alignment results: for the alignment results of each sequence, those with e-value ≤ 10 × the minimum e-value were selected for subsequent analysis;
3) After filtering, the LCA algorithm (as applied in the taxonomic classification of MEGAN software) was used, and the taxonomic level before the first branch was taken as the species annotation of each sequence;
4) From the LCA annotation results and the gene abundance table, the abundance and gene number of each sample at each taxonomic level (kingdom, phylum, class, order, family, genus, species) were obtained;
5) From the abundance tables at each taxonomic level (kingdom, phylum, class, order, family, genus, species), Krona analysis, relative abundance profiling, abundance clustering heat maps, PCA and NMDS dimension-reduction analysis, Anosim inter-group (intra-group) difference analysis, and Metastat and LEfSe multivariate statistical analysis of species differing between groups were performed.
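The hit filtering in step 2) and the LCA assignment in step 3) can be illustrated with a small sketch. It is a simplification under stated assumptions: each hit is represented as a hypothetical `(e-value, lineage)` pair, a lineage is a tuple of taxon names ordered from kingdom downwards, and the LCA is taken as the longest prefix shared by all retained lineages. It does not reproduce the MEGAN implementation.

```python
def filter_hits(hits, factor=10.0):
    """Keep hits whose e-value is at most `factor` times the best (minimum) e-value.

    hits: list of (evalue, lineage) pairs for one query sequence.
    Note: if the best e-value is 0.0, only hits with e-value 0.0 are kept.
    """
    if not hits:
        return []
    best = min(evalue for evalue, _ in hits)
    return [(evalue, lineage) for evalue, lineage in hits if evalue <= best * factor]


def lowest_common_ancestor(lineages):
    """Return the shared prefix of the lineages, i.e. the ranks before the first branch."""
    if not lineages:
        return ()
    shared = []
    for taxa_at_rank in zip(*lineages):
        if len(set(taxa_at_rank)) != 1:
            break  # first rank at which the hits disagree
        shared.append(taxa_at_rank[0])
    return tuple(shared)


# Toy hits for one sequence (values made up for illustration).
hits = [
    (1e-50, ("Bacteria", "Firmicutes", "Clostridia")),
    (5e-50, ("Bacteria", "Firmicutes", "Bacilli")),
    (1e-10, ("Archaea", "Euryarchaeota", "Methanobacteria")),
]
kept = filter_hits(hits)
annotation = lowest_common_ancestor([lineage for _, lineage in kept])
```

Here the third hit is discarded because its e-value is far above ten times the best one, and the two remaining hits agree down to the phylum level, so the sequence is annotated as Bacteria, Firmicutes.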
3. Construction of classification model
A machine learning classification model was established using the microbial species abundance table obtained by the above workflow.
Based on XGBoost (eXtreme Gradient Boosting), different numbers of gut microbial features were selected to classify the plain Han population and the plateau Han population, and the mean AUC (area under the ROC curve) was computed by ten-fold cross-validation. The optimal classification model finally screened contains 26 features: s_Bacteroides intestinalis CAG_564, s_Bifidobacterium bifidum CAG_234, s_Bifidobacterium longum, s_Bifidobacterium merycicum, s_Bifidobacterium subtile, s_Citrobacter sp. 36-4CPA, s_Citrobacter sp. BIDMC107, s_Coriobacterium glomerans, s_Eggerthella lenta, s_Eggerthella sp. YY7918, s_Enterobacter cloacae complex Hoffmann cluster IV, s_Enterobacter hormaechei, s_Enterobacter sp. BWH52, s_Enterobacter sp. MGH 25, s_Eubacterium hallii CAG_12, s_Eubacterium sp. 14-2, s_Klebsiella sp. 1_1_55, s_Kluyvera ascorbata, s_Kluyvera cryocrescens, s_Lactobacillus sanfranciscensis, s_Ruminococcus sp. SR1_5, s_Succinatimonas sp. CAG_777, s_Sutterella sp. 54_7, s_Tepidibacter formicigenes, s_396, s_Clostridiales bacterium VE-15.
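The evaluation scheme described here, a mean AUC over ten cross-validation folds, can be sketched in plain Python. This is a minimal sketch under assumptions: to stay self-contained it does not call XGBoost, and `centroid_scorer` is a hypothetical stand-in model; any function with the same `fit_predict(X_train, y_train, X_test)` shape, such as a wrapper around an XGBoost classifier, could be passed instead. The fold assignment and random seed are also illustrative choices, and the search over different numbers of features is not shown.

```python
import random


def auc(labels, scores):
    """Area under the ROC curve as the fraction of (positive, negative) pairs
    ranked correctly; tied pairs count one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds, spreading each class evenly."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds


def mean_cv_auc(X, y, fit_predict, k=10, seed=0):
    """Average the held-out AUC over k folds."""
    aucs = []
    for test_idx in stratified_folds(y, k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        scores = fit_predict(
            [X[i] for i in train_idx],
            [y[i] for i in train_idx],
            [X[i] for i in test_idx],
        )
        aucs.append(auc([y[i] for i in test_idx], scores))
    return sum(aucs) / len(aucs)


def centroid_scorer(X_train, y_train, X_test):
    """Hypothetical stand-in model (not XGBoost): a higher score means the
    sample lies closer to the class-1 centroid than to the class-0 centroid."""
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    c0 = centroid([x for x, label in zip(X_train, y_train) if label == 0])
    c1 = centroid([x for x, label in zip(X_train, y_train) if label == 1])
    return [dist2(x, c0) - dist2(x, c1) for x in X_test]


# Toy abundance table: 20 samples per class, two features (values made up).
X = [[i * 0.01, 1.0] for i in range(20)] + [[5 + i * 0.01, 1.0] for i in range(20)]
y = [0] * 20 + [1] * 20
cv_auc = mean_cv_auc(X, y, centroid_scorer, k=10)
```

Each fold needs samples of both classes for its AUC to be defined, which is why the folds are stratified by class.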
4. Experimental results
The model constructed based on the 26 features is the optimal model. FIG. 1 is a graph of the contribution results for each feature; FIG. 2 is a graph showing the correspondence between feature numbers and AUC values.
FIG. 3 is the ROC curve of the optimal model, AUC = 0.98 ± 0.02, p < 0.01, demonstrating that a model constructed using these microorganisms can accurately distinguish the plain Han population from the plateau Han population.
The preferred embodiments of the present application have been described in detail above with reference to the accompanying drawings, but the present application is not limited to the specific details of the above embodiments, and various simple modifications can be made to the technical solution of the present application within the scope of the technical concept of the present application, and all the simple modifications belong to the protection scope of the present application.
In addition, the specific features described in the above embodiments may be combined in any suitable manner, and in order to avoid unnecessary repetition, various possible combinations are not described further.
Moreover, any combination of the various embodiments of the application can be made without departing from the spirit of the application, which should also be considered as disclosed herein.

Claims (9)

1. A system for identifying or distinguishing people at different altitudes, comprising the following elements:
1) A detection unit: comprising a microorganism detection module;
2) An analysis unit: the abundance level of the microorganism detected by the detection unit is used as an input variable and is input into a classification model of people in different altitude areas for analysis;
3) An evaluation unit: outputting the probability that the individual corresponding to the sample belongs to the plain population or the plateau population;
the microorganism consists of s_Bacteroides intestinalis CAG_564, s_Bifidobacterium bifidum CAG_234, s_Bifidobacterium longum, s_Bifidobacterium merycicum, s_Bifidobacterium subtile, s_Citrobacter sp. 36-4CPA, s_Citrobacter sp. BIDMC107, s_Coriobacterium glomerans, s_Eggerthella lenta, s_Eggerthella sp. YY7918, s_Enterobacter cloacae complex Hoffmann cluster IV, s_Enterobacter hormaechei, s_Enterobacter sp. BWH52, s_Enterobacter sp. MGH 25, s_Eubacterium hallii CAG_12, s_Eubacterium sp. 14-2, s_Klebsiella sp. 1_1_55, s_Kluyvera ascorbata, s_Kluyvera cryocrescens, s_Lactobacillus sanfranciscensis, s_Ruminococcus sp. SR1_5, s_Succinatimonas sp. CAG_777, s_Sutterella sp. 54_7, s_Tepidibacter formicigenes, s_202-95, and s_3295.
2. Use of a microorganism in constructing a classification model of people at different altitudes, characterized in that the microorganism consists of s_Bacteroides intestinalis CAG_564, s_Bifidobacterium bifidum CAG_234, s_Bifidobacterium longum, s_Bifidobacterium merycicum, s_Bifidobacterium subtile, s_Citrobacter sp. 36-4CPA, s_Citrobacter sp. BIDMC107, s_Coriobacterium glomerans, s_Eggerthella lenta, s_Eggerthella sp. YY7918, s_Enterobacter cloacae complex Hoffmann cluster IV, s_Enterobacter hormaechei, s_Enterobacter sp. BWH52, s_Enterobacter sp. MGH 25, s_Eubacterium hallii CAG_12, s_Eubacterium sp. 14-2, s_Klebsiella sp. 1_1_55, s_Kluyvera ascorbata, s_Kluyvera cryocrescens, s_Lactobacillus sanfranciscensis, s_Ruminococcus sp. SR1_5, s_Succinatimonas sp. CAG_777, s_Sutterella sp. 54_7, s_Tepidibacter formicigenes, s_3295_956_gjgjg95, and s_3295_9515.
3. The use of claim 2, wherein the classification model is determined using one or more algorithms selected from the group consisting of: XGBoost, random forest, glmnet, cforest, classification and regression tree (CART), treebag, k-nearest neighbors, neural network, radial support vector machine, linear support vector machine, naive Bayes, and multi-layer perceptron.
4. Use of a composition in the preparation of a tool for identifying or distinguishing people at different altitudes, characterized in that the composition comprises reagents for measuring the abundance level of a microorganism consisting of s_Bacteroides intestinalis CAG_564, s_Bifidobacterium bifidum CAG_234, s_Bifidobacterium longum, s_Bifidobacterium merycicum, s_Bifidobacterium subtile, s_Citrobacter sp. 36-4CPA, s_Citrobacter sp. BIDMC107, s_Coriobacterium glomerans, s_Eggerthella lenta, s_Eggerthella sp. YY7918, s_Enterobacter cloacae complex Hoffmann cluster IV, s_Enterobacter hormaechei, s_Enterobacter sp. BWH52, s_Enterobacter sp. MGH 25, s_Eubacterium hallii CAG_12, s_Eubacterium sp. 14-2, s_Klebsiella sp. 1_1_55, s_Kluyvera ascorbata, s_Kluyvera cryocrescens, s_Lactobacillus sanfranciscensis, s_Ruminococcus sp. SR1_5, s_Succinatimonas sp. CAG_777, s_Sutterella sp. 54_7, s_Tepidibacter formicigenes, s_95_6, and s_95_95_95.
5. The use of claim 4, wherein the reagent comprises a reagent for measuring the abundance level of the microorganism by 16S rRNA sequencing, whole genome sequencing, quantitative polymerase chain reaction, PCR-pyrosequencing, fluorescence in situ hybridization, microarray, or PCR-ELISA.
6. The use of claim 4, wherein the reagent comprises a primer, a probe, an antisense oligonucleotide, an aptamer, or an antibody.
7. The use of claim 6, wherein the tool comprises a chip, a kit, a test strip or a high-throughput sequencing platform.
8. A method of identifying or distinguishing people at different altitudes, the method comprising detecting the abundance of a microorganism consisting of s_Bacteroides intestinalis CAG_564, s_Bifidobacterium bifidum CAG_234, s_Bifidobacterium longum, s_Bifidobacterium merycicum, s_Bifidobacterium subtile, s_Citrobacter sp. 36-4CPA, s_Citrobacter sp. BIDMC107, s_Coriobacterium glomerans, s_Eggerthella lenta, s_Eggerthella sp. YY7918, s_Enterobacter cloacae complex Hoffmann cluster IV, s_Enterobacter hormaechei, s_Enterobacter sp. BWH52, s_Enterobacter sp. MGH 25, s_Eubacterium hallii CAG_12, s_Eubacterium sp. 14-2, s_Klebsiella sp. 1_1_55, s_Kluyvera ascorbata, s_Kluyvera cryocrescens, s_Lactobacillus sanfranciscensis, s_Ruminococcus sp. SR1_5, s_Succinatimonas sp. CAG_777, s_Sutterella sp. 54_7, s_Tepidibacter formicigenes, s_95_95, and s_95_95_95_95.
9. The method of claim 8, wherein the different altitude regions comprise a plateau region and a plain region.
CN202210221736.6A 2022-03-09 2022-03-09 Model for identifying or distinguishing people at different altitudes and application thereof Active CN114566224B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210221736.6A CN114566224B (en) 2022-03-09 2022-03-09 Model for identifying or distinguishing people at different altitudes and application thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210221736.6A CN114566224B (en) 2022-03-09 2022-03-09 Model for identifying or distinguishing people at different altitudes and application thereof

Publications (2)

Publication Number Publication Date
CN114566224A CN114566224A (en) 2022-05-31
CN114566224B true CN114566224B (en) 2023-08-11

Family

ID=81717079

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210221736.6A Active CN114566224B (en) 2022-03-09 2022-03-09 Model for identifying or distinguishing people at different altitudes and application thereof

Country Status (1)

Country Link
CN (1) CN114566224B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115274123B (en) * 2022-07-15 2023-03-24 Chinese PLA General Hospital Physical ability level prediction method, system, device, medium, and program product

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109652493A (en) * 2019-01-16 2019-04-19 Chinese PLA General Hospital Application of Oscillibacter in identifying and/or distinguishing individuals of different ethnic groups
CN109652570A (en) * 2019-01-16 2019-04-19 Chinese PLA General Hospital Application of microorganisms in identifying and/or distinguishing individuals of different ethnic groups
CN109825561A (en) * 2019-01-16 2019-05-31 Chinese PLA General Hospital Application of Paraprevotella in identifying and/or distinguishing individuals of different ethnic groups
CN109913525A (en) * 2019-02-13 2019-06-21 Chinese PLA General Hospital Application of Butyrivibrio in identifying and/or distinguishing plateau-region Han and Tibetan populations
CN109913526A (en) * 2019-02-13 2019-06-21 Chinese PLA General Hospital Application of microorganisms in identifying and/or distinguishing individuals of different ethnic groups

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10428370B2 (en) * 2016-09-15 2019-10-01 Sun Genomics, Inc. Universal method for extracting nucleic acid molecules from a diverse population of one or more types of microbes in a sample
CA3116010A1 (en) * 2018-10-26 2020-04-30 Sun Genomics Inc. Universal method for extracting nucleic acid molecules from a diverse population of microbes
EP3855347B1 (en) * 2020-01-21 2022-06-22 Axis AB Distinguishing human beings in a crowd in an image

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
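Exploring the interactions among gut microbiota, diet and human health with a microbial genome-centered approach; Wu Guojun; China Doctoral Dissertations Full-text Database, Basic Sciences (No. 01); A006-248 *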

Also Published As

Publication number Publication date
CN114566224A (en) 2022-05-31

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant