CN114566224A - Model for identifying or distinguishing different altitude crowds and application thereof - Google Patents

Model for identifying or distinguishing different altitude crowds and application thereof

Publication number
CN114566224A
Authority
CN
China
Prior art keywords
enterobacter
microbacterium
bifidobacterium
klebsiella
lactobacillus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210221736.6A
Other languages
Chinese (zh)
Other versions
CN114566224B (en)
Inventor
韩洋
何昆仑
姚咏明
田亚平
赵晓静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chinese PLA General Hospital
Original Assignee
Chinese PLA General Hospital
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chinese PLA General Hospital
Priority to CN202210221736.6A
Publication of CN114566224A
Application granted
Publication of CN114566224B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6888 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms
    • C12Q1/689 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms for bacteria
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/30 Unsupervised data analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Organic Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Physics & Mathematics (AREA)
  • Analytical Chemistry (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Biophysics (AREA)
  • Biotechnology (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Data Mining & Analysis (AREA)
  • Genetics & Genomics (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Public Health (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Epidemiology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Immunology (AREA)
  • Microbiology (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioethics (AREA)
  • Artificial Intelligence (AREA)
  • Biochemistry (AREA)
  • Pathology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a model for identifying or distinguishing populations living at different altitudes and its application, and in particular provides the use of microorganisms in constructing a classification model for populations at different altitudes. The invention discloses 26 microorganisms and, for the first time, their use in identifying or distinguishing populations at different altitudes.

Description

Model for identifying or distinguishing different altitude crowds and application thereof
Technical Field
The invention belongs to the field of biomedicine, and in particular relates to a model for identifying or distinguishing populations living at different altitudes and its application.
Background
Chronic mountain sickness (CMS), also called chronic altitude sickness, is a common altitude illness. Under long-term exposure to an external hypoxic environment, the body regulates itself continuously but still fails to reach a new physiological balance, which results in a syndrome of insufficient adaptation to the high-altitude environment. A plateau usually refers to an area at an altitude above 2500 m that produces obvious biological effects (reactions of the organism). The plateau environment strongly affects human health, for both native residents and migrants (especially migrants); chronic altitude diseases occur mostly in such areas, and hypoxia in the external environment is their most important cause. The Qinghai-Tibet Plateau in China has pronounced plateau environmental characteristics, and in recent years more and more cadres and staff of enterprises and public institutions from inland areas have been assigned to take part in construction in the region. Because the plateau environment differs greatly from that of the inland plains, the health problems of these workers are increasingly prominent. Strengthening research on chronic altitude stress is therefore of great guiding significance for accelerating plateau construction.
At present, acute-phase altitude stress has been studied extensively, but chronic-phase altitude stress still lacks systematic, in-depth research. As a result, many long-term rescue workers develop high-altitude heart disease of varying severity after completing their assignments, such as cardiac hypertrophy, cardiac enlargement, ischemia, hypoxia, and cardiac overload. These people also develop high-altitude erythrocytosis of varying severity, presenting with increased blood viscosity, slowed blood flow, thrombosis, myocardial infarction, cerebral thrombosis, retinal vein thrombosis, and similar conditions. They may also develop disorders of nervous-system energy metabolism, which cause cerebral edema, intracranial hypertension, and focal degeneration or necrosis of brain tissue.
Studying the intestinal flora structure of people in high-altitude regions and analyzing the risk factors for chronic-phase altitude stress is therefore important for preventing chronic-phase altitude stress.
Disclosure of Invention
The invention aims to provide a classification model for plain and plateau populations. To achieve this aim, the invention adopts the following technical solutions:
In one aspect, the invention provides a system for identifying or distinguishing populations from different altitude areas, comprising the following units:
1) a detection unit, which comprises a microorganism detection module;
2) an analysis unit, which takes the abundance levels of the microorganisms detected by the detection unit as input variables and feeds them into a classification model of populations from different altitude areas for analysis;
3) an evaluation unit, which outputs the probability that the individual corresponding to the sample belongs to the plain population or the plateau population;
the microorganism is selected from s _ bacteria intestinalis CAG _564, s _ Bifidobacterium duplex CAG _234, s _ Bifidobacterium longum, s _ Bifidobacterium mericum, s _ Bifidobacterium sub, s _ Clostridium sp.36-4CPA, s _ Clostridium sp.BIC107, s _ Coriobacterium glomerans, s _ Eggerthella length, s _ Eggerthella sp.Y7918, s _ Enterobacter cloacae _ filler _ Hoffmann IV, s _ Enterobacter Hoffmann H52, s _ Enterobacter sp.MGH25, S _ Enterobacter sp.12, S _ Enterobacter hoffm, S _ Enterobacter sp.BWH52, S _ Enterobacter sp.MGH25, S _ Lactobacillus sp.12, C _ bacteria sp.1, C _ bacterial sp.1, S _ Lactobacillus sp.777, C _ bacterial sp.1, S _ Lactobacillus sp.1, S _ Bacillus sp.202, S _ Bacillus _ strain.
In another aspect, the present invention provides the use of microorganisms selected from the group consisting of s _ bacteria intestinalis CAG _564, s _ Bifidobacterium duplex CAG _234, s _ Bifidobacterium longum, s _ Bifidobacterium mericum, s _ Bifidobacterium subintium, s _ Clostridium sp.36-4CPA, s _ Clostridium sp.BIDMC107, s _ Corynebacterium glans, s _ Egghela sp.YYYYYY 7918, s _ Enterobacter claceae complex _ Hoffmann cler IV, s _ Enterobacter faecalhei, s _ Enterobacter sp.52, S _ Enterobacter claceae complex _ Hoffmann cler IV, S _ Enterobacter sp.25, Klebsiella _ Microbacterium sp.1, Klebsiella _ Microbacterium sp.777, Klebsiella _ Microbacterium sp.22, Klebsiella _ Microbacterium sp.20, Klebsiella _ Microbacterium sp.1.20.
In one embodiment, the classification model is determined using one or more algorithms selected from the group consisting of: XGBoost (xgb), random forest (RF), glmnet, cforest, classification and regression tree (CART), treebag, k-nearest neighbors (kNN), neural network (nnet), radial-kernel support vector machine (SVM-radial), linear support vector machine (SVM-linear), naive Bayes (NB), and multilayer perceptron (mlp).
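As a rough illustration of how an analysis unit and an evaluation unit could fit together, the sketch below uses k-nearest neighbors (one of the algorithms listed above) to turn a vector of abundance levels into a probability of belonging to the plateau population. The function name, the toy abundance values, and the choice of k are hypothetical and are not taken from the invention.

```python
import math

def knn_plateau_probability(train_x, train_y, sample, k=3):
    """Return the fraction of the k nearest training samples labelled 'plateau'.

    train_x: list of abundance vectors; train_y: matching 'plain'/'plateau' labels.
    """
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the number of training samples")
    # Sort training samples by Euclidean distance to the query sample.
    by_distance = sorted(
        (math.dist(x, sample), label) for x, label in zip(train_x, train_y)
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return nearest_labels.count("plateau") / k

# Hypothetical relative abundances of two microorganisms (illustrative values only).
train_x = [(0.10, 0.02), (0.12, 0.03), (0.11, 0.01),
           (0.02, 0.20), (0.03, 0.18), (0.01, 0.22)]
train_y = ["plain", "plain", "plain", "plateau", "plateau", "plateau"]

p_plateau = knn_plateau_probability(train_x, train_y, (0.02, 0.19), k=3)
p_plain_like = knn_plateau_probability(train_x, train_y, (0.11, 0.02), k=3)
```

A real model would be trained on measured abundances of the selected microorganisms and validated on held-out samples; this sketch only shows the input and output shape.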
In another aspect, the present invention provides a composition for identifying or differentiating people of different altitudes, the composition comprising an agent for measuring the abundance level of a microorganism selected from the group consisting of s _ bacteria intestinalis CAG _564, s _ Bifidobacterium duplex CAG _234, s _ Bifidobacterium longum, s _ Bifidobacterium lactium, s _ Clostridium subtile, s _ Clostridium sp.36-4CPA, s _ Clostridium sp.DMC107, s _ Corynebacterium globomerans, s _ Escherichia coli 7918, s _ Enterobacter compact _ Hoffmann purifier IV, s _ Enterobacter sp.52, Klebs _ 20, Klebs _ Microbacterium sp.7, Klebs _ Bacillus sp.7, Klebs _ Microbacterium sp.7, Klebs _ Bacillus sp.5, Klebs _ Bacillus sp.7 One or more of s _ Clostridium bacteria VE 202-15.
In one embodiment, the reagent comprises a reagent for measuring the abundance level of a microorganism by 16S rRNA sequencing, whole genome sequencing, quantitative polymerase chain reaction, PCR-pyrosequencing, fluorescence in situ hybridization, microarray, or PCR-ELISA.
In one embodiment, the agent comprises a primer, a probe, an antisense oligonucleotide, an aptamer, or an antibody.
According to a further aspect of the present invention, there is provided the use of the composition described above in the manufacture of a tool for identifying or distinguishing populations at different altitudes.
In one embodiment, the tool comprises a chip, a kit, a test strip, or a high-throughput sequencing platform.
In another aspect of the present invention, there is provided a method for identifying or differentiating populations at different altitudes, said method comprising detecting the abundance of microorganisms, the microorganism is selected from s _ bacteria intestinalis CAG _564, s _ Bifidobacterium duplex CAG _234, s _ Bifidobacterium longum, s _ Bifidobacterium mericum, s _ Bifidobacterium sub, s _ Clostridium sp.36-4CPA, s _ Clostridium sp.BIC107, s _ Coriobacterium glomerans, s _ Eggerthella length, s _ Eggerthella sp.Y7918, s _ Enterobacter cloacae _ filler _ Hoffmann IV, s _ Enterobacter Hoffmann H52, s _ Enterobacter sp.MGH25, S _ Enterobacter sp.12, S _ Enterobacter hoffm, S _ Enterobacter sp.BWH52, S _ Enterobacter sp.MGH25, S _ Lactobacillus sp.12, C _ bacteria sp.1, C _ bacterial sp.1, S _ Lactobacillus sp.777, C _ bacterial sp.1, S _ Lactobacillus sp.1, S _ Bacillus sp.202, S _ Bacillus _ strain.
In one embodiment, the different altitude areas include plateau areas and plain areas.
Drawings
FIG. 1 is a graph of the contribution value results for each feature;
FIG. 2 is a graph of feature numbers versus AUC values;
FIG. 3 is an ROC curve of the optimal model.
Detailed Description
Hereinafter, the present invention will be described in detail by way of examples thereof with reference to the accompanying drawings. However, the following examples are provided as illustrations of the present invention, and when it is judged that a detailed description of a technique or a structure known to those of ordinary skill in the art to which the present invention pertains may unnecessarily obscure the gist of the present invention, a detailed description thereof may be omitted, and the present invention is not limited thereto. The present invention can be variously modified and applied within the scope of the following claims and equivalents to be explained thereby.
Also, the terms used in the present specification are terms used to appropriately express preferred embodiments of the present invention, and may vary according to the intention of a user or operator, the convention in the art to which the present invention pertains, and the like. Therefore, these terms should be defined based on the contents throughout the specification. In the present invention, the term "includes" or "including" a certain component in a certain portion is not intended to exclude another component but may include another component unless specifically stated to the contrary.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Any methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present invention, but the preferred materials and methods are described herein.
To construct a classification model for populations from different altitude areas, the invention collected samples from Han Chinese populations living on the plain and on the plateau, performed sequencing and bioinformatics analysis, and screened out an optimal classification model based on the 26 microorganisms. This is the first time that these 26 microorganisms have been shown to be usable as classification factors for populations at different altitudes.
In the present invention, the term "abundance difference" means that the level of a microorganism in a population at a different altitude is higher or lower than in a control group.
In the present invention, any method known in the art can be used to detect a microbial marker or to determine its level. These methods include, but are not limited to, sequence amplification using primers and immunological methods based on antigen-antibody reactions. Sequence amplification using primers may be, for example, polymerase chain reaction (PCR), reverse transcription PCR (RT-PCR), multiplex PCR, touchdown PCR, hot-start PCR, nested PCR, booster PCR, real-time PCR, differential display PCR, rapid amplification of cDNA ends (RACE), inverse PCR, vectorette PCR, thermal asymmetric interlaced PCR (TAIL-PCR), ligase chain reaction, repair chain reaction, transcription-mediated amplification, self-sustained sequence replication, or selective amplification of a target base sequence. Immunological methods based on antigen-antibody reactions may be, for example, western blotting, enzyme-linked immunosorbent assay (ELISA), radioimmunoassay, radial immunodiffusion, Ouchterlony immunodiffusion, rocket immunoelectrophoresis, immunohistochemical staining, immunoprecipitation assay, complement fixation assay, fluorescence-activated cell sorting (FACS), or protein chip; the scope of the present invention is not limited to these.
In the present invention, the reagent for measuring the abundance level of the microorganism may be a primer, a probe, an antisense oligonucleotide, an aptamer, or an antibody.
The term "primer" refers to a nucleic acid sequence of 7 to 50 nucleotides that can form base pairs complementary to a template strand and serve as a starting point for replicating the template strand. Primers are generally synthesized, but naturally occurring nucleic acids may also be used. The primer sequence need not be completely identical to the template sequence, provided that it is sufficiently complementary to hybridize with the template. Additional features that do not alter the basic properties of the primer may be incorporated. Examples of such features include, but are not limited to, methylation, capping, substitution of one or more nucleotides with an analog, and modification between nucleotides.
The term "hybridization" refers to the annealing of two complementary nucleic acid strands to one another under conditions of appropriate stringency. Hybridization is generally carried out using nucleic acid molecules of probe length. Nucleic acid hybridization techniques are well known in the art. Those skilled in the art know how to estimate and adjust the stringency of hybridization conditions such that sequences with at least the desired degree of complementarity will stably hybridize, while sequences with lower complementarity will not stably hybridize.
The term "probe" refers to a molecule that binds to a specific sequence or subsequence or other portion of another molecule. Unless otherwise indicated, the term "probe" generally refers to a polynucleotide probe that is capable of binding to another polynucleotide (often referred to as a "target polynucleotide") by complementary base pairing. Depending on the stringency of the hybridization conditions, a probe can bind to a target polynucleotide that lacks complete sequence complementarity to the probe. The probe may be labeled directly or indirectly, and includes within its scope a primer. Hybridization formats include, but are not limited to: solution phase, solid phase, mixed phase or in situ hybridization assays.
The term "oligonucleotide" refers to a short polymer composed of deoxyribonucleotides, ribonucleotides, or any combination thereof. An oligonucleotide is typically between 10 and about 100 nucleotides long, preferably 15 to 70 nucleotides, and most typically 20 to 26 nucleotides. Oligonucleotides may be used as primers or probes.
The term "aptamer" refers to a ribonucleic acid or single-stranded deoxyribonucleic acid that folds, through hydrogen bonding between bases within the strand, into a stable secondary or tertiary structure such as a hairpin, stem-loop, pseudoknot, pocket, bulge loop, or G-quadruplex, and that binds its target with high affinity and specificity through a matching spatial structure.
In the present invention, the term "antibody" is used in the broadest sense and specifically covers, for example, monoclonal antibodies, polyclonal antibodies, antibodies with polyepitopic specificity, single chain antibodies, multispecific antibodies and antibody fragments. Such antibodies can be chimeric, humanized, human and synthetic.
The area under the receiver operating characteristic curve (AUC) is an indicator of the performance or accuracy of a diagnostic procedure. The accuracy of a diagnostic method is best described by its receiver operating characteristic (ROC). An ROC plot is a line graph of all sensitivity/specificity pairs obtained by continuously varying the decision threshold across the entire range of observed data.
The clinical performance of a laboratory test depends on its diagnostic accuracy, or the ability to correctly classify a subject into a clinically relevant subgroup. Diagnostic accuracy measures the ability to correctly discriminate between two different conditions of the subject under investigation.
In each case, the ROC plot depicts the overlap between the two distributions by plotting sensitivity against 1 - specificity over the entire range of decision thresholds. The y-axis shows sensitivity, or the true positive fraction, defined as (number of true positive test results)/(number of true positive + number of false negative test results). This is also referred to as positivity in the presence of a disease or condition, and it is calculated from the affected subgroup only. The x-axis shows the false positive fraction, or 1 - specificity, defined as (number of false positive results)/(number of true negative + number of false positive results). It is an indicator of specificity and is calculated entirely from the unaffected subgroup. Because the true positive and false positive fractions are calculated separately, using test results from two different subgroups, the ROC plot is independent of the prevalence of disease in the sample. Each point on the ROC plot represents a sensitivity/(1 - specificity) pair corresponding to a particular decision threshold. A test with perfect discrimination (no overlap between the two distributions of results) has an ROC plot that passes through the upper left corner, where the true positive fraction is 1.0, or 100% (perfect sensitivity), and the false positive fraction is 0 (perfect specificity). The theoretical plot for a test with no discrimination (identical distributions of results for the two groups) is a 45° diagonal from the lower left corner to the upper right corner. Most plots fall between these two extremes. (If the ROC plot falls well below the 45° diagonal, this is easily corrected by reversing the criterion for "positive" from "greater than" to "less than" or vice versa.) Qualitatively, the closer the plot is to the upper left corner, the higher the overall accuracy of the test.
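The construction described above can be sketched in a few lines: for each decision threshold, count the true positives in the affected subgroup and the false positives in the unaffected subgroup, then convert both to fractions. The function name and the toy scores below are illustrative assumptions, and a sample is treated as positive when its score is at or above the threshold.

```python
def roc_points(scores, labels):
    """Return (1 - specificity, sensitivity) pairs, one per decision threshold.

    labels: 1 = affected, 0 = unaffected.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both affected and unaffected samples")
    # Threshold above every score: nothing is called positive.
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        # sensitivity = TP / (TP + FN); false positive fraction = FP / (TN + FP)
        points.append((fp / n_neg, tp / n_pos))
    return points

# Toy example (hypothetical scores, not data from the invention).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
```

Because each fraction is normalized within its own subgroup, changing the ratio of affected to unaffected samples does not move the points, which is the prevalence independence noted in the text.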
One convenient way to quantify the diagnostic accuracy of a laboratory test is to express its performance as a single number. The most common global measure is the area under the ROC curve (AUC). By convention, this area is always ≥ 0.5 (if it is not, the decision rule can be reversed to make it so). Values range between 1.0 (the test values of the two groups are perfectly separated) and 0.5 (no apparent difference between the distributions of the two groups' test values). The area does not depend only on a particular part of the plot, such as the point closest to the diagonal or the sensitivity at 90% specificity, but on the entire plot. It is a quantitative, descriptive expression of how close the ROC plot is to the perfect one (area = 1.0).
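One way to compute this area without drawing the curve is the rank-based identity: the AUC equals the probability that a randomly chosen affected sample scores higher than a randomly chosen unaffected one, with ties counted as one half. The sketch below uses that identity; the function names and toy values are assumptions made for illustration.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need both affected and unaffected samples")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def auc_by_convention(scores, labels):
    """Report the area as >= 0.5 by reversing the decision rule when needed."""
    area = auc(scores, labels)
    return max(area, 1.0 - area)

# Toy example (hypothetical scores, not data from the invention).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
area = auc(scores, labels)                       # 8 of 9 pairs ranked correctly
flipped = [1 - y for y in labels]                # same scores, opposite labelling
area_flipped = auc(scores, flipped)              # falls below 0.5
area_reported = auc_by_convention(scores, flipped)
```

The pairwise loop is quadratic in the number of samples, which is fine for a sketch; sorting-based rank formulas give the same value more efficiently.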
The present invention will be described in further detail with reference to the accompanying drawings and examples. The following examples are intended to illustrate the invention only and are not intended to limit the scope of the invention. The experimental methods in the examples, in which specific conditions are not specified, are generally carried out under conventional conditions.
Example 1 Classification model of plain population and plateau population
I. Population information
Plain Han population: Han1k_HT (Han Chinese living on the plain, from Xinjiang at an altitude of about 1 km, 182 people); Han1k_YC (Han Chinese living on the plain, from the Xinjiang area at an altitude of about 1 km, 143 people)
Plateau Han population: Han4k_6m (Han Chinese who had lived for half a year on the plateau at an altitude of about 4 km, 63 people); Han4k (Han Chinese who had lived for more than one year on the plateau at an altitude of about 4 km, 30 people)
II. Experimental methods
1. Fecal sample collection and DNA extraction
After fecal samples were collected from the subjects, DNA was extracted with a kit to obtain the DNA samples.
2. Metagenome high-throughput sequencing and analysis
Sequencing was performed on the Illumina HiSeq platform, yielding 5,933,464.13 Mbp of raw data (Raw Data; 7,756.16 Mbp on average). Quality control gave 5,885,567.3 Mbp of effective data (Clean Data; 7,693.55 Mbp on average), and single-sample assembly plus mixed assembly gave 97,165,177,458 bp of Scaftigs. Gene prediction was performed on each sample and on the mixed-assembly result with MetaGeneMark, yielding 123,459,411 open reading frames (ORFs; 161,385 on average). Removing redundancy left 6,727,989 ORFs with a total length of 4,584.45 Mbp, of which 3,686,582 (54.79%) were complete genes. The non-redundant gene set was aligned against the MicroNR database with blastp, and species were annotated with the LCA algorithm; the proportions annotated to the genus and phylum levels were 65.11% and 86.00%, respectively.
(1) Sequencing data preprocessing
The quality control results are summarized as follows: the total raw sequencing data amounted to 5,933,464.13 Mbp (7,756.16 Mbp on average); after quality control, the total and average data amounts were 5,885,567.3 Mbp and 7,693.55 Mbp, respectively, giving an effective data rate of 99.19%.
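The effective data rate can be checked directly from the two totals, assuming it is defined as clean data divided by raw data and reading the raw total as 5,933,464.13 Mbp (both are assumptions; the text does not spell out the definition).

```python
# Totals as reported in the quality control summary (Mbp).
raw_total_mbp = 5_933_464.13    # raw data, read as 5,933,464.13 (assumption)
clean_total_mbp = 5_885_567.3   # clean data after quality control

# Assumed definition: effective rate = clean data / raw data.
effective_rate = clean_total_mbp / raw_total_mbp
effective_rate_percent = round(effective_rate * 100, 2)
print(f"effective data rate: {effective_rate_percent:.2f}%")
```

Under that definition the result agrees with the 99.19% stated in the summary.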
The specific processing steps of the data preprocessing are as follows:
1) removing reads in which the number of low-quality bases (quality value ≤ 38) exceeds a set amount (default 40 bp);
2) removing reads in which the number of N bases reaches a set amount (default 10 bp);
3) removing reads whose overlap with the adapter exceeds a set threshold (default 15 bp);
4) if the sample is contaminated with host DNA, aligning the reads against a host database and filtering out reads that may originate from the host;
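The first three filters can be sketched as a single predicate over a read and its quality string. This is a simplified illustration, not the pipeline's actual code: it assumes Phred+33 quality encoding, treats a base as low quality when its value is ≤ 38, detects adapter overlap only as a read suffix matching an adapter prefix, and uses a made-up adapter sequence. Host filtering (step 4) needs an aligner and is left out.

```python
def adapter_overlap(read, adapter):
    """Length of the longest read suffix that equals an adapter prefix."""
    for n in range(min(len(read), len(adapter)), 0, -1):
        if read.endswith(adapter[:n]):
            return n
    return 0

def keep_read(seq, qual, adapter, low_q=38, max_low_q=40, max_n=10, max_overlap=15):
    """Return True if the read passes filters 1-3 with the default thresholds."""
    phred = [ord(c) - 33 for c in qual]  # assumes Phred+33 encoding
    # 1) too many low-quality bases
    if sum(1 for q in phred if q <= low_q) > max_low_q:
        return False
    # 2) too many ambiguous (N) bases
    if seq.upper().count("N") >= max_n:
        return False
    # 3) too much overlap with the adapter
    if adapter_overlap(seq, adapter) > max_overlap:
        return False
    return True

ADAPTER = "GGGGCCCCGGGGCCCCGGGG"  # hypothetical adapter, for illustration only

clean = keep_read("ACGT" * 25, "I" * 100, ADAPTER)                  # passes
low_quality = keep_read("ACGT" * 25, "#" * 41 + "I" * 59, ADAPTER)  # 41 low-quality bases
many_n = keep_read("N" * 10 + "ACGT" * 22 + "AC", "I" * 100, ADAPTER)
with_adapter = keep_read("A" * 84 + ADAPTER[:16], "I" * 100, ADAPTER)
```

For paired-end data, a pipeline would typically drop both mates when either one fails; that pairing logic is not shown here.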
(2) Metagenome assembly
Summary of assembly results: assembly yielded 105,500,331,957 bp of Scaffolds in total, with an average length of 1,934.98 bp, a maximum length of 1,733,071 bp, an N50 of 4,517.84 bp, and an N90 of 692.50 bp. Breaking the Scaffolds at N yielded 97,165,177,458 bp of Scaftigs, with an average length of 1,868 bp, an N50 of 4,139 bp, and an N90 of 678 bp. The specific processing steps of metagenome assembly are as follows:
1) assembling the Clean Data obtained after preprocessing with the SOAPdenovo assembly software;
2) for each single sample, first selecting one K-mer (default 55) for assembly to obtain the assembly result of that sample;
3) breaking the assembled Scaffolds at the N junctions to obtain sequence fragments containing no N, called Scaftigs (i.e., continuous sequences within Scaffolds);
4) aligning the quality-controlled Clean Data of each sample to that sample's assembled Scaftigs with Bowtie2 and collecting the PE reads that were not used;
5) pooling the unused reads of all samples and performing mixed assembly; to limit computation and run time, only one K-mer is selected for this assembly (default -K 55), and the other assembly parameters are the same as for a single sample;
6) breaking the mixed-assembly Scaffolds at the N junctions to obtain Scaftig sequences containing no N;
7) filtering out fragments shorter than 500 bp from the Scaftigs generated by single-sample and mixed assembly, then performing statistical analysis and subsequent gene prediction;
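Steps 3), 6), and 7), together with the N50/N90 statistics quoted in the summary, can be sketched as follows. The function names are illustrative; the N50 definition used is the usual one (the length at which the cumulative length of the longest fragments first reaches half of the total), which is assumed to match the statistic reported above.

```python
import re

def scaffold_to_scaftigs(scaffold, min_len=500):
    """Break a scaffold at runs of N and keep fragments of at least min_len bp."""
    pieces = re.split(r"[Nn]+", scaffold)
    return [p for p in pieces if len(p) >= min_len]

def n_stat(lengths, fraction=0.5):
    """N50 for fraction=0.5, N90 for fraction=0.9."""
    if not lengths:
        raise ValueError("no sequences")
    target = sum(lengths) * fraction
    running = 0
    # Walk from the longest fragment down until the target length is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length

# Toy scaffold: 600 bp, gap, 100 bp (dropped by the 500 bp filter), gap, 700 bp.
scaffold = "A" * 600 + "N" * 10 + "C" * 100 + "NN" + "G" * 700
scaftigs = scaffold_to_scaftigs(scaffold)
scaftig_lengths = [len(s) for s in scaftigs]

n50 = n_stat([2, 3, 4, 5, 6], 0.5)
n90 = n_stat([2, 3, 4, 5, 6], 0.9)
```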
(3) Gene prediction and abundance analysis
Summary of gene prediction results: a total of 123,459,411 ORFs were predicted, an average of 161,385 ORFs per sample. After redundancy removal, 6,727,989 ORFs remained, with a total length of 4,584.45 Mbp, an average length of 681.4 bp, and a GC content of 45.77%; of these, 3,686,582 were complete genes, accounting for 54.79% of all non-redundant genes.
Basic steps of gene prediction:
1) starting from the Scaftigs (≥ 500 bp) of each sample and of the mixed assembly, performing ORF (Open Reading Frame) prediction and filtering with MetaGeneMark;
2) removing redundancy from the ORF prediction results of each sample and of the mixed assembly with the CD-HIT software;
3) aligning the Clean Data of each sample to the non-redundant representative genes and counting the reads aligned to each gene in each sample;
4) filtering out genes whose number of supporting reads is ≤ 2 in every sample, to obtain the gene catalog (Unigenes) used for subsequent analysis;
5) calculating the abundance of each gene in each sample from the number of aligned reads and the gene length;
6) based on the abundance of each gene of the gene catalog in each sample, performing basic statistics, core-pan gene analysis, between-sample correlation analysis, and Venn diagram analysis of gene numbers.
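Steps 4) and 5) can be sketched as below. The text does not give the abundance formula, so the sketch assumes a common length-normalized relative abundance, G_k = (r_k / L_k) / Σ_i (r_i / L_i), where r is the number of aligned reads and L the gene length; the support threshold (genes with ≤ 2 reads in every sample are dropped) is likewise an interpretation. Gene names and counts are made up.

```python
def filter_low_support(read_counts, min_reads=3):
    """Keep genes supported by at least min_reads reads in at least one sample.

    read_counts: {gene: {sample: number of aligned reads}}
    """
    return {
        gene: counts
        for gene, counts in read_counts.items()
        if any(r >= min_reads for r in counts.values())
    }

def relative_abundance(read_counts, gene_lengths, sample):
    """Length-normalized relative abundance of each gene in one sample (assumed formula)."""
    per_base = {
        gene: counts.get(sample, 0) / gene_lengths[gene]
        for gene, counts in read_counts.items()
    }
    total = sum(per_base.values())
    if total == 0:
        raise ValueError(f"no reads aligned in sample {sample!r}")
    return {gene: value / total for gene, value in per_base.items()}

# Hypothetical counts and lengths, for illustration only.
read_counts = {
    "geneA": {"s1": 10, "s2": 0},
    "geneB": {"s1": 2, "s2": 1},   # never more than 2 reads: filtered out
    "geneC": {"s1": 30, "s2": 5},
}
gene_lengths = {"geneA": 1000, "geneB": 500, "geneC": 3000}

unigenes = filter_low_support(read_counts)
abundance_s1 = relative_abundance(unigenes, gene_lengths, "s1")
```

In sample s1, geneA and geneC have the same reads-per-base value, so each gets half of the total abundance even though geneC has three times as many reads.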
(4) Species annotation
Summary of species annotation results: the non-redundant gene catalog contained 6,727,989 ORFs, of which 5,317,849 (79.04%) could be annotated to the NR database. Among the ORFs annotated to the NR database, 88.82% were annotated at the kingdom level, 86.00% at the phylum level, 81.43% at the class level, 80.77% at the order level, 69.52% at the family level, 65.11% at the genus level and 49.00% at the species level. The dominant phyla mainly include Firmicutes, Proteobacteria, Bacteroidetes, and the like. The phyla with significant differences between groups were mainly k__Bacteria;p__Actinobacteria, k__Bacteria;p__Chlamydiae, k__Archaea;p__Euryarchaeota, and others.
Basic steps of annotation:
1) Unigenes were aligned to the bacterial (Bacteria), fungal (Fungi), archaeal (Archaea) and viral (Virus) sequences extracted from NCBI's NR database (Version: 2018.01) using DIAMOND software (blastp, evalue <= 1e-5);
2) filtering the alignment results: for each sequence, the alignments with evalue <= minimum evalue x 10 were selected for subsequent analysis;
3) after filtering, since a sequence may still have several alignment results, the LCA algorithm (as applied in the taxonomic classification of the MEGAN software) was adopted, and the classification level before the first branch was taken as the species annotation of the sequence;
4) obtaining the abundance and gene number of each sample at each classification level (kingdom, phylum, class, order, family, genus, species) from the LCA annotation results and the gene abundance table;
5) starting from the abundance table at each classification level (kingdom, phylum, class, order, family, genus, species), Krona analysis, relative abundance profile display, abundance cluster heat map display, PCA and NMDS dimensionality reduction analysis, Anosim analysis of inter-group (intra-group) differences, and Metastats and LEfSe multivariate statistical analysis of species differing between groups were performed.
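The alignment filter of step 2) and the LCA rule of step 3) can be illustrated with a small sketch. This is not the DIAMOND or MEGAN code: the function names and the toy hits are hypothetical, lineages are assumed to be plain lists ordered from kingdom downward, and the filter uses `<=` so that a best hit with an evalue of 0 is still retained.

```python
from typing import List, Tuple


def filter_hits(hits: List[Tuple[str, float]], factor: float = 10.0) -> List[str]:
    """Keep the subjects whose evalue is within factor times the best (minimum) evalue."""
    if not hits:
        return []
    cutoff = min(evalue for _, evalue in hits) * factor
    return [subject for subject, evalue in hits if evalue <= cutoff]


def lowest_common_ancestor(lineages: List[List[str]]) -> List[str]:
    """Return the part of the taxonomy shared by all lineages, i.e. up to the first branch."""
    shared: List[str] = []
    if not lineages:
        return shared
    for names in zip(*lineages):
        if len(set(names)) != 1:  # lineages diverge here: stop before the branch
            break
        shared.append(names[0])
    return shared


# Toy example: three hits for one gene; the third is far weaker and is filtered out.
hits = [("hitA", 1e-50), ("hitB", 5e-50), ("hitC", 1e-20)]
lineage_of = {
    "hitA": ["Bacteria", "Firmicutes", "Clostridia"],
    "hitB": ["Bacteria", "Firmicutes", "Bacilli"],
    "hitC": ["Archaea", "Euryarchaeota"],
}
kept = filter_hits(hits)
annotation = lowest_common_ancestor([lineage_of[h] for h in kept])
```

In this toy case the two retained hits agree down to the phylum and diverge at the class, so the gene is annotated only to the phylum level.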
3. Construction of classification models
A machine learning classification model was established using the microbial species abundance table obtained by the above process.
Based on XGBoost (eXtreme Gradient Boosting), different numbers of gut microbial features were selected to classify the plain Han population and the plateau Han population. The AUC values (area under the ROC curve) obtained by ten-fold cross-validation were averaged, and the optimal classification model, containing 26 features, was finally selected: s _ bacteria intestinalis CAG _564, s _ Bifidobacterium binary CAG _234, s _ Bifidobacterium longum, s _ Bifidobacterium subintium, s _ Bifidobacterium sulbtile, s _ Clostridium sp.36-4CPA, s _ Clostridium sp.BIDMC107, s _ Corynebacterium glomerans, s _ Eggerthella length, s _ Eggerthella sp.YYY 7918, s _ Enterobacteriaceae complex _ Hoffmann cluster IV, s _ Enterobacter hoechei, s _ Enterobacter sp.BWH52, s _ Enterobacter sp.MGH25, s _ Eubacterium G _12, Eubacterium hocaul 2. Klebs _ Microbacterium sp.1, S _ Bifidobacterium sp.1, Klebsiella _ Microbacterium sp.1. sp.16.
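The evaluation scheme described above (ten-fold cross-validation with the AUC averaged over folds) can be sketched in plain Python. This is a hedged illustration, not the authors' code: the model is passed in as a hypothetical callable `fit_predict`, which in the described workflow would wrap an XGBoost classifier, and the fold assignment and toy data are assumptions.

```python
import random
from statistics import mean, stdev
from typing import Callable, List, Sequence, Tuple


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a positive sample outscores a negative one (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute an AUC")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_folds(labels: Sequence[int], k: int = 10, seed: int = 0) -> List[List[int]]:
    """Distribute sample indices over k folds while keeping the class ratio roughly constant."""
    rng = random.Random(seed)
    folds: List[List[int]] = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds


def cross_validated_auc(X: Sequence[Sequence[float]], y: Sequence[int],
                        fit_predict: Callable, k: int = 10, seed: int = 0) -> Tuple[float, float]:
    """Return (mean AUC, standard deviation) over k folds.

    fit_predict(train_X, train_y, test_X) must return one score per test sample.
    """
    aucs = []
    for test_idx in stratified_folds(y, k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        scores = fit_predict([X[i] for i in train_idx],
                             [y[i] for i in train_idx],
                             [X[i] for i in test_idx])
        aucs.append(roc_auc([y[i] for i in test_idx], scores))
    return mean(aucs), stdev(aucs)
```

To choose the number of features, the same routine would be repeated on the top-N features for several values of N, keeping the N with the highest mean AUC. Each class needs at least `k` samples so that every fold contains both classes.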
Third, experimental results
The model constructed based on the 26 features is the optimal model. FIG. 1 is a graph of the contribution value results for each feature; FIG. 2 is a graph of feature numbers versus AUC values.
FIG. 3 shows the ROC curve of the optimal model: the AUC is 0.98 ± 0.02 with P < 0.01, which shows that the model constructed from these microorganisms can accurately distinguish the plain Han population from the plateau Han population.
The preferred embodiments of the present application have been described in detail with reference to the accompanying drawings, however, the present application is not limited to the details of the above embodiments, and various simple modifications can be made to the technical solution of the present application within the technical idea of the present application, and these simple modifications are all within the protection scope of the present application.
It should be noted that, in the foregoing embodiments, various features described in the above embodiments may be combined in any suitable manner, and in order to avoid unnecessary repetition, various possible combinations are not described in the present application.
In addition, any combination of the various embodiments of the present application is also possible, and the same should be considered as disclosed in the present application as long as it does not depart from the idea of the present application.

Claims (10)

1. A system for identifying or differentiating populations at different altitudes, comprising the following units:
1) a detection unit: comprising a microorganism detection module;
2) an analysis unit: the abundance levels of the microorganisms detected by the detection unit are used as input variables and fed into a classification model of populations from different altitude regions for analysis;
3) an evaluation unit: outputting the probability that the individual corresponding to the sample belongs to the plain population or the plateau population;
the microorganism is selected from s _ bacteria intestinalis CAG _564, s _ Bifidobacterium duplex CAG _234, s _ Bifidobacterium longum, s _ Bifidobacterium mericum, s _ Bifidobacterium sub, s _ Clostridium sp.36-4CPA, s _ Clostridium sp.BIC107, s _ Coriobacterium glomerans, s _ Eggerthella length, s _ Eggerthella sp.Y7918, s _ Enterobacter cloacae _ filler _ Hoffmann IV, s _ Enterobacter Hoffmann H52, s _ Enterobacter sp.MGH25, S _ Enterobacter sp.12, S _ Enterobacter hoffm, S _ Enterobacter sp.BWH52, S _ Enterobacter sp.MGH25, S _ Lactobacillus sp.12, C _ bacteria sp.1, C _ bacterial sp.1, S _ Lactobacillus sp.777, C _ bacterial sp.1, S _ Lactobacillus sp.1, S _ Bacillus sp.202, S _ Bacillus _ strain.
2. Use of microorganisms in constructing a classification model of populations at different altitudes, characterized in that the microorganisms are selected from s _ bacteria intestinalis CAG _564, s _ Bifidobacterium binary CAG _234, s _ Bifidobacterium longum, s _ Bifidobacterium mericum, s _ Bifidobacterium subintidium, s _ Bifidobacterium subintile, s _ Clostridium sp.36-4CPA, s _ Clostridium biodC 107, s _ Corynebacterium globosum, s _ Egghella, s _ Eggella sp.YYYYYYY 7918, s _ Enterobacter cloacae _ Hoffmann clner IV, s _ Enterobacter faecalis, s _ Enterobacter sp.Hsp.52, Enterobacter clyceae complex _ Hoffmann clcer IV, S _ Enterobacter sp.25, Klebsiella _ Microbacterium sp.1, Klebsiella _ Microbacterium _1, Klebsiella _ Microbacterium _777, Klebsiella _ Microbacterium _ 1.20, Klebsiella _ Microbacterium _ P.20, Klebsiella _ Microbacterium _ P.sp.20, Microbacterium _ P.sp.sp.sp.sp.sp.20, Microbacterium _ Microbacterium.
3. The use according to claim 2, wherein the classification model is determined using one or more algorithms selected from the group consisting of: XGBoost (xgb), Random Forest (RF), glmnet, cforest, classification and regression tree (CART), treebag, k-nearest neighbors (kNN), neural network (nnet), support vector machine with radial kernel (SVM-radial), support vector machine with linear kernel (SVM-linear), Naive Bayes (NB), or multilayer perceptron (mlp).
4. A composition for identifying or differentiating populations at different altitudes comprising reagents for measuring abundance levels of microorganisms, the microorganism is selected from the group consisting of s _ bacteria internestinal CAG _564, s _ Bifidobacterium duplex CAG _234, s _ Bifidobacterium long, s _ Bifidobacterium mericum, s _ Bifidobacterium sub, s _ Clostridium sub), s _ Clostridium sp.36-4CPA, s _ Clostridium sp.BIC107, s _ Corynebacterium globosum, s _ Eggerthella lens, s _ Eggerthella sp.Y7918, s _ Enterobacter cloacae complex _ Hoffmann purifier IV, s _ Enterobacter hoecheci, s _ Enobacter sp.BWH52, s _ Enterobacter sp.MGH25, s _ Enterobacter g 12, S _ Enterobacter hoecheci, S _ Enterobacter sp.BWH52, S _ Enterobacter sp.MGH25, S _ Lactobacillus sp.12, S _ Enterobacter sp.14, Klebsiella _ Clostridium sp.1-4, Klebsiella sp.1-4 Klebsiella, S _ Bifidobacterium sp.21-4, S _ Bacillus sp.sp.21-Klebsiella, or Klebsiella _ Bacillus sp.sp.sp.1-4.
5. The composition of claim 4, wherein the reagents comprise reagents for measuring the abundance level of the microorganism by 16S rRNA sequencing, whole genome sequencing, quantitative polymerase chain reaction, PCR-pyrosequencing, fluorescence in situ hybridization, microarray or PCR-ELISA.
6. The composition according to claim 4, wherein the reagents comprise a primer, a probe, an antisense oligonucleotide, an aptamer or an antibody.
7. Use of the composition according to any one of claims 4 to 6 in the preparation of a tool for identifying or differentiating populations at different altitudes.
8. The use of claim 7, wherein the tool comprises a chip, a kit, a test strip or a high-throughput sequencing platform.
9. A method for identifying or differentiating populations at different altitudes, said method comprising detecting the abundance of microorganisms, the microorganism is selected from s _ bacteria intestinalis CAG _564, s _ Bifidobacterium double CAG _234, s _ Bifidobacterium longum, s _ Bifidobacterium subintibacter, s _ Bifidobacterium subtile, s _ Clostridium sp.36-4CPA, s _ Clostridium sp.BIDMC107, s _ Corynebacterium globeans, s _ Eggerthella length, s _ Eggerthella Y7918, s _ Enterobacteriaceae complex _ Hoffmann CLUSTER IV, s _ Enterobacteriaceae, s _ Engineer sp.BWH52, s _ Enterobacter sp.MGH25, Eubacterium clysis, S _ Enterobacter hobacter hostesi, S _ Enterobacter hobacter sp.12, S _ Enterobacter hobacter sp.V.20, S _ Enterobacter sp.W.P.52, S _ Enterobacter sp.MGH25, S _ Lactobacillus sp.12, Klebsiella _ Lactobacillus sp.1, Klebsiella _ Bacillus _ Microbacterium _1, S _ Lactobacillus sp.20, S _ Lactobacillus sp.1, S _ Lactobacillus sp.20, Klebsiella _ Lactobacillus sp.1, S _ Lactobacillus sp.1. or Klebsiella _ Lactobacillus sp.1. sp.3. sp.1. 4. sp.3. coli.
10. The method of claim 9, wherein said different altitudes comprise plain regions and plateau regions.
CN202210221736.6A 2022-03-09 2022-03-09 Model for identifying or distinguishing people at different altitudes and application thereof Active CN114566224B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210221736.6A CN114566224B (en) 2022-03-09 2022-03-09 Model for identifying or distinguishing people at different altitudes and application thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210221736.6A CN114566224B (en) 2022-03-09 2022-03-09 Model for identifying or distinguishing people at different altitudes and application thereof

Publications (2)

Publication Number Publication Date
CN114566224A true CN114566224A (en) 2022-05-31
CN114566224B CN114566224B (en) 2023-08-11

Family

ID=81717079

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210221736.6A Active CN114566224B (en) 2022-03-09 2022-03-09 Model for identifying or distinguishing people at different altitudes and application thereof

Country Status (1)

Country Link
CN (1) CN114566224B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180080065A1 (en) * 2016-09-15 2018-03-22 Sun Genomics, Inc. Universal method for extracting nucleic acid molecules from a diverse population of one or more types of microbes in a sample
CN109652493A (en) * 2019-01-16 2019-04-19 中国人民解放军总医院 The bacillus gram that quivers, which belongs to, is identifying and/or is distinguishing the application in not agnate individual
CN109652570A (en) * 2019-01-16 2019-04-19 中国人民解放军总医院 Microorganism is identifying and/or is distinguishing the application in not agnate individual
CN109825561A (en) * 2019-01-16 2019-05-31 中国人民解放军总医院 Quasi- Prey irrigates Pseudomonas and is identifying and/or distinguishing the application in not agnate individual
CN109913525A (en) * 2019-02-13 2019-06-21 中国人民解放军总医院 Butyrivibrio is identifying and/or is distinguishing the application in highlands Chinese Han Population and Tibetan populations
CN109913526A (en) * 2019-02-13 2019-06-21 中国人民解放军总医院 Microorganism is identifying and/or is distinguishing the application in not agnate individual
US20210224615A1 (en) * 2020-01-21 2021-07-22 Axis Ab Distinguishing - in an image - human beings in a crowd
US20210388416A1 (en) * 2018-10-26 2021-12-16 Sun Genomics, Inc. Universal method for extracting nucleic acid molecules from a diverse population of microbes

Non-Patent Citations (10)

* Cited by examiner, † Cited by third party
Title
KANG LI 等: "Comparative Analysis of Gut Microbiota of Native Tibetan and Han Populations Living at Different Altitudes", 《PLOS ONE》, vol. 11, no. 5, 27 May 2016 (2016-05-27), pages 1 - 16 *
LONG LI 等: "Comparative analyses of fecal microbiota in Tibetan and Chinese Han living at low or high altitude by barcoded 454 pyrosequencing", 《SCIENTIFICREPORTS》, pages 1 - 10 *
LULU ZHU 等: "Distinct Features of Gut Microbiota in High-Altitude Tibetan and Middle-Altitude Han Hypertensive Patients", 《HINDAWI》, 21 November 2020 (2020-11-21), pages 1 - 15 *
刘贵琴 等: "高原低氧条件下的肠道菌群与药物代谢", 《药学研究》, vol. 38, no. 12, 31 December 2019 (2019-12-31), pages 714 - 718 *
吴国军: "以微生物基因组为核心探究肠道菌群、饮食与人 体健康的互作", 《中国博士论文全文数据库基础学科辑》, no. 01, pages 006 - 248 *
董文学 等: "青藏高原人群微生物组研究进展", 《中国高原医学与生物学杂志》, vol. 42, no. 4, 31 December 2021 (2021-12-31), pages 286 - 288 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115274123A (en) * 2022-07-15 2022-11-01 中国人民解放军总医院 Physical ability level prediction method, system, device, medium, and program product
CN115274123B (en) * 2022-07-15 2023-03-24 中国人民解放军总医院 Physical ability level prediction method, system, device, medium, and program product

Also Published As

Publication number Publication date
CN114566224B (en) 2023-08-11

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant