CN111524555A - Automatic typing method based on human intestinal flora - Google Patents


Publication number
CN111524555A
Authority
CN
China
Prior art keywords
intestinal flora
human intestinal
analysis
cluster
clusters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010313064.2A
Other languages
Chinese (zh)
Inventor
Wang Shuwei (王树伟)
Xiao Yunping (肖云平)
Shi Xianjun (史贤俊)
Lin Bo (林博)
Zhang Jianming (张建明)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Oe Biotech Co ltd
Original Assignee
Shanghai Oe Biotech Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2020-04-20
Filing date
2020-04-20
Publication date
2020-08-11
Application filed by Shanghai Oe Biotech Co ltd
Priority to CN202010313064.2A
Publication of CN111524555A
Legal status: Pending

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/30 Unsupervised data analysis

Abstract

The invention discloses an automatic typing method based on human intestinal flora. The clustering results are grouped and screened for biomarkers using LEfSe, after which the specific enterotype (gut type) of each cluster is determined. The results are comprehensive and include the cluster plot, the biomarker screening and the enterotype boxplot. All analysis results are organized automatically: after each analysis step is completed, the results are summarized, tallied and visualized. Every operation step can be traced, which facilitates troubleshooting; if an error occurs during the analysis, the corresponding error log information is recorded.

Description

Automatic typing method based on human intestinal flora
Technical Field
The invention relates to the field of high-throughput microbial sequencing, in particular to an automatic typing method based on human intestinal flora.
Background
In 2011, a European research group analyzed the gut microbial composition of 22 Europeans using differences in a bacterial gene, and characterized how the composition of the microbial community differs between individuals and within the same individual. They also compared the microbial community composition patterns of these Europeans with those previously reported for Japanese and American subjects. They found that the microbial communities are not randomly assembled: across all populations tested, they fall roughly into three types, called enterotypes, which the researchers designated the Bacteroides type, the Prevotella type and the Ruminococcus type, meaning that the gut contains relatively more Bacteroides, Prevotella or Ruminococcus, respectively. The same conclusion was reached in a larger population (154 Americans and 85 Danes), which could also be divided into these three types. This suggests that the number of microbial community configurations that can thrive in the human gut is limited.
In April 2011, the MetaHIT consortium published the enterotypes found in the human gut microbiota (Arumugam, Raes et al., 2011). The data from that study are public, and the theory behind the computation is explained in the supplementary information of the article. However, the supplement does not report the exact set of commands (in the R environment) or a complete visual presentation of the specific enterotype identification method that would enable anyone to reproduce all the data in the article.
Existing enterotype identification has the following defects:
(1) the enterotype identification is ambiguous: the method for determining the enterotype of each clustering result is not clear;
(2) the presentation of results is incomplete: the analysis results are too simple, the data mining is not deep enough, and visualizations corresponding to the data are lacking.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide an automatic typing method based on human intestinal flora.
To achieve this purpose, the invention adopts the following scheme:
An automatic typing method based on human intestinal flora, comprising the following steps:
1) preparing a genus-level species relative abundance table of all samples;
2) partitioning the samples with the Partitioning Around Medoids (PAM) algorithm, clustering the abundance distributions, and selecting the optimal number of clusters using the Calinski-Harabasz (CH) index;
3) verifying the clustering quality with the silhouette validation technique;
4) performing between-class analysis (BCA) according to the optimal number of clusters;
5) selecting, by LEfSe analysis, the species in each group that contributes most to the difference as the enterotype of that group, and drawing a boxplot.
Preferably, in step 2), the Calinski-Harabasz index is defined as:
$$ CH_k = \frac{B_k/(k-1)}{W_k/(n-k)} $$
where B_k is the between-cluster sum of squares, W_k is the within-cluster sum of squares, n is the number of samples and k is the number of clusters; the number of clusters k that maximizes CH_k is selected.
Preferably, in step 3), the silhouette width s(i) of each data point i is calculated by the following formula:
$$ s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} $$
where a(i) is the average dissimilarity (or distance) of sample i to all other samples in the same cluster, and b(i) is the average dissimilarity (or distance) of sample i to all objects in the nearest other cluster.
The formula implies that -1 ≤ s(i) ≤ 1. A sample that is much closer to its own cluster than to any other cluster has a high s(i) value, whereas s(i) close to 0 means that the sample lies between two clusters, and a large negative s(i) indicates that the sample has been assigned to the wrong cluster.
Preferably, in step 4), the between-class analysis (BCA) is performed using R and the ade4 package.
Preferably, in step 5), features that differ between groups are first detected with a rank-sum test, and linear discriminant analysis (LDA) is then used for dimensionality reduction and to evaluate the effect size of each differing species, which yields the LDA score.
Preferably, in step 5), each enterotype is named in the form of the letter G plus a number.
Preferably, in step 5), the LEfSe analysis workflow is used to find biomarkers that differ significantly between clusters.
Preferably, in step 5), the boxplot is drawn by using the ggplot2 software package in the R language.
The invention has the beneficial effects that:
(I) The clustering results are screened for biomarkers using LEfSe, after which the specific enterotype is determined.
(II) The results are comprehensive and include the cluster plots, the biomarker screening and the enterotype boxplot.
(III) All analysis results are organized automatically; after each analysis step is completed, the results are automatically summarized, tallied and visualized.
Drawings
FIG. 1 is a schematic flow chart of the present invention.
FIG. 2 is a plot of the optimal cluster number selection of the present invention.
FIG. 3 is a between-class analysis cluster plot of the present invention.
FIG. 4 is a between-class analysis cluster plot with sample names of the present invention.
FIG. 5 is a biomarker bar chart of the present invention.
FIG. 6 is an enterotype boxplot of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments are described below clearly and completely with reference to the accompanying drawings. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort fall within the protection scope of the present invention. In addition, the technical features of the various embodiments, or of a single embodiment, may be combined with one another to form a feasible technical solution, provided that a person skilled in the art can implement the combination; where a combination is contradictory or cannot be implemented, it should be considered not to exist and falls outside the protection scope of the present invention.
The invention provides an automatic typing method based on human intestinal flora which, as shown in FIG. 1, comprises the following steps:
1. File preparation step:
A genus-level species relative abundance table of the samples from different populations is obtained from high-throughput sequencing.
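As an illustration of the file preparation step, the following minimal sketch converts per-sample read counts into a genus-level relative abundance table. It is written in Python purely for illustration (the patent itself names R), and the function name `relative_abundance`, the sample identifiers and the counts are hypothetical.

```python
def relative_abundance(counts):
    """Convert {sample: {genus: read count}} into {sample: {genus: fraction}}."""
    table = {}
    for sample, genus_counts in counts.items():
        total = sum(genus_counts.values())
        if total == 0:
            raise ValueError(f"sample {sample!r} has no reads")
        # Each genus is expressed as a fraction of the reads in its own sample.
        table[sample] = {genus: n / total for genus, n in genus_counts.items()}
    return table


# Hypothetical genus-level counts for two samples.
counts = {
    "S1": {"Bacteroides": 60, "Prevotella": 30, "Ruminococcus": 10},
    "S2": {"Bacteroides": 5, "Prevotella": 90, "Ruminococcus": 5},
}
table = relative_abundance(counts)
```

Each row of the resulting table sums to 1, so samples with different sequencing depths become comparable.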
2. Optimal cluster number selection step:
The invention uses the Partitioning Around Medoids (PAM) algorithm to partition the samples and cluster the abundance distributions. PAM is derived from the basic k-means algorithm, but has the advantage of supporting arbitrary distance measures and is more robust than k-means. The clustering itself is unsupervised, but the number of clusters is predetermined and supplied as input to the procedure, which then divides the data into that number of clusters.
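The PAM procedure can be sketched as follows. This is a simplified greedy-swap variant in plain Python, not the implementation used by the invention; the Euclidean distance, the naive initialisation with the first k points and the example abundance vectors are all assumptions chosen for illustration, and any other distance function can be passed in through the `dist` parameter.

```python
import itertools
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pam(points, k, dist=euclidean):
    """Greedy-swap PAM sketch: returns (medoid indices, cluster label per point)."""
    n = len(points)
    d = [[dist(points[i], points[j]) for j in range(n)] for i in range(n)]

    def cost(medoids):
        # Total distance of every point to its nearest medoid.
        return sum(min(d[i][m] for m in medoids) for i in range(n))

    medoids = list(range(k))  # naive initialisation (assumption)
    best = cost(medoids)
    improved = True
    while improved:
        improved = False
        for pos, cand in itertools.product(range(k), range(n)):
            if cand in medoids:
                continue
            trial = medoids[:pos] + [cand] + medoids[pos + 1:]
            c = cost(trial)
            if c < best - 1e-12:  # accept only strict improvements
                medoids, best, improved = trial, c, True
    labels = [min(range(k), key=lambda j: d[i][medoids[j]]) for i in range(n)]
    return medoids, labels


# Hypothetical relative abundances (Bacteroides, Prevotella, Ruminococcus).
samples = [
    (0.8, 0.1, 0.1), (0.7, 0.2, 0.1),
    (0.1, 0.8, 0.1), (0.2, 0.7, 0.1),
    (0.1, 0.1, 0.8), (0.1, 0.2, 0.7),
]
medoids, labels = pam(samples, k=3)
```

Because each medoid is an actual sample, only the pairwise distance matrix is needed, which is why PAM works with arbitrary distance measures.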
To evaluate the optimal number of clusters, the present invention uses the Calinski-Harabasz (CH) index, which has shown good performance in recovering the number of clusters. It is defined as:
$$ CH_k = \frac{B_k/(k-1)}{W_k/(n-k)} $$
where B_k is the between-cluster sum of squares (i.e., the squared distances between all points i and j that are not in the same cluster), W_k is the within-cluster sum of squares (i.e., the squared distances between all points i and j that are in the same cluster), n is the number of samples and k is the number of clusters. This measure implements the idea that the clustering is better when the distances between clusters are much greater than the distances within clusters. Therefore, the number of clusters k that maximizes CH_k is chosen.
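The selection rule can be illustrated with a short Python sketch. It assumes the common Euclidean form of the index, in which the between-cluster and within-cluster sums of squares are computed from cluster centroids; the function name, the one-dimensional example points and the candidate partitions are hypothetical.

```python
def calinski_harabasz(points, labels):
    """CH index (B_k / (k - 1)) / (W_k / (n - k)) using Euclidean centroids."""
    n, dim = len(points), len(points[0])
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)
    if not 1 < k < n:
        raise ValueError("the CH index requires 1 < k < n")

    def mean(rows):
        return [sum(r[j] for r in rows) / len(rows) for j in range(dim)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    grand = mean(points)
    # Between-cluster sum of squares: centroid spread, weighted by cluster size.
    b_k = sum(len(rows) * sq_dist(mean(rows), grand) for rows in clusters.values())
    # Within-cluster sum of squares: spread of points around their own centroid.
    w_k = sum(sq_dist(p, mean(rows)) for rows in clusters.values() for p in rows)
    if w_k == 0:
        return float("inf")
    return (b_k / (k - 1)) / (w_k / (n - k))


# Hypothetical candidate partitions of the same four points for k = 2 and k = 3.
points = [(0.0,), (2.0,), (10.0,), (12.0,)]
candidates = {2: [0, 0, 1, 1], 3: [0, 0, 1, 2]}
best_k = max(candidates, key=lambda k: calinski_harabasz(points, candidates[k]))
```

In a full pipeline the candidate partitions would come from running the clustering once per candidate value of k.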
3. Clustering validation step:
Cluster validation methods are very useful for evaluating the quality of a clustering with respect to the underlying data points. The invention uses the silhouette validation technique. The silhouette width s(i) of each data point i is calculated by:
$$ s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} $$
where a(i) is the average dissimilarity (or distance) of sample i to all other samples in the same cluster, and b(i) is the average dissimilarity (or distance) of sample i to all objects in the nearest other cluster.
The formula implies that -1 ≤ s(i) ≤ 1. A sample that is much closer to its own cluster than to any other cluster has a high s(i) value, whereas s(i) close to 0 means that the sample lies between two clusters. A large negative s(i) indicates that the sample has been assigned to the wrong cluster.
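The silhouette width can be computed directly from a pairwise distance matrix, as in the following Python sketch. The function name and the toy data are hypothetical, and assigning a width of 0 to samples in singleton clusters is a convention assumed here rather than something stated in the text.

```python
def silhouette_widths(dist, labels):
    """Return s(i) for every sample, given a distance matrix and cluster labels."""
    n = len(labels)
    widths = []
    for i in range(n):
        by_cluster = {}
        for j in range(n):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(dist[i][j])
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            widths.append(0.0)  # singleton cluster or only one cluster (assumed convention)
            continue
        a = sum(own) / len(own)                                 # a(i): own cluster
        b = min(sum(v) / len(v) for v in by_cluster.values())   # b(i): nearest other cluster
        widths.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return widths


# Hypothetical one-dimensional samples with two well-separated clusters.
values = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(x - y) for y in values] for x in values]
widths = silhouette_widths(dist, [0, 0, 1, 1])
```

Averaging the widths over all samples gives a single score for the whole partition.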
4. Between-class analysis (BCA) step:
Between-class analysis (BCA) is performed to support the clustering and to identify the drivers of the enterotypes. The analysis is performed using R and the ade4 package. Prior to this analysis, genera of very low abundance in the Illumina dataset, namely those whose average abundance across all samples is below 0.01%, are removed to reduce noise. Between-class analysis is a special case of principal component analysis with an instrumental variable, where the instrumental variable is a qualitative factor (i.e., the enterotype cluster). Between-class analysis first finds the principal components, which are computed from the center of gravity of each group.
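The noise filter applied before the between-class analysis can be sketched in Python as follows. Only the filtering is shown; the between-class analysis itself is performed with the ade4 package in R and is not reproduced here. The function name and the example table are hypothetical, and abundances are assumed to be stored as fractions, so 0.01% corresponds to 0.0001.

```python
def drop_rare_genera(table, min_mean=0.0001):
    """Keep genera whose mean relative abundance over all samples is >= min_mean.

    table: {sample: {genus: relative abundance as a fraction}}
    """
    samples = list(table)
    genera = sorted({genus for row in table.values() for genus in row})
    keep = [
        genus for genus in genera
        if sum(table[s].get(genus, 0.0) for s in samples) / len(samples) >= min_mean
    ]
    # Genera missing from a sample are treated as having zero abundance.
    return {s: {genus: table[s].get(genus, 0.0) for genus in keep} for s in samples}


# Hypothetical table: "RareGenus" averages 0.002%, below the 0.01% threshold.
table = {
    "S1": {"Bacteroides": 0.70, "Prevotella": 0.29999, "RareGenus": 0.00001},
    "S2": {"Bacteroides": 0.20, "Prevotella": 0.79997, "RareGenus": 0.00003},
}
filtered = drop_rare_genera(table)
```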
5. Inter-cluster LEfSe analysis step:
To screen for biomarkers that differ significantly between clusters, features that differ between groups are first detected with a rank-sum test; linear discriminant analysis (LDA) is then used for dimensionality reduction and to evaluate the effect size of each differing species, which yields the LDA score.
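The rank-sum test stage can be illustrated with the Kruskal-Wallis H statistic, which is the multi-group rank-sum test used by LEfSe. The Python sketch below computes only the H statistic, without the tie correction factor or the p-value, and does not reproduce the LDA effect-size stage; the function name and the abundance values are hypothetical.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic for a list of groups (tied values share an average rank)."""
    pooled = sorted((value, gi) for gi, values in enumerate(groups) for value in values)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for _, gi in pooled[i:j]:
            rank_sums[gi] += avg_rank
        i = j
    h = 12.0 / (n * (n + 1)) * sum(r * r / len(g) for r, g in zip(rank_sums, groups))
    return h - 3 * (n + 1)


# Hypothetical abundances of one genus in three clusters.
separated = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
mixed = kruskal_wallis_h([[1, 4, 7], [2, 5, 8], [3, 6, 9]])
```

A genus whose abundance separates the clusters cleanly gets a large H, while a genus that is distributed similarly in all clusters gets a value near zero.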
6. Enterotype boxplot display step:
The species in each group that contributes most to the difference is selected by LEfSe analysis as the enterotype of that group, and a boxplot is drawn.
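A minimal Python sketch of this step is given below. It assumes that the contribution of each genus is already available as an LDA score per cluster, picks the highest-scoring genus as the enterotype driver, labels the clusters in the form G plus a number, and computes the five-number summary that a boxplot displays. The function names, the scores and the abundance values are hypothetical; the actual plot is drawn with ggplot2 in R.

```python
import statistics


def name_enterotypes(lda_scores):
    """Map {cluster id: {genus: LDA score}} to {cluster id: (name, top genus)}."""
    named = {}
    for index, cluster in enumerate(sorted(lda_scores), start=1):
        scores = lda_scores[cluster]
        top_genus = max(scores, key=scores.get)  # genus contributing most
        named[cluster] = (f"G{index}", top_genus)
    return named


def box_stats(values):
    """Five-number summary underlying a boxplot."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median, "q3": q3, "max": max(values)}


# Hypothetical LDA scores for two clusters and abundances of one genus in one cluster.
lda_scores = {
    1: {"Bacteroides": 4.8, "Prevotella": 2.1},
    2: {"Bacteroides": 1.0, "Prevotella": 5.0},
}
enterotypes = name_enterotypes(lda_scores)
summary = box_stats([0.1, 0.2, 0.3, 0.4, 0.5])
```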
Examples
To demonstrate the practicability of the claimed technical solution, its practical application is further described below, taking metagenome data of human intestinal bacteria from different countries as an example. It should be noted that the example is given only to show the spirit of the technical solution more clearly and does not limit it; all technical solutions that follow the spirit of the present invention fall within the protection scope of this patent.
Taking metagenome data of human intestinal bacteria from different countries as an example:
An initial data set containing 33 samples is downloaded; download address: https:// energy type. The data set contains a genus-level abundance table of human intestinal microorganisms from different countries and regions.
The samples are partitioned with the Partitioning Around Medoids (PAM) algorithm, the abundance distributions are clustered, and the optimal number of clusters is selected using the Calinski-Harabasz (CH) index; see FIG. 2.
The optimal clustering is verified with the silhouette validation technique, and between-class analysis (BCA) is performed according to the optimal number of clusters; see FIG. 3 and FIG. 4.
The inter-cluster biomarkers are screened by LEfSe analysis and taken as the enterotypes; see FIG. 5. Finally, an enterotype boxplot is drawn; see FIG. 6.
In a preferred embodiment of the present invention, in the inter-cluster LEfSe analysis step and the enterotype boxplot step, each enterotype is named in the form of the letter G plus a number.
In a preferred embodiment of the present invention, in the inter-cluster LEfSe analysis step, the LEfSe analysis workflow is used to find biomarkers that differ significantly between clusters.
In a preferred embodiment of the present invention, in the between-class analysis (BCA) step and the enterotype boxplot step, the figures are drawn using the ggplot2 package in the R language.
In addition, it should be noted that the steps of the embodiments of the present invention can be integrated and run on a Linux system with a single command. The scripts are chained together with a shell script, and the algorithms and plotting mainly rely on the R packages ade4 and ggplot2.

Claims (8)

1. An automatic typing method based on human intestinal flora, comprising the following steps:
1) preparing a genus-level species relative abundance table of all samples;
2) partitioning the samples with the Partitioning Around Medoids (PAM) algorithm, clustering the abundance distributions, and selecting the optimal number of clusters using the Calinski-Harabasz index;
3) verifying the clustering quality with the silhouette validation technique;
4) performing between-class analysis (BCA) according to the optimal number of clusters;
5) selecting, by LEfSe analysis, the species in each group that contributes most to the difference as the enterotype of that group, and drawing a boxplot.
2. The method for the automated typing of human intestinal flora according to claim 1, wherein in step 2), the Calinski-Harabasz index is defined as:
$$ CH_k = \frac{B_k/(k-1)}{W_k/(n-k)} $$
where B_k is the between-cluster sum of squares, W_k is the within-cluster sum of squares, n is the number of samples and k is the number of clusters; the number of clusters k that maximizes CH_k is selected.
3. The method for the automated typing of human intestinal flora according to claim 1, wherein in step 3), the silhouette width s(i) of each data point i is calculated by the following formula:
$$ s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} $$
where a(i) is the average dissimilarity or distance of sample i to all other samples in the same cluster, and b(i) is the average dissimilarity or distance of sample i to all objects in the nearest other cluster;
the formula implies that -1 ≤ s(i) ≤ 1; a sample that is much closer to its own cluster than to any other cluster has a high s(i) value, whereas s(i) close to 0 means that the sample lies between two clusters, and a large negative s(i) indicates that the sample has been assigned to the wrong cluster.
4. The method for the automated typing of human intestinal flora according to claim 1, wherein in step 4), the between-class analysis (BCA) is performed using R and the ade4 package.
5. The method for the automated typing of human intestinal flora according to claim 1, wherein in step 5), features that differ between groups are first detected with a rank-sum test, and linear discriminant analysis (LDA) is then used for dimensionality reduction and to evaluate the effect size of each differing species, which yields the LDA score.
6. The method for the automated typing of human intestinal flora according to claim 1, wherein in step 5), each enterotype is named in the form of the letter G plus a number.
7. The method for the automated typing of human intestinal flora according to claim 1, wherein in step 5), the LEfSe analysis workflow is used to find biomarkers that differ significantly between clusters.
8. The method for the automated typing of human intestinal flora according to any one of claims 1 to 7, wherein in step 5), the boxplot is drawn using the ggplot2 package in the R language.
CN202010313064.2A 2020-04-20 2020-04-20 Automatic typing method based on human intestinal flora Pending CN111524555A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010313064.2A CN111524555A (en) 2020-04-20 2020-04-20 Automatic typing method based on human intestinal flora

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010313064.2A CN111524555A (en) 2020-04-20 2020-04-20 Automatic typing method based on human intestinal flora

Publications (1)

Publication Number Publication Date
CN111524555A (en) 2020-08-11

Family

ID=71901704

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010313064.2A Pending CN111524555A (en) 2020-04-20 2020-04-20 Automatic typing method based on human intestinal flora

Country Status (1)

Country Link
CN (1) CN111524555A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114203256A (en) * 2022-02-18 2022-03-18 上海仁东医学检验所有限公司 MIBC typing and prognosis prediction model construction method based on microbial abundance

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109897906A (en) * 2019-03-04 2019-06-18 福建西陇生物技术有限公司 A kind of detection method and its application of intestinal flora 16S rRNA gene
CN109933984A (en) * 2019-02-15 2019-06-25 中时瑞安(北京)网络科技有限责任公司 A kind of best cluster result screening technique, device and electronic equipment
CN110423804A (en) * 2019-08-12 2019-11-08 中国福利会国际和平妇幼保健院 A kind of the biomarker set and screening method of screening missed abortion risk
CN110607262A (en) * 2019-09-25 2019-12-24 君维安(武汉)生命科技有限公司 Probiotic composition for intervening inflammatory enteritis and screening method and application thereof
CN110734989A (en) * 2019-11-06 2020-01-31 华中科技大学鄂州工业技术研究院 medicinal plant symbiotic microorganism identification method and application thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
Inventor after: Zhang Jianming, Xiao Yunping, Wang Shuwei, Shi Xianjun, Lin Bo, Liu Yuchuan
Inventor before: Wang Shuwei, Xiao Yunping, Shi Xianjun, Lin Bo, Zhang Jianming
RJ01 Rejection of invention patent application after publication
Application publication date: 20200811