CN111524555A - Automatic typing method based on human intestinal flora

Info

Publication number: CN111524555A
Application number: CN202010313064.2A
Authority: CN
Inventors: 王树伟; 肖云平; 史贤俊; 林博; 张建明
Original assignee: Shanghai Oe Biotech Co ltd
Current assignee: Shanghai Oe Biotech Co ltd
Priority date: 2020-04-20
Filing date: 2020-04-20
Publication date: 2020-08-11

Abstract

The invention discloses an automatic typing method based on human intestinal flora. Clustering results are grouped and screened for biomarkers using LEfSe, after which the specific enterotype of each cluster is determined. The results are comprehensive and include the cluster plot, the biomarker screening, and the enterotype boxplot. All analysis results are sorted automatically, and after each analysis step is completed the results are automatically summarized, tallied, and visualized. All operation steps are traceable, which facilitates troubleshooting: if an error occurs during the analysis, corresponding error log information is recorded.

Description

Automatic typing method based on human intestinal flora

Technical Field

The invention relates to the field of high-throughput microbial sequencing, in particular to an automatic typing method based on human intestinal flora.

Background

In 2011, a European research institution analyzed the composition of the gut microorganisms of 22 Europeans using differences in a bacterial gene, and characterized how the composition of the microbial community differs between individuals and within the same individual. They also compared the community composition patterns of these Europeans with those previously reported for Japanese and American subjects. The results showed that microbial communities are not randomly assembled: across all populations tested, they can be roughly classified into three types, called enterotypes, which the scientists named the Bacteroides type, the Prevotella type and the Ruminococcus type, meaning that they are enriched in Bacteroides, Prevotella or Ruminococcus, respectively. An investigation of a larger population (154 Americans and 85 Danes) reached the same conclusion, with the subjects again falling into these three categories. This suggests that the number of microbial community configurations that can thrive in the human intestine is limited.

In April 2011, the MetaHIT consortium published the enterotypes found in the human gut microbiota (Arumugam, Raes et al., 2011). The data of that study are public, and the theory behind the computation is explained in the supplementary information of the article. However, the supplement does not report the exact set of commands (in the R environment) or a complete visual presentation of the enterotype identification method that would enable anyone to reproduce all the data in the article.

Existing enterotype identification has the following defects:

(1) the enterotype identification is ambiguous: the method for assigning an enterotype to each clustering result is not clear;

(2) the results are presented incompletely: the analysis results are too simple, the data mining is not deep enough, and visualizations corresponding to the data are lacking.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention aims to provide an automatic typing method based on human intestinal flora.

In order to achieve the purpose, the invention adopts the scheme that:

an automatic typing method based on human intestinal flora comprises the following steps:

1) preparing a genus-level species relative abundance table of all samples;

2) partitioning with the Partitioning Around Medoids (PAM) algorithm, clustering the abundance distributions, and selecting the optimal number of clusters using the Calinski-Harabasz (CH) index;

3) verifying the clustering quality with the silhouette validation technique;

4) performing between-class analysis (BCA) according to the optimal number of clusters;

5) selecting, by LEfSe analysis, the species in each cluster that contributes most to the difference as the enterotype of that cluster, and drawing a boxplot.

Preferably, in step 2), the Calinski-Harabasz index is defined as:

$$CH_k = \frac{B_k/(k-1)}{W_k/(n-k)}$$

wherein $B_k$ is the between-cluster sum of squares, $W_k$ is the within-cluster sum of squares, $n$ is the number of samples, and the number of clusters $k$ that maximizes $CH_k$ is selected.

Preferably, in step 3), the silhouette width $s(i)$ of each data point $i$ is calculated by the following formula:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

where $a(i)$ is the average dissimilarity (or distance) of sample $i$ to all other samples in the same cluster, and $b(i)$ is the average dissimilarity (or distance) of sample $i$ to all objects in the nearest other cluster.

The formula implies $-1 \le s(i) \le 1$. A sample that is much closer to its own cluster than to the nearest other cluster has a high value of $s(i)$, whereas $s(i)$ close to 0 means that the sample lies between the two clusters, and a large negative value of $s(i)$ indicates that the sample has been assigned to the wrong cluster.

Preferably, in step 4), the between-class analysis (BCA) is performed using R and the ade4 package.

Preferably, in step 5), features that differ between groups are first detected with a rank-sum test, and linear discriminant analysis (LDA) is then used to reduce dimensionality and to estimate the effect size of each species, thereby obtaining the LDA score.

Preferably, in step 5), the enterotypes are named in the form of "G" followed by a number.

Preferably, in step 5), the LEfSe analysis workflow is used to identify biomarkers that differ significantly between clusters.

Preferably, in step 5), the boxplot is drawn by using the ggplot2 software package in the R language.

The invention has the beneficial effects that:

(I) The clustering results are grouped and screened for biomarkers using LEfSe, after which the specific enterotype is determined.

(II) The results are comprehensive and include the cluster plots, the biomarker screening, and the enterotype boxplot.

(III) All analysis results are sorted automatically, and after each analysis step is completed the results are automatically summarized, tallied, and visualized.

Drawings

FIG. 1 is a schematic flow chart of the present invention.

FIG. 2 is a graph of the optimal cluster number selection according to the present invention.

FIG. 3 is a between-class analysis cluster plot of the present invention.

FIG. 4 is a between-class analysis cluster plot with sample names according to the present invention.

FIG. 5 is a biomarker bar chart of the present invention.

FIG. 6 is an enterotype boxplot of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention more clear, the technical solutions of the embodiments of the present invention will be described below clearly and completely with reference to the accompanying drawings. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the scope of the present invention. In addition, technical features of various embodiments or individual embodiments provided by the present invention may be arbitrarily combined with each other to form a feasible technical solution, but must be based on the realization of the technical solution by a person skilled in the art, and when the technical solution combination is contradictory or cannot be realized, the technical solution combination should be considered to be absent and not to be within the protection scope of the present invention.

The invention provides an automatic typing method based on human intestinal flora, which is shown in figure 1 and comprises the following steps:

1. File preparation step:

A genus-level species relative abundance table of samples from different populations is obtained from high-throughput sequencing.
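As a minimal sketch of this preparation step, the snippet below converts per-sample read counts into relative abundances. The input layout (a dict mapping each genus to a list of per-sample counts) and the function name `to_relative_abundance` are assumptions made for illustration; the patent does not specify a file format.

```python
def to_relative_abundance(counts):
    """Convert read counts to per-sample relative abundances.

    counts: dict mapping genus name -> list of read counts, one per sample
            (hypothetical layout, assumed for this sketch).
    Returns a dict with the same keys whose values sum to 1 in every
    sample that has at least one read.
    """
    n_samples = len(next(iter(counts.values())))
    # Total reads per sample across all genera.
    totals = [sum(values[s] for values in counts.values())
              for s in range(n_samples)]
    return {
        genus: [v / t if t else 0.0 for v, t in zip(values, totals)]
        for genus, values in counts.items()
    }
```

Samples with no reads are given an abundance of 0 for every genus rather than raising a division error; whether such samples should instead be dropped is a choice left open here.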

2. Selecting the optimal clustering number:

The invention uses the Partitioning Around Medoids (PAM) algorithm to partition the samples and cluster the abundance distributions. PAM is derived from the basic k-means algorithm, but has the advantage of supporting arbitrary distance measures and is more robust than k-means. The procedure is supervised only in the sense that a predetermined number of clusters is given as input; the algorithm then divides the data into that number of clusters.
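The PAM idea described above can be sketched in a few lines. This is an illustrative pure-Python version that works on a precomputed distance matrix, with a greedy BUILD phase followed by SWAP iterations; it is not the patent's implementation, and the function names `total_cost` and `pam` are assumptions. The choice of distance measure is left to the caller because the text does not fix one.

```python
def total_cost(dist, medoids):
    """Sum over all points of the distance to the nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))


def pam(dist, k):
    """Partition n points into k clusters from an n x n distance matrix.

    Returns (medoids, labels), where medoids are point indices and
    labels[i] is the position in `medoids` of the medoid nearest to i.
    """
    n = len(dist)
    # BUILD: greedily add the point that lowers the total cost the most.
    medoids = []
    while len(medoids) < k:
        candidates = [c for c in range(n) if c not in medoids]
        best = min(candidates, key=lambda c: total_cost(dist, medoids + [c]))
        medoids.append(best)
    # SWAP: exchange a medoid with a non-medoid while that lowers the cost.
    improved = True
    while improved:
        improved = False
        current = total_cost(dist, medoids)
        for pos in range(k):
            for c in range(n):
                if c in medoids:
                    continue
                trial = medoids[:pos] + [c] + medoids[pos + 1:]
                cost = total_cost(dist, trial)
                if cost < current:
                    medoids, current, improved = trial, cost, True
    labels = [min(range(k), key=lambda j: dist[i][medoids[j]])
              for i in range(n)]
    return medoids, labels
```

Because only pairwise distances are used, any dissimilarity between abundance profiles can be plugged in, which is the property the text highlights over k-means.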

To evaluate the optimal number of clusters, the present invention uses the Calinski-Harabasz (CH) index, which performs well at recovering the number of clusters. It is defined as:

$$CH_k = \frac{B_k/(k-1)}{W_k/(n-k)}$$

wherein $B_k$ is the between-cluster sum of squares (i.e., the squared distances between all pairs of points $i$ and $j$ that are not in the same cluster), $W_k$ is the within-cluster sum of squares (i.e., the squared distances between all pairs of points $i$ and $j$ that are in the same cluster), and $n$ is the number of samples. The index captures the following idea: the more the distances between clusters exceed the distances within clusters, the better the clustering. Therefore, the number of clusters $k$ that maximizes $CH_k$ is chosen.
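The Calinski-Harabasz index can be illustrated with the sketch below. It uses the common centroid-based form, CH = (B/(k-1)) / (W/(n-k)), on points in Euclidean space; this is an assumption made for the example, since the text does not state which distance measure or implementation is used, and the function name `calinski_harabasz` is hypothetical.

```python
def calinski_harabasz(points, labels):
    """Centroid-based Calinski-Harabasz index.

    points: list of equal-length numeric vectors (one per sample).
    labels: cluster label of each sample; at least 2 and fewer than n clusters.
    """
    n = len(points)
    dims = len(points[0])
    clusters = sorted(set(labels))
    k = len(clusters)
    overall = [sum(p[d] for p in points) / n for d in range(dims)]
    between = 0.0  # B: size-weighted squared distance of centroids to the overall mean
    within = 0.0   # W: squared distance of samples to their own centroid
    for c in clusters:
        members = [p for p, lab in zip(points, labels) if lab == c]
        centroid = [sum(p[d] for p in members) / len(members)
                    for d in range(dims)]
        between += len(members) * sum(
            (centroid[d] - overall[d]) ** 2 for d in range(dims))
        within += sum(
            (p[d] - centroid[d]) ** 2 for p in members for d in range(dims))
    return (between / (k - 1)) / (within / (n - k))
```

To pick the number of clusters, one would compute this value for each candidate k (with the labels produced by the clustering at that k) and keep the k with the largest index.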

3. Clustering validation step:

Cluster validation methods are very useful for evaluating the quality of the clusters with respect to the underlying data points. The invention uses the silhouette validation technique. The silhouette width $s(i)$ of each data point $i$ is calculated by:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

where $a(i)$ is the average dissimilarity (or distance) of sample $i$ to all other samples in the same cluster, and $b(i)$ is the average dissimilarity (or distance) of sample $i$ to all objects in the nearest other cluster.

The formula implies $-1 \le s(i) \le 1$. A sample that is much closer to its own cluster than to the nearest other cluster has a high value of $s(i)$, whereas $s(i)$ approaching 0 means that the sample lies between the two clusters. A large negative value of $s(i)$ indicates that the sample has been assigned to the wrong cluster.
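The silhouette computation can be sketched directly from the definitions of a(i) and b(i), using s(i) = (b(i) - a(i)) / max(a(i), b(i)). The version below takes a precomputed distance matrix; the function name `silhouette_widths` and the convention of returning 0 for a sample that is alone in its cluster are assumptions of this sketch.

```python
def silhouette_widths(dist, labels):
    """Silhouette width of every sample.

    dist:   n x n matrix of pairwise dissimilarities.
    labels: cluster label of each sample (at least two distinct labels).
    """
    n = len(dist)
    clusters = sorted(set(labels))
    widths = []
    for i in range(n):
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not own:
            widths.append(0.0)  # singleton cluster: width set to 0 by convention
            continue
        # a(i): mean dissimilarity to the other members of the same cluster.
        a = sum(dist[i][j] for j in own) / len(own)
        # b(i): smallest mean dissimilarity to the members of any other cluster.
        b = min(
            sum(dist[i][j] for j in range(n) if labels[j] == c) / labels.count(c)
            for c in clusters if c != labels[i]
        )
        widths.append((b - a) / max(a, b))
    return widths
```

Averaging the returned widths gives a single score per clustering, and individual negative widths point to samples that may sit in the wrong cluster.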

4. Between-class analysis (BCA) step:

Between-class analysis (BCA) is performed to support the clustering and to identify the drivers of the enterotypes. The analysis is performed using R and the ade4 package. Prior to this analysis, for the Illumina dataset, very low-abundance genera, namely those whose average abundance across all samples is below 0.01%, are removed to reduce noise. Between-class analysis is a special case of principal component analysis with an instrumental variable that is a qualitative factor (here, the enterotype cluster). It enables us to find the principal components that separate the clusters.
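The noise filter applied before the between-class analysis is simple enough to sketch. The snippet assumes relative abundances are stored as fractions (so 0.01% corresponds to 1e-4) in a dict mapping each genus to its per-sample values; that layout and the function name `filter_low_abundance` are illustrative assumptions, and the between-class analysis itself is left to the ade4 package named in the text.

```python
def filter_low_abundance(table, min_mean=1e-4):
    """Drop genera whose mean relative abundance is below `min_mean`.

    table:    dict mapping genus name -> list of relative abundances
              (fractions, one per sample; hypothetical layout).
    min_mean: threshold on the mean across samples; 1e-4 equals 0.01%.
    """
    return {
        genus: values
        for genus, values in table.items()
        if sum(values) / len(values) >= min_mean
    }
```

If the table stores percentages instead of fractions, the threshold would need to be 0.01 rather than 1e-4.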

5. Inter-cluster LEfSe analysis step:

To screen for biomarkers that differ significantly between clusters, features that differ between groups are first detected with a rank-sum test; linear discriminant analysis (LDA) is then used to reduce dimensionality and to estimate the effect size of each species, thereby obtaining the LDA score.
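The rank-sum stage can be illustrated with the Kruskal-Wallis H statistic, the multi-group rank-sum test. The sketch below computes H for one taxon from its abundances in each cluster, using average ranks for ties and omitting the tie-correction factor; it covers only this first stage, not the LDA effect-size step, and the function name `kruskal_wallis_h` is an assumption.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction).

    groups: list of lists, the abundances of one taxon in each cluster.
    """
    # Pool all values, remembering which group each one came from.
    pooled = sorted(
        (value, g) for g, values in enumerate(groups) for value in values)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    start = 0
    while start < n:
        # Find the run of tied values beginning at `start`.
        end = start
        while end + 1 < n and pooled[end + 1][0] == pooled[start][0]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # average of 1-based ranks in the run
        for pos in range(start, end + 1):
            rank_sums[pooled[pos][1]] += mean_rank
        start = end + 1
    weighted = sum(r * r / len(g) for r, g in zip(rank_sums, groups))
    return 12 / (n * (n + 1)) * weighted - 3 * (n + 1)
```

Under the null hypothesis H approximately follows a chi-square distribution with one fewer degree of freedom than the number of groups, so taxa with a large H are the candidates passed on to the LDA step.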

6. Enterotype boxplot display step:

The species in each cluster that contributes most to the difference is selected by LEfSe analysis as the enterotype of that cluster, and a boxplot is drawn.
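The selection rule in this step can be sketched as picking, for each cluster, the taxon with the highest LDA score. The input layout (a dict mapping each cluster to its taxon scores), the function name `name_enterotypes`, and the exact label format "G1", "G2", ... are assumptions for illustration; the text only says that enterotypes are named as G plus a number.

```python
def name_enterotypes(lda_scores):
    """Pick the top-scoring taxon of each cluster as its enterotype.

    lda_scores: dict mapping cluster id -> dict of taxon -> LDA score
                (hypothetical layout, e.g. parsed from LEfSe output).
    Returns a dict mapping labels "G1", "G2", ... to the chosen taxon,
    numbered in sorted order of the cluster ids.
    """
    named = {}
    for number, cluster in enumerate(sorted(lda_scores), start=1):
        scores = lda_scores[cluster]
        named[f"G{number}"] = max(scores, key=scores.get)
    return named
```

The abundances of each chosen taxon, grouped by cluster, are what the boxplot would then display.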

Examples

To demonstrate the practicability of the claimed technical solution, its practical application is further described below, taking metagenomic data of human intestinal bacteria from different countries as an example. The example is intended only to show the spirit of the technical solution more clearly and does not limit it; all technical solutions in accordance with the spirit of the present invention fall within the protection scope of this patent.

Taking the application of human intestinal bacteria metagenome data in different countries as an example:

An initial data set containing 33 samples was downloaded from the address: https:// energy type. The data set contains a genus-level abundance table of human intestinal microorganisms from different countries and regions.

The samples were partitioned with the Partitioning Around Medoids (PAM) algorithm, the abundance distributions were clustered, and the optimal number of clusters was selected using the Calinski-Harabasz (CH) index, see FIG. 2.

The optimal clustering was verified with the silhouette validation technique, and between-class analysis (BCA) was performed according to the optimal number of clusters, see FIG. 3 and FIG. 4.

Biomarkers that differ between clusters were screened by LEfSe analysis and taken as the enterotype of each cluster, see FIG. 5; finally, the enterotype boxplot was drawn, see FIG. 6.

In a preferred embodiment of the present invention, in the inter-cluster LEfSe analysis step and the enterotype boxplot step, the enterotypes are named in the form of "G" followed by a number.

In a preferred embodiment of the present invention, in the inter-cluster LEfSe analysis step, the LEfSe analysis workflow is used to identify biomarkers that differ significantly between clusters.

In a preferred embodiment of the present invention, in the between-class analysis (BCA) step and the enterotype boxplot step, the figures are drawn using the ggplot2 software package in the R language.

In addition, it should be noted that the steps of the embodiments of the present invention can be integrated and run on a Linux system with a single command. The scripts are chained together using shell, and the algorithms and plotting rely mainly on the R packages ade4 and ggplot2.

Claims

1. An automatic typing method based on human intestinal flora comprises the following steps:

1) preparing a genus-level species relative abundance table of all samples;

2) partitioning with the Partitioning Around Medoids (PAM) algorithm, clustering the abundance distributions, and selecting the optimal number of clusters using the Calinski-Harabasz index;

3) verifying the clustering quality with the silhouette validation technique;

4) performing between-class analysis (BCA) according to the optimal number of clusters;

5) selecting, by LEfSe analysis, the species in each cluster that contributes most to the difference as the enterotype of that cluster, and drawing a boxplot.

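2. The method for the automated typing of human intestinal flora according to claim 1, wherein in step 2), the Calinski-Harabasz index is defined as:

$$CH_k = \frac{B_k/(k-1)}{W_k/(n-k)}$$

wherein $B_k$ is the between-cluster sum of squares, $W_k$ is the within-cluster sum of squares, $n$ is the number of samples, and the number of clusters $k$ that maximizes $CH_k$ is selected.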

3. The method for the automated typing of human intestinal flora according to claim 1, wherein in step 3), the silhouette width $s(i)$ of each data point $i$ is calculated by the following formula:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

where $a(i)$ is the average dissimilarity or distance of sample $i$ to all other samples in the same cluster, and $b(i)$ is the average dissimilarity or distance of sample $i$ to all objects in the nearest other cluster.

4. The method for the automated typing of human intestinal flora according to claim 1, wherein in step 4), the between-class analysis (BCA) is performed using R and the ade4 package.

5. The method for the automated typing of human intestinal flora according to claim 1, wherein in step 5), features that differ between groups are detected with a rank-sum test, and linear discriminant analysis (LDA) is used to reduce dimensionality and to estimate the effect size of each species, thereby obtaining the LDA score.

6. The method for the automated typing of human intestinal flora according to claim 1, wherein in step 5), the enterotypes are named in the form of "G" followed by a number.

7. The method for the automated typing of human intestinal flora according to claim 1, wherein in step 5), the LEfSe analysis workflow is used to identify biomarkers that differ significantly between clusters.

8. The method for the automated typing of human intestinal flora according to any one of claims 1 to 7, wherein in step 5), the ggplot2 software package in the R language is used to draw the boxplot.