CN113486954B - Intestinal microecological differential bacteria classification processing method and intestinal health assessment method

Info

Publication number: CN113486954B
Application number: CN202110764854.7A
Authority: CN
Inventors: 陈晓春; 王小军; 覃涛
Original assignee: Guangxi Aisheng Life Technology Co ltd
Current assignee: Guangxi Aisheng Life Technology Co ltd
Priority date: 2021-07-06
Filing date: 2021-07-06
Publication date: 2023-04-07
Anticipated expiration: 2041-07-06
Also published as: CN113486954A

Abstract

The invention discloses a classification processing method of intestinal microecological differential bacteria and an intestinal health assessment method, wherein the classification processing method comprises the following steps: step S10, collecting a sample set A, wherein the sample set A comprises n samples at least consisting of a non-healthy group sample DG and a control group sample CG; step S11, converting the digitized characteristics of each sample from absolute abundance to relative abundance; step S12, dividing all samples in the sample set A into k groups of intestinal types according to the similarity of absolute abundance, marking the classified intestinal type set as AB, then separating a non-healthy group sample DG and a control group sample CG in the intestinal type set AB, marking the non-healthy group sample set as DG, and the control group sample set as CG; and step S13, calculating and sequencing the relative abundance difference between the sample set DG of the non-healthy group and the sample set CG of the control group to obtain the difference characteristic DF corresponding to each of the k groups of intestinal types. The invention not only can effectively extract the sample characteristics with high dimension and large abundance variation range, but also has stronger anti-interference performance.

Description

Intestinal micro-ecological differential bacteria classification processing method and intestinal health assessment method

Technical Field

The invention relates to an intestinal microbial data analysis and processing method, in particular to an intestinal microecological differential bacterium classification and processing method and an intestinal health assessment method.

Background

In recent years, the gut microbiota has been found not only to play a role in cardiovascular and metabolic non-healthy conditions, but also to be an important environmental factor in several tumors, including colorectal, liver, and breast cancers. With the development of genomic and metabolomic technologies such as next-generation gene sequencing, the role of the intestinal microbiota in tumorigenesis and tumor development has attracted attention, and a growing number of animal and clinical studies show that the intestinal microbiota can serve as markers and treatment targets for non-healthy screening and prognosis prediction in intestinal health, liver cancer, and the like.

On the other hand, the structure of the intestinal microbiota changes with age. In particular, age-related changes in diet can alter the biological diversity of the intestinal flora, determine the relative abundance of specific flora, cause flora disorder, and negatively affect the physiology of the host. The microbiota profile of aging individuals is not dominated by a particular species group but is characterized by reduced diversity of the gut flora. For example, in patients with a high debilitation score, lactic acid bacteria in the faecal sample are significantly reduced and the ratio of prevotella to bacteroides/prevotella is reduced, while enterobacteriaceae and ruminococcus are increased, and there is a significant negative correlation between debilitation and gut flora diversity.

The data used for intestinal microecology analysis are often obtained by high throughput sequencing, described by a matrix structure. Each row of the matrix corresponds to a sample and the abundance of different flora contained in the sample, and all the sample data are combined together to form a matrix structure. The sum of the abundance of the flora in each sample is stable, and if the abundance of one component is decreased, the abundance of one or more other components may be increased.

Most existing references focus on constructing an overall processing system and outputting analysis results, and their content is relatively homogeneous. Patent document CN105046094B constructs a processing system for the analysis and storage of intestinal flora data, including a dynamic database for obtaining the latest detection parameters.

Patent document CN107506582A constructs a health risk prediction system based on intestinal microorganisms and focuses on the overall module construction and the format design of the evaluation result.

Patent publication No. CN108841974A evaluates maturity by comparing the similarity in composition of infant and maternal intestinal microorganisms.

Patent document CN110144415A designs a method for predicting health and healthy immunity of introduced cows based on intestinal flora, and adopts conventional abundance analysis and diversity analysis.

Patent document CN111161794A evaluates the intestinal flora of a target object to obtain specific intestinal flora information.

Patent documents CN111462819A and CN112151118A both focus on software automation of the analysis process of intestinal flora data; the analysis contents mainly include abundance data, probiotic and pathogenic bacteria analysis, diversity analysis, and interpretation in combination with a disease-related database.

In practical application, there are many methods for analyzing intestinal micro-ecological data. Generally, a non-healthy group and a healthy group are constructed and contrasted, differential bacteria are obtained through comparative analysis, and a model capable of predicting whether a person is non-healthy is then constructed by supervised learning. The difficulty lies in how to effectively extract sample features that have high dimension and a large abundance variation range.

Disclosure of Invention

The invention aims to solve the technical problems of the prior art by providing an intestinal micro-ecological differential bacteria classification processing method and an intestinal health assessment method which address the wide distribution range and strong interference of intestinal flora data, acquire key features based on a cluster classification algorithm, and further improve the intermediate analysis process.

In order to solve the technical problems, the invention adopts the following technical scheme.

A method for classifying intestinal microecological differential bacteria comprises the following steps: step S10, collecting a sample set A, wherein the sample set A comprises n samples at least consisting of a non-healthy group sample DG and a control group sample CG, and the sample characteristics of each sample consist of the basic information of the sample and the absolute abundance of intestinal flora obtained after 16S sequencing; step S11, converting the digitized characteristics of each sample from absolute abundance to relative abundance, and screening out the absolute abundance characteristics of which the occurrence frequency is less than a preset value and the relative abundance is close to zero in the samples; step S12, dividing all samples in the sample set A into k groups of intestine types according to the similarity of absolute abundance, recording the classified intestine type set as AB: $\{B_1, B_2, \ldots, B_k\}$, then separating the non-healthy group samples DG from the control group samples CG in the intestine type set AB, recording the non-healthy group sample set as DG: $\{D_1, D_2, \ldots, D_k\}$ and the control group sample set as CG: $\{C_1, C_2, \ldots, C_k\}$; step S13, calculating the relative abundance difference between the sample set DG of the non-healthy group and the sample set CG of the control group and sequencing to obtain the difference characteristic DF corresponding to each of the k groups of intestine types: $\{F_1, F_2, \ldots, F_i, \ldots, F_k\}$.

Preferably, in step S10, the features of n samples are arranged into a matrix, each row of the matrix is composed of a sample feature, and the sample features are expressed as: number + age + sampling address + non-healthy type + antibiotic type + $OTU_1 + OTU_2 + \ldots + OTU_m$, where OTU represents absolute abundance and m represents the number of OTUs contained in each sample feature.

Preferably, in step S11, if S is the number of samples, among the n samples in the matrix, in which a given abundance characteristic exceeds 0.01%, the frequency of occurrence of that flora is the ratio S/n.

Preferably, in step S11, for each intestine type data set $B_i$ in the intestine type set AB, low abundance characteristic interference data are screened out, wherein the low abundance characteristic is defined as: a frequency of occurrence of less than 10% and a relative abundance value of less than 0.01%.

A method for assessing gut health, comprising: step S20, constructing k machine learning models according to k groups of intestinal type difference characteristics obtained by the intestinal micro-ecological difference bacterium classification processing method; step S21, inputting a group of sample data with known absolute abundance, converting the absolute abundance of the sample data into relative abundance, and predicting the unhealthy probability of the sample data by using the k machine learning models obtained in the step S20.

Preferably, in step S20, the process of constructing k machine learning models includes: step S200, inputting k groups of intestine type difference characteristics DF: $\{F_1, F_2, \ldots, F_i, \ldots, F_k\}$, and setting the difference characteristic of the i-th group of intestine types as $F_i$, the residual characteristic quantity after screening as p, and the sample quantity as $s_i$; step S201, for a data set containing $s_i$ samples and p features: selecting $s_m$ samples from the $s_i$ samples through random sampling, selecting t characteristics by using a difference characteristic screening method, establishing a decision tree by using the t characteristics for the selected samples, repeating the sampling step k times, and then generating k decision tree models dcTree: $\{tree_1, tree_2, \ldots, tree_k\}$.

Preferably, in step S21, the prediction process of the unhealthy probability includes: step S210, converting the absolute abundance of the flora in the input sample data into a relative abundance, and setting the input sample as x: $[R_1, R_2, \ldots, R_i, \ldots, R_m]$; step S211, comparing the input sample x with the samples in the sample set A one by one, and finding the sample S with the highest similarity; step S212, judging the intestinal type $B_i$ to which the sample S belongs; step S213, predicting the input sample x by using the decision tree corresponding to the intestinal type $B_i$.

Preferably, in step S21, the prediction process of the unhealthy probability includes: and predicting the input sample x by using k machine learning models respectively to obtain k unhealthy probabilities, and determining and outputting the highest unhealthy probability.

In the intestinal micro-ecological difference bacterium classification processing method and the intestinal health assessment method, the steps of sample characteristic digitization processing, format conversion and filtration of digitized characteristics, sample similarity clustering and difference characteristic screening are sequentially carried out on the collected sample set A.

Drawings

FIG. 1 is a flow chart of the intestinal micro-ecological difference bacteria classification processing method of the present invention;

FIG. 2 is a flow chart of the intestinal health assessment method of the present invention.

Detailed Description

The invention is described in more detail below with reference to the figures and examples.

The invention discloses a method for classifying and processing intestinal microecological differential bacteria, please refer to fig. 1, which comprises the following steps:

step S10, collecting a sample set A, wherein the sample set A comprises n samples at least consisting of a non-healthy group sample DG and a control group sample CG, and the sample characteristics of each sample consist of basic information of the sample and the absolute abundance of intestinal flora obtained after 16S sequencing;

step S11, converting the digitized characteristics of each sample from absolute abundance into relative abundance, and screening out the absolute abundance characteristics of which the occurrence frequency is less than a preset value and the relative abundance is close to zero in the samples;

step S12, dividing all samples in the sample set A into k groups of intestine types according to the similarity of absolute abundance, recording the classified intestine type set as AB: $\{B_1, B_2, \ldots, B_k\}$, then separating the non-healthy group samples DG from the control group samples CG in the intestine type set AB, recording the non-healthy group sample set as DG: $\{D_1, D_2, \ldots, D_k\}$ and the control group sample set as CG: $\{C_1, C_2, \ldots, C_k\}$;

step S13, calculating the relative abundance difference between the sample set DG of the non-healthy group and the sample set CG of the control group and sequencing to obtain the difference characteristic DF corresponding to each of the k groups of intestine types: $\{F_1, F_2, \ldots, F_i, \ldots, F_k\}$.

In the method, sample feature digitization, format conversion and filtering of the digitized features, sample similarity clustering, and difference feature screening are carried out in sequence on the collected sample set A. Compared with the prior art, the method obtains key features based on a cluster classification algorithm and further improves the intermediate analysis process. It can effectively extract sample features with high dimension and a large abundance variation range, has stronger anti-interference performance, and better meets application requirements.

For details of the implementation of steps S10 to S13, please refer to the following first to fourth embodiments.

Example one

In this embodiment, the step S10 mainly implements a digital processing process of the sample characteristics.

For the step S10, a sample set A containing n samples is collected, including the non-healthy samples DG and the control group samples CG labeled by an expert. The number of non-healthy samples DG is n1 and the number of control group samples CG is n2; the values of n1 and n2 should be close, and n = n1 + n2. The sample characteristics are composed of the basic information of the sample and the absolute abundance (OTU characteristics) of the intestinal flora obtained after 16S sequencing, and each sample comprises at most m characteristics.

In step S10, the features of n samples are arranged into a matrix, each row of the matrix is composed of a sample feature, and the sample features are expressed as: number + age + sampling address + non-healthy type + antibiotic type + $OTU_1 + OTU_2 + \ldots + OTU_m$, where OTU represents absolute abundance and m represents the number of OTUs contained in each sample feature. Note that there are m OTU feature values per sample.

During this processing, samples with a non-health history or an antibiotic administration history are filtered out.
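The row layout and the history filter described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the class name `Sample`, the use of `None` to mean "no antibiotic" or "control group", and the `has_non_health_history` flag are all assumptions introduced here for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    number: str
    age: int
    sampling_address: str
    non_healthy_type: Optional[str]        # None for control-group samples (assumed encoding)
    antibiotic_type: Optional[str]         # None when no antibiotic was taken (assumed encoding)
    otu: List[float]                       # absolute abundances OTU_1 .. OTU_m
    has_non_health_history: bool = False   # hypothetical flag, not named in the text


def filter_samples(samples):
    """Drop samples with an antibiotic history or a non-health history."""
    return [
        s for s in samples
        if s.antibiotic_type is None and not s.has_non_health_history
    ]


def to_matrix(samples):
    """Arrange the OTU part of each sample as one row of the abundance matrix."""
    return [list(s.otu) for s in samples]
```

Keeping the basic information and the OTU vector in one record makes it easy to filter on the metadata first and only then build the numeric matrix used by the later steps.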

Example two

In this embodiment, the step S11 mainly implements format conversion and filtering functions of the digital feature.

For the step S11, the method further comprises a format conversion step and a low abundance feature screening step:

For the format conversion step, the relative abundance of a flora is the ratio of the abundance of that flora in a given sample to the sum of the abundances of all flora in that sample, so in said step S11 the absolute abundance is converted into a relative abundance as follows:

Let $R_{ij}$ be the abundance of the j-th feature (flora) in the i-th sample. The non-healthy group DG and the control group CG have n1 and n2 samples respectively, and each sample has m features. The relative abundance of the j-th feature in the i-th sample is then:

$$\frac{R_{ij}}{\sum_{l=1}^{m} R_{il}}$$

Based on the above calculation, the digitized features of each sample are converted from absolute abundance to relative abundance.
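The conversion just described amounts to dividing each row of the abundance matrix by its row sum. A minimal sketch follows; the function name `to_relative_abundance` and the handling of all-zero rows are assumptions made for this example.

```python
def to_relative_abundance(matrix):
    """Convert each row of absolute abundances into relative abundances.

    Each value is divided by the sum of its row, so every non-empty row
    sums to 1 afterwards.
    """
    result = []
    for row in matrix:
        total = sum(row)
        if total == 0:
            # Assumption: a sample with no reads at all is left as zeros.
            result.append([0.0] * len(row))
        else:
            result.append([value / total for value in row])
    return result
```

For example, a sample with absolute abundances `[2, 6, 2]` has a row sum of 10, so its relative abundances are 0.2, 0.6 and 0.2.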

For the step of screening out the low abundance features, the occurrence frequency of each flora is calculated first. In step S11, if S is the number of samples, among the n samples in the matrix (composed of all samples and features), in which a given abundance feature exceeds 0.01%, the occurrence frequency of that flora is the ratio S/n.

The specific screening means is as follows: in said step S11, for each intestine type data set $B_i$ in the intestine type set AB, low abundance feature interference data are screened out, wherein the low abundance feature is defined as: a frequency of occurrence of less than 10% and a relative abundance value of less than 0.01%. That is, OTU features that occur in the samples with a low frequency and a relative abundance close to zero are screened out.

After the treatment, the abundance characteristics of the intestinal flora are represented by relative abundance instead of absolute abundance, and the number of the characteristics is reduced from m to p, wherein p < m.
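The screening step can be sketched as below, under one reading of the definition: a feature counts as "occurring" in a sample when its relative abundance exceeds 0.01% (1e-4), and it is removed when its occurrence frequency S/n is below 10%. The function names and default parameters are assumptions for illustration.

```python
def occurrence_frequency(rel_matrix, j, threshold=1e-4):
    """Return S/n for feature j, where S counts samples with abundance > threshold."""
    n = len(rel_matrix)
    s = sum(1 for row in rel_matrix if row[j] > threshold)
    return s / n


def filter_low_abundance(rel_matrix, min_frequency=0.10, threshold=1e-4):
    """Keep only features whose occurrence frequency is at least min_frequency.

    Returns the kept column indices and the reduced matrix (m features -> p features).
    """
    m = len(rel_matrix[0])
    kept = [
        j for j in range(m)
        if occurrence_frequency(rel_matrix, j, threshold) >= min_frequency
    ]
    reduced = [[row[j] for j in kept] for row in rel_matrix]
    return kept, reduced
```

Returning the kept indices alongside the reduced matrix lets a new input sample be projected onto the same p features later.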

Example three

In this embodiment, the step S12 is a sample similarity clustering process.

Said step S12 involves a sample similarity clustering procedure: all samples in the sample set A are classified according to the similarity of their OTU features, so that samples with high similarity fall into the same class. The samples are divided into k intestinal types, and each intestinal type comprises a plurality of samples.

Further, in the step S12, given a sample x: $[x_1, x_2, \ldots, x_i, \ldots, x_m]$ and a sample y: $[y_1, y_2, \ldots, y_i, \ldots, y_m]$, the similarity sim of sample x and sample y is calculated using the following formula:

Since the known sample set A contains a large number of samples with the same structure as x or y, a data clustering method can be adopted: the similarity between every two samples is calculated, the samples are divided into a plurality of groups, and samples with high similarity are placed in the same group. If the samples are allowed to be divided into at most k groups, A can be expressed as $\{B_1, B_2, \ldots, B_k\}$, where each group comprises a plurality of samples with higher similarity to each other, and samples of different groups have lower similarity.
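One way to realise this grouping is agglomerative clustering over a precomputed pairwise matrix. The sketch below is an assumption, not the method fixed by the text: it uses average linkage, treats the matrix entries as dissimilarities (0 means identical), and the function name `agglomerative_clusters` is hypothetical.

```python
def agglomerative_clusters(dist, k):
    """Merge samples bottom-up until k groups remain.

    dist is a symmetric matrix of pairwise dissimilarities; the two groups
    with the smallest average pairwise dissimilarity are merged at each step.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pair_sum = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                avg = pair_sum / (len(clusters[a]) * len(clusters[b]))
                if best is None or avg < best[0]:
                    best = (avg, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]
```

Any other clustering method that places highly similar samples in the same group would fit the description equally well; average linkage is chosen here only because it is short to write down.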

For a new sample $s_n$, to determine which group it belongs to, the similarity between $s_n$ and the samples in the different groups is calculated, the most similar sample q is found, and $s_n$ is assigned to the group to which q belongs.
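The assignment rule for a new sample is a nearest-sample lookup. In the sketch below the Bray-Curtis dissimilarity stands in for the similarity measure, which is an assumption made for illustration; the names `bray_curtis` and `assign_group` are likewise hypothetical.

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity: 0 for identical profiles, 1 for disjoint ones.

    Assumption: used here only as a stand-in for the similarity measure.
    """
    numerator = sum(abs(a - b) for a, b in zip(x, y))
    denominator = sum(a + b for a, b in zip(x, y))
    return numerator / denominator if denominator else 0.0


def assign_group(new_sample, groups):
    """Return the label of the group that contains the sample closest to new_sample.

    groups maps a group label to the list of abundance vectors in that group.
    """
    best_label, best_distance = None, float("inf")
    for label, members in groups.items():
        for member in members:
            d = bray_curtis(new_sample, member)
            if d < best_distance:
                best_label, best_distance = label, d
    return best_label
```

Because the rule looks at the single closest sample rather than a group centroid, it follows the description directly: the new sample inherits the group of its most similar neighbour.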

The q features with the highest abundance in the sample can be considered (e.g., q = 10) before the similarity and classification are calculated. The classified intestinal type set is designated AB: $\{B_1, B_2, \ldots, B_k\}$. The non-healthy group (DG) samples in AB are then separated from the control group (CG) samples, such that the non-healthy group sample set is designated DG: $\{D_1, D_2, \ldots, D_k\}$ and the control group sample set is designated CG: $\{C_1, C_2, \ldots, C_k\}$.

Example four

In this embodiment, the step S13 mainly involves a process of screening for differential characteristics based on the intestinal type.

For said step S13, in the process of screening based on the difference characteristics of the intestinal type:

The combination of samples obtained after recombination according to the intestinal type category is: $D_1|C_1, D_2|C_2, \ldots, D_i|C_i, \ldots, D_k|C_k$, where $D_i$ and $C_i$ ($1 \le i \le k$) are paired because they belong to the same intestinal type. This allows samples of the same intestinal type to be put together for differential feature analysis.
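The recombination into per-type pairs can be sketched as a simple grouping pass. The input format, a list of `(gut_type, label, vector)` tuples with labels `"DG"` and `"CG"`, and the function name are assumptions for this example.

```python
from collections import defaultdict


def split_by_gut_type(samples):
    """Group samples into (D_i, C_i) pairs, one pair per intestinal type.

    samples is an iterable of (gut_type, label, vector) tuples, where label
    is "DG" for the non-healthy group and "CG" for the control group.
    """
    pairs = defaultdict(lambda: {"DG": [], "CG": []})
    for gut_type, label, vector in samples:
        pairs[gut_type][label].append(vector)
    return {g: (p["DG"], p["CG"]) for g, p in sorted(pairs.items())}
```

Each dictionary value then holds exactly the two sample sets that are compared against each other when the differential features of that intestinal type are computed.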

For the non-healthy group $D_i$ and the control group $C_i$, the flora features are compared: the difference in value of every feature between the two groups of samples is calculated and the values are sorted, giving the difference feature sequence. Finally, the difference features corresponding to the k groups of intestinal types are obtained and recorded as DF: $\{F_1, F_2, \ldots, F_i, \ldots, F_k\}$.

By analyzing the relationship between the difference value of the difference feature sequence and the sample label category, t features (t ≦ q) with obvious advantages can be selected for machine learning modeling analysis of k groups of samples.

Based on the method described in the first to fourth embodiments, the present invention realizes the screening of differential micro-ecological bacteria in the human intestinal tract. On this basis, the present invention further relates to a machine learning model for predicting the health status of the intestinal tract, specifically an intestinal health assessment method which, with reference to FIG. 2, includes:

step S20, constructing k machine learning models according to k groups of intestinal type difference characteristics obtained by the intestinal microecological difference bacterium classification processing method;

step S21, inputting a group of sample data with known absolute abundance, converting the absolute abundance of the sample data into relative abundance, and predicting the unhealthy probability of the sample data by using the k machine learning models obtained in the step S20.

For the step S20, the process of constructing k machine learning models includes:

First, there are k groups of intestinal types; the difference feature of the i-th group is $F_i$, which contains p features and $s_i$ samples. The table corresponding to $F_i$ is as follows:

Sample number	Feature 1	...	Feature j	...	Feature t
Sample 1	Val_11	...	Val_1j	...	Val_1t
...	...	...	...	...	...
Sample j	Val_j1	...	Val_jj	...	Val_jt
...	...	...	...	...	...
Sample si	Val_si1	...	Val_sij	...	Val_sit

In the above table, the original sample contains m features, and p features remain after filtering, where p ≤ m; the q features with the highest abundance among the p features are selected for the similarity calculation, where q ≤ p; finally, t features are obtained by difference analysis, where t ≤ q.

The step S20 further includes:

step S200, inputting k groups of intestine type difference characteristics DF: $\{F_1, F_2, \ldots, F_i, \ldots, F_k\}$, and setting the difference characteristic of the i-th group of intestine types as $F_i$, the residual characteristic quantity after screening as p, and the sample quantity as $s_i$;

step S201, for a data set containing $s_i$ samples and p features:

selecting $s_m$ samples from the $s_i$ samples through random sampling, selecting t characteristics by using a difference characteristic screening method, establishing a decision tree by using the t characteristics for the selected samples, repeating the sampling step k times, and then generating k decision tree models dcTree: $\{tree_1, tree_2, \ldots, tree_k\}$.

That is, for a data set containing si samples and p features: selecting sm samples from the si samples through random sampling, selecting t features by using a difference feature screening method, and establishing a decision tree for the selected samples by using the features. And repeating the sampling step k times to generate k decision trees and form a decision tree model. And for new unclassified data, the k decision tree models established above can be used for judging one by one, so that k judgment results can be obtained, and finally, the best judgment result is selected as the class to which the new sample belongs, and the result is output.
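The repeated sample-then-fit loop can be illustrated with a deliberately small stand-in. The sketch below fits depth-1 decision trees (decision stumps) rather than full decision trees, draws each training subset with replacement, and reports the fraction of trees voting "non-healthy"; all three choices, and every function name, are assumptions made to keep the example short. The input rows are assumed to be already restricted to the t selected features, with labels 1 for non-healthy and 0 for control.

```python
import random


def train_stump(rows, labels):
    """Fit a depth-1 decision tree: pick the (feature, threshold, direction)
    with the fewest training errors."""
    best = None
    for j in range(len(rows[0])):
        for thr in sorted({r[j] for r in rows}):
            for high_label in (0, 1):
                preds = [high_label if r[j] > thr else 1 - high_label for r in rows]
                errors = sum(p != y for p, y in zip(preds, labels))
                if best is None or errors < best[0]:
                    best = (errors, j, thr, high_label)
    _, j, thr, high_label = best
    return lambda x: high_label if x[j] > thr else 1 - high_label


def train_bagged_trees(rows, labels, n_trees, sm, seed=0):
    """Repeat 'sample sm rows, fit one tree' n_trees times."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in range(sm)]  # sampling with replacement
        trees.append(train_stump([rows[i] for i in idx], [labels[i] for i in idx]))
    return trees


def non_healthy_fraction(trees, x):
    """Fraction of trees that classify x as non-healthy (label 1)."""
    return sum(tree(x) for tree in trees) / len(trees)
```

In practice a full decision tree learner from an established library would replace `train_stump`; the surrounding sampling loop would stay the same.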

In practical applications, the above model is not limited to the decision tree, but is also applicable to other machine learning models such as linear regression, K-nearest neighbor or support vector machine methods.

For step S21, with reference to FIG. 2, there are two ways to realize the non-health probability prediction.

In a first way, the prediction process of the non-health probability may be:

step S210, converting the absolute abundance of the flora in the input sample data into a relative abundance, and setting the input sample as x: $[R_1, R_2, \ldots, R_i, \ldots, R_m]$;

step S211, comparing the input sample x with the samples in the sample set A one by one, and finding the sample S with the highest similarity;

step S212, judging the intestinal type $B_i$ to which the sample S belongs;

step S213, predicting the input sample x by using the decision tree corresponding to the intestinal type $B_i$.

In the above steps, the intestinal type to which S belongs is looked up, and when S belongs to intestinal type k (0 < k < NC + 1), the machine learning model of intestinal type k is used to predict the unhealthy probability of the sample.

In a second way, the prediction process of the non-health probability may be:

predicting the input sample x by using the k machine learning models respectively to obtain k unhealthy probabilities, and determining and outputting the highest unhealthy probability.

In this way, no similarity comparison is carried out: the sample x is input directly into the k machine learning models to obtain k predicted non-health probabilities, and the highest of these is selected and output as the non-health probability of the sample x.
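The two prediction strategies can be contrasted in a short sketch. Here each model is assumed to be a callable that maps a relative-abundance vector to a non-health probability, the reference set is a list of `(vector, gut_type)` pairs, and the L1 distance is a placeholder for the similarity measure; none of these interfaces comes from the text.

```python
def l1_distance(x, y):
    """Placeholder dissimilarity (assumption): sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def predict_by_nearest_gut_type(x, reference, models, distance=l1_distance):
    """First way: find the closest reference sample, then use its type's model.

    reference is a list of (vector, gut_type); models maps gut_type to a
    callable returning a non-health probability.
    """
    _, gut_type = min(reference, key=lambda item: distance(x, item[0]))
    return gut_type, models[gut_type](x)


def predict_by_max_probability(x, models):
    """Second way: run every model and report the highest probability."""
    return max(model(x) for model in models.values())
```

The first strategy evaluates a single model but needs a similarity search over the reference samples; the second skips the search at the cost of evaluating all k models.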

It should be noted that the present invention relates to both a method for classifying and processing intestinal micro-ecological difference bacteria and a method for evaluating intestinal health. In practical applications the two can be used in combination, or either can be applied separately as needed. The present invention does not limit the specific manner of application; that is, both the classification processing method and the intestinal health assessment method are within the protection scope of the present invention.

In order to more clearly describe the technical solution of the present invention, the present invention provides the following specific detailed examples.

Example one

The embodiment relates to an evaluation method for identifying intestinal health based on human intestinal flora conditions, which comprises the following steps:

Step S1, preparing group A samples, preprocessing the samples, deleting unqualified samples, and filtering out samples with an unhealthy history or an antibiotic taking history. Then, the flora features whose relative abundance is less than 0.01% in more than 90% of the samples are screened out.

Step S2, calculating the similarity between every two samples in group A according to the similarity formula, placing samples with higher similarity in one group, and classifying the samples into 5 groups of intestine types, namely $\{B_1, B_2, B_3, B_4, B_5\}$.

Each intestine type data set $B_i$ is first converted to relative abundance according to the relative abundance formula.

Step S3, the samples of the 5 intestinal types are divided into the non-healthy group DG: $\{D_1, D_2, D_3, D_4, D_5\}$ and the control group CG: $\{C_1, C_2, C_3, C_4, C_5\}$, and the relative abundance difference between the two groups of samples is calculated and sorted. The differential flora of the 5 intestinal type data sets is obtained and recorded as DF: $\{F_1, F_2, F_3, F_4, F_5\}$.

Step S4, for each intestinal type data set, only the differential bacterial features DF: $\{F_1, F_2, F_3, F_4, F_5\}$ are retained. Then $s_m$ samples are selected by random sampling and used to construct k decision trees, forming a decision tree model. Thus the 5 intestinal types together yield 5 decision tree models, recorded as dcTree: {tree1, tree2, tree3, tree4, tree5}.

Step S5, for a new input sample to be detected, the absolute abundance is first converted into relative abundance according to the relative abundance formula. The sample is then compared one by one with the samples in the 5 intestinal type sets and the similarity is calculated, to obtain the sample s with the highest similarity to the input sample. If s belongs to intestinal type $B_i$, the new sample is classified as intestinal type $B_i$, and the decision tree model corresponding to intestinal type $B_i$ is used to predict the input sample.

The procedure for the specific test data is as follows:

1. Sample specification and pretreatment; see FIG. 1 for sample processing.

A sample set A, containing both the non-healthy and the control classifications, is known. Each sample is characterized by the relative abundance values of the flora in the intestinal environment obtained after 16S sequencing. Thus, each sample is equivalent to a row vector in a matrix, composed as: number + sampling address + non-healthy type + antibiotic type + $OTU_1 + OTU_2 + \ldots + OTU_{Len}$, where OTU is the relative abundance of a certain flora in a sample, and each sample has 1000 OTU feature values.

If, for a certain flora feature $OTU_j$ in sample set A, the relative abundance is less than 0.01% in 90% of the samples, the feature $OTU_j$ is filtered out. This removes a large number of features with relative abundance close to 0. The filtration leaves 300 OTU feature values, as in the following table:

index	group	otu1	otu5	otu6	otu9	…
sample1	Control	0.012576539	0	0.000103199	0	…
sample2	Control	0	0	0.0000903276	0	…
sample3	Control	0.007804594	0	0	0	…
sample4	Control	0.001026461	0	0.000306605	0	…
sample5	Control	0.000414224	0	0	0	…
sample6	Case	0	0	0.000216911	0	…
sample7	Case	0.001686275	0	0.002457516	0	…
sample8	Case	0	0	0.0000467978	0.000103107	…
sample9	Control	0.000212123	0	0	0.00007152154	…
sample10	Control	0.002755471	0	0	0	…
…	…	…	…	…	…	…

2. Grouping samples: the similarity of every two samples is calculated and the values are compared; samples with close similarity are put together to form an intestine type, and all the samples are divided into 5 intestine types, recorded as $\{B_1, B_2, \ldots, B_5\}$, with a proportion of non-healthy and control samples falling in each intestine type. The similarity values are as follows:

	Sample1	Sample2	Sample3	Sample4	Sample5	Sample6	Sample7	…
Sample1	0	0.182662	0.459110	0.487837	0.648875	0.608471	0.679105	…
Sample2	0.182662	0	0.376313	0.390729	0.561831	0.520588	0.593454	…
Sample3	0.459110	0.376313	0	0.328543	0.475267	0.458186	0.512884	…
Sample4	0.487837	0.390729	0.328543	0	0.210214	0.202143	0.315686	…
Sample5	0.648875	0.561831	0.475267	0.210214	0	0.098052	0.239273	…
Sample6	0.608471	0.520588	0.458186	0.202143	0.098052	0	0.252005	…
Sample7	0.679105	0.593454	0.512884	0.315686	0.239273	0.252005	0	…
…	…	…	…	…	…	…	…	…

3. Screening differential bacteria: for an intestine type data set, an evaluation parameter $dR_{DC,j}$ is defined for the difference in relative abundance of feature j between the non-healthy group DG and the control group CG, with the formula $dR_{DC,j} = \log_2(AR_{D,j} / AR_{C,j})$, where $AR_{D,j}$ and $AR_{C,j}$ respectively represent the sum of the relative abundances of feature j in the non-healthy samples and in the control samples. The results are sorted by absolute value, and the differential bacteria are selected. 20, 18, 19, 18 and 19 differential bacteria are selected from the 5 intestinal types respectively.
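The log-ratio score and the ranking by absolute value can be sketched as follows. Skipping features whose summed abundance is zero in either group is an assumption added here, since the logarithm of the ratio is undefined in that case; the function name and the optional `top` parameter are likewise illustrative.

```python
import math


def differential_features(dg_rows, cg_rows, top=None):
    """Score each feature j by log2(AR_D,j / AR_C,j) and rank by |score|.

    dg_rows and cg_rows hold the relative-abundance vectors of the
    non-healthy and control samples of one intestinal type.
    Returns a list of (feature_index, score), largest |score| first.
    """
    m = len(dg_rows[0])
    scores = []
    for j in range(m):
        ar_d = sum(row[j] for row in dg_rows)  # summed abundance in non-healthy samples
        ar_c = sum(row[j] for row in cg_rows)  # summed abundance in control samples
        if ar_d > 0 and ar_c > 0:
            # Assumption: features with a zero sum in either group are skipped.
            scores.append((j, math.log2(ar_d / ar_c)))
    scores.sort(key=lambda item: abs(item[1]), reverse=True)
    return scores if top is None else scores[:top]
```

A positive score means the feature is more abundant in the non-healthy group, a negative score means it is more abundant in the control group, and sorting on the absolute value treats both directions equally.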

4. Constructing a decision tree model.

Samples are selected by random sampling from the intestinal-type data set B_j. For example, when j = 1, the samples of intestinal type B₁ are used as the training and validation sets, with the 20 differential bacteria as features. The corresponding table is as follows:

Sample number	Otu15	...	Otu109	...	Otu163	Grouping
...	...	...	...	...	...	...
Sample15	Val_j1	...	Val_jj	...	Val_jp	Case
...	...	...	...	...	...	...
Sample1000	Val_si1	...	Val_sij	...	Val_sip	Control

First, an initial decision tree model is built and then trained with the data in the table above; the trained decision tree model can be used to judge whether a sample belongs to the non-healthy group or the healthy group. Because there are 5 intestinal-type data sets, 5 decision tree models are obtained after training, denoted tree1, tree2, tree3, tree4 and tree5.
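Training one tree per intestinal type can be sketched in plain Python. The splitting criterion (Gini impurity), the depth limit and all function names below are assumptions for illustration, because the text does not specify how the tree is grown. The two toy data sets stand in for the per-type tables of differential-bacteria features, and the sketch omits the random sampling and the training/validation split.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (weighted_gini, feature_index, threshold) for the best split, or None."""
    best = None
    for j in range(len(rows[0])):
        # Every distinct value except the largest leaves both sides non-empty.
        for t in sorted({r[j] for r in rows})[:-1]:
            left = [lab for r, lab in zip(rows, labels) if r[j] <= t]
            right = [lab for r, lab in zip(rows, labels) if r[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, j, t)
    return best


def build_tree(rows, labels, depth=0, max_depth=3):
    """Grow a tree; a leaf is a label string, a node is (feature, threshold, left, right)."""
    majority = Counter(labels).most_common(1)[0][0]
    can_split = depth < max_depth and len(set(labels)) > 1
    split = best_split(rows, labels) if can_split else None
    if split is None:
        return majority
    _, j, t = split
    left = [(r, lab) for r, lab in zip(rows, labels) if r[j] <= t]
    right = [(r, lab) for r, lab in zip(rows, labels) if r[j] > t]
    return (
        j,
        t,
        build_tree([r for r, _ in left], [lab for _, lab in left], depth + 1, max_depth),
        build_tree([r for r, _ in right], [lab for _, lab in right], depth + 1, max_depth),
    )


def predict(tree, row):
    """Walk from the root to a leaf and return its label."""
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if row[j] <= t else right
    return tree


# Hypothetical per-type training data: (feature rows, group labels).
gut_types = {
    "B1": ([[0.30, 0.01], [0.25, 0.02], [0.05, 0.20], [0.04, 0.25]],
           ["Case", "Case", "Control", "Control"]),
    "B2": ([[0.10, 0.50], [0.10, 0.60], [0.10, 0.10], [0.10, 0.20]],
           ["Case", "Case", "Control", "Control"]),
}
trees = {name: build_tree(rows, labels) for name, (rows, labels) in gut_types.items()}
print(predict(trees["B1"], [0.28, 0.01]), predict(trees["B2"], [0.10, 0.15]))
```

Keeping one model per intestinal type means each tree only has to separate Case from Control among samples that already resemble each other.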

5. Non-healthy calculation.

A new input sample to be tested contains 300 OTU features.

First, the absolute abundance is converted into relative abundance according to the conversion formula of step S11. The sample is then compared one by one with the samples in the 5 intestinal-type sets and the similarity is calculated, which yields the sample s with the highest similarity to the input sample. If s belongs to intestinal type B_i, the new sample is classified as intestinal type B_i, and the decision tree model corresponding to B_i is used to predict the input sample. Here, assuming that s belongs to B₃, the tree3 decision tree model is used to judge the new input sample and finally determine whether the input belongs to the "non-healthy" or the "healthy" type.
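The prediction path for a new sample can be sketched as below. Three assumptions are made: relative abundance is taken to be each count divided by the sample total, Bray-Curtis dissimilarity stands in for the similarity formula, and the per-type models are placeholder callables rather than trained decision trees. All names and values are hypothetical.

```python
def to_relative(counts):
    """Convert absolute abundances to relative abundances (count / sample total)."""
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


def dissimilarity(x, y):
    """Bray-Curtis dissimilarity, an assumed stand-in for the similarity formula."""
    den = sum(a + b for a, b in zip(x, y))
    return sum(abs(a - b) for a, b in zip(x, y)) / den if den else 0.0


def assign_gut_type(sample, reference):
    """Return the type label of the reference sample closest to the input sample."""
    best_label, best_value = None, float("inf")
    for label, rows in reference.items():
        for row in rows:
            value = dissimilarity(sample, row)
            if value < best_value:
                best_label, best_value = label, value
    return best_label


def assess(counts, reference, models):
    """Normalize, route to the nearest intestinal type, and apply that type's model."""
    sample = to_relative(counts)
    gut_type = assign_gut_type(sample, reference)
    return gut_type, models[gut_type](sample)


# Hypothetical reference samples per intestinal type (relative abundances).
reference = {
    "B1": [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]],
    "B3": [[0.1, 0.2, 0.7], [0.1, 0.1, 0.8]],
}
# Placeholder models; in the embodiment these would be tree1 ... tree5.
models = {
    "B1": lambda s: "non-healthy" if s[0] > 0.65 else "healthy",
    "B3": lambda s: "non-healthy" if s[2] > 0.75 else "healthy",
}

result = assess([10, 20, 70], reference, models)
print(result)
```

The routing step matters because a model trained on one intestinal type is only applied to inputs that resemble that type.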

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents or improvements made within the technical scope of the present invention should be included in the scope of the present invention.

Claims

1. A method for classifying intestinal microecological differential bacteria is characterized by comprising the following steps:

step S12, all samples in the sample set A are divided into k groups of intestinal types according to the similarity of absolute abundance, and the classified intestinal type set is denoted as AB: {B₁, B₂, ..., B_k}; the non-healthy group samples DG and the control group samples CG in the intestinal type set AB are then separated, the non-healthy group sample set being denoted as DG: {D₁, D₂, ..., D_k} and the control group sample set as CG: {C₁, C₂, ..., C_k};

2. The intestinal micro-ecological differential bacteria classification processing method according to claim 1, wherein in step S10, the features of the n samples are first arranged into a matrix, each row of the matrix consisting of one sample feature, and the sample features are expressed as: number + age + sampling address + non-healthy type + antibiotic type + OTU₁ + OTU₂ + ... + OTU_m, where OTU represents absolute abundance and m represents the number of OTUs contained in each sample feature.

3. The intestinal differential micro-ecological bacteria classification processing method according to claim 2, characterized in that in step S11, the absolute abundance is converted into the relative abundance according to the following steps:

letting the abundance of a certain flora in the sample be R_ij, the relative abundance of the jth feature in the ith sample is:

4. The intestinal differential micro-ecological bacteria classification processing method of claim 3, wherein in step S11, if, among the n samples in the matrix, the number of samples in which a given absolute-abundance feature has an abundance of more than 0.01% is S, the occurrence frequency of that flora is the ratio S/n.

5. The intestinal micro-ecological differential bacteria classification processing method of claim 4, characterized in that in step S11, for each intestinal-type data set B_i in the intestinal type set AB, the interference data of low-abundance features are screened out, wherein a low-abundance feature is defined as one with an occurrence frequency of less than 10% and a relative abundance value of less than 0.01%.

6. The method according to claim 1, wherein in step S12, for a sample x: [x₁, x₂, ..., x_i, ..., x_m] and a sample y: [y₁, y₂, ..., y_i, ..., y_m], the similarity sim of sample x and sample y is calculated using the following formula:

7. A method for evaluating intestinal health, comprising:

step S20, constructing k machine learning models according to k groups of intestinal type difference characteristics obtained by the intestinal microecological difference bacterium classification processing method of claim 1;

8. The method for evaluating intestinal health according to claim 7, wherein in the step S20, the process of constructing k machine learning models includes:

step S200, inputting k groups of intestinal-type difference features DF: {F₁, F₂, ..., F_i, ..., F_k}, wherein the difference feature of the i-th group of intestinal types is F_i, the number of features remaining after screening is p, and the number of samples is si;

step S201, for a data set containing si samples and p features:

selecting sm samples from the si samples by random sampling, selecting t features by using a differential feature screening method, establishing a decision tree by using the t features for the selected samples, repeating the sampling step k times, and then generating k decision tree models dcTree: {tree₁, tree₂, ..., tree_k}.

9. The method for evaluating intestinal health according to claim 7, wherein the step S21 of predicting the non-health probability comprises:

step S212, determining the intestinal type B_i of the sample s;

step S213, predicting the input sample x by using the decision tree corresponding to the intestinal type B_i.

10. The method for evaluating intestinal health according to claim 7, wherein the step S21 of predicting the non-health probability comprises: