CN113486954B - Intestinal microecological differential bacteria classification processing method and intestinal health assessment method - Google Patents

Publication number
CN113486954B
Authority
CN
China
Prior art keywords
sample
intestinal
samples
abundance
type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110764854.7A
Other languages
Chinese (zh)
Other versions
CN113486954A (en
Inventor
陈晓春
王小军
覃涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi Aisheng Life Technology Co ltd
Original Assignee
Guangxi Aisheng Life Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi Aisheng Life Technology Co ltd filed Critical Guangxi Aisheng Life Technology Co ltd
Priority to CN202110764854.7A priority Critical patent/CN113486954B/en
Publication of CN113486954A publication Critical patent/CN113486954A/en
Application granted granted Critical
Publication of CN113486954B publication Critical patent/CN113486954B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention discloses a method for classifying and processing intestinal microecological differential bacteria and a method for assessing intestinal health. The classification processing method comprises the following steps: step S10, collecting a sample set A comprising n samples, which include at least non-healthy group samples DG and control group samples CG; step S11, converting the digitized features of each sample from absolute abundance to relative abundance; step S12, dividing all samples in sample set A into k intestinal types according to the similarity of absolute abundance, recording the classified intestinal type set as AB, and then separating the non-healthy group samples from the control group samples within AB, recording the non-healthy group sample set as DG and the control group sample set as CG; and step S13, calculating and ranking the relative abundance differences between the non-healthy group sample set DG and the control group sample set CG to obtain the difference features DF corresponding to each of the k intestinal types. The invention can effectively extract sample features that are high-dimensional and span a wide range of abundance, and it is more robust to interference.

Description

Intestinal micro-ecological differential bacteria classification processing method and intestinal health assessment method
Technical Field
The invention relates to an intestinal microbial data analysis and processing method, in particular to an intestinal microecological differential bacterium classification and processing method and an intestinal health assessment method.
Background
In recent years, the gut microbiota has been found not only to play a role in cardiovascular and metabolism-related non-healthy conditions, but also to be an important environmental factor for several tumors, including colorectal, liver, and breast cancers. With the development of genomic technologies such as next-generation sequencing and of metabolomics, the role of the intestinal microbiota in the occurrence and development of tumors has attracted attention, and a growing number of animal and clinical studies show that the intestinal microbiota can serve as a marker and treatment target for the screening and prognosis prediction of non-healthy conditions of the intestine, liver cancer and the like.
On the other hand, the structure of the intestinal microbiota changes with age. In particular, age-related changes in diet can alter the biodiversity of the intestinal flora and the relative abundance of specific flora, leading to flora disorder and negatively affecting host physiology. The microbiota profile of aging individuals is characterized not by the dominance of a particular species group but by reduced diversity of the gut flora. For example, in faecal samples from patients with a high frailty score, lactic acid bacteria are significantly reduced and the ratio of Prevotella to Bacteroides/Prevotella is reduced, while Enterobacteriaceae and Ruminococcus are increased, and there is a significant negative correlation between frailty and gut flora diversity.
The data used for intestinal microecology analysis are often obtained by high throughput sequencing, described by a matrix structure. Each row of the matrix corresponds to a sample and the abundance of different flora contained in the sample, and all the sample data are combined together to form a matrix structure. The sum of the abundance of the flora in each sample is stable, and if the abundance of one component is decreased, the abundance of one or more other components may be increased.
Most existing references focus on constructing an overall processing system and outputting analysis results, and their content is relatively homogeneous:
patent document CN105046094B constructs a processing system for the analysis and storage of intestinal flora data, including a dynamic database for obtaining the latest detection parameters;
patent document CN107506582A constructs a health risk prediction system based on intestinal microorganisms and focuses on overall module construction and the format of the evaluation results;
patent publication CN108841974A evaluates maturity by comparing the compositional similarity of infant and maternal intestinal microorganisms;
patent document CN110144415A designs a method for predicting the health and immunity of introduced cows based on intestinal flora, using conventional abundance analysis and diversity analysis;
patent document CN111161794A evaluates the intestinal flora of a target object to obtain specific intestinal flora information;
patent documents CN111462819A and CN112151118A both focus on software automation of the intestinal flora data analysis process; the analysis mainly covers abundance data, probiotic and pathogenic bacteria analysis, diversity analysis, and interpretation in combination with disease-related databases.
In practice, there are many methods for analyzing intestinal microecological data. Generally, a non-healthy group and a healthy group are constructed and compared, differential bacteria are obtained by comparative analysis, and a model capable of predicting non-health in humans is then built by supervised learning. The difficulty lies in effectively extracting sample features that are high-dimensional and span a wide range of abundance.
Disclosure of Invention
The technical problem to be solved by the invention is to address the shortcomings of the prior art by providing an intestinal microecological differential bacteria classification processing method and an intestinal health assessment method that cope with the wide distribution range and strong interference of intestinal flora data, obtain key features based on a cluster classification algorithm, and thereby improve the intermediate analysis process.
In order to solve the technical problems, the invention adopts the following technical scheme.
A method for classifying and processing intestinal microecological differential bacteria comprises the following steps: step S10, collecting a sample set A, where sample set A comprises n samples including at least non-healthy group samples DG and control group samples CG, and the features of each sample consist of the basic information of the sample and the absolute abundance of intestinal flora obtained by 16S sequencing; step S11, converting the digitized features of each sample from absolute abundance to relative abundance, and screening out the abundance features whose occurrence frequency in the samples is less than a preset value and whose relative abundance is close to zero; step S12, dividing all samples in sample set A into k intestinal types according to the similarity of absolute abundance, recording the classified intestinal type set as AB: {B_1, B_2, ..., B_k}, then separating the non-healthy group samples DG from the control group samples CG within AB, recording the non-healthy group sample set as DG: {D_1, D_2, ..., D_k} and the control group sample set as CG: {C_1, C_2, ..., C_k}; step S13, calculating and ranking the relative abundance differences between the non-healthy group sample set DG and the control group sample set CG to obtain the difference features DF corresponding to each of the k intestinal types: {F_1, F_2, ..., F_i, ..., F_k}.
Preferably, in step S10, the features of the n samples are arranged into a matrix, each row of which holds the features of one sample, expressed as: number + age + sampling address + non-healthy type + antibiotic type + OTU_1 + OTU_2 + ... + OTU_m, where OTU denotes absolute abundance and m denotes the number of OTUs contained in each sample's features.
Preferably, in step S11, if, for a given abundance feature in the matrix, the number of samples among the n samples in which its abundance exceeds 0.01% is S, the occurrence frequency of that flora is the ratio S/n.
Preferably, in step S11, for each intestinal type data set B_i in the intestinal type set AB, low-abundance features are screened out as interference data, where a low-abundance feature is defined as one with an occurrence frequency of less than 10% and a relative abundance of less than 0.01%.
A method for assessing intestinal health comprises: step S20, constructing k machine learning models from the k groups of intestinal type difference features obtained by the above intestinal microecological differential bacteria classification processing method; step S21, inputting a set of sample data with known absolute abundance, converting the absolute abundance of the sample data into relative abundance, and predicting the unhealthy probability of the sample data using the k machine learning models obtained in step S20.
Preferably, in step S20, the process of constructing the k machine learning models includes: step S200, inputting the k groups of intestinal type difference features DF: {F_1, F_2, ..., F_i, ..., F_k}, where the difference feature of the i-th intestinal type is F_i, the number of features remaining after screening is p, and the number of samples is si; step S201, for a data set containing si samples and p features: selecting sm samples from the si samples by random sampling, selecting t features using the difference feature screening method, building a decision tree on the selected samples using the t features, and repeating the sampling step k times to generate k decision tree models dcTree: {tree_1, tree_2, ..., tree_k}.
Preferably, in step S21, the prediction of the unhealthy probability includes: step S210, converting the absolute abundance of the flora in the input sample data into relative abundance, and denoting the input sample as x: [R_1, R_2, ..., R_i, ..., R_m]; step S211, comparing the input sample x with the samples in sample set A one by one to find the sample S with the highest similarity; step S212, determining the intestinal type B_i to which sample S belongs; step S213, predicting the input sample x using the decision tree corresponding to intestinal type B_i.
Preferably, in step S21, the prediction of the unhealthy probability includes: predicting the input sample x with each of the k machine learning models to obtain k unhealthy probabilities, and determining and outputting the highest unhealthy probability.
In the intestinal micro-ecological difference bacterium classification processing method and the intestinal health assessment method, the steps of sample characteristic digitization processing, format conversion and filtration of digitized characteristics, sample similarity clustering and difference characteristic screening are sequentially carried out on the collected sample set A.
Drawings
FIG. 1 is a flow chart of the intestinal micro-ecological difference bacteria classification processing method of the present invention;
fig. 2 is a flow chart of the intestinal health assessment method according to the present invention.
Detailed Description
The invention is described in more detail below with reference to the figures and examples.
The invention discloses a method for classifying and processing intestinal microecological differential bacteria, please refer to fig. 1, which comprises the following steps:
step S10, collecting a sample set A, where sample set A comprises n samples including at least non-healthy group samples DG and control group samples CG, and the features of each sample consist of the basic information of the sample and the absolute abundance of intestinal flora obtained by 16S sequencing;
step S11, converting the digitized features of each sample from absolute abundance to relative abundance, and screening out the abundance features whose occurrence frequency in the samples is less than a preset value and whose relative abundance is close to zero;
step S12, dividing all samples in sample set A into k intestinal types according to the similarity of absolute abundance, recording the classified intestinal type set as AB: {B_1, B_2, ..., B_k}, then separating the non-healthy group samples DG from the control group samples CG within AB, recording the non-healthy group sample set as DG: {D_1, D_2, ..., D_k} and the control group sample set as CG: {C_1, C_2, ..., C_k};
step S13, calculating and ranking the relative abundance differences between the non-healthy group sample set DG and the control group sample set CG to obtain the difference features DF corresponding to each of the k intestinal types: {F_1, F_2, ..., F_i, ..., F_k}.
In the above method, the collected sample set A successively undergoes digitization of sample features, format conversion and filtering of the digitized features, sample similarity clustering, and difference feature screening. Compared with the prior art, the method obtains key features based on a cluster classification algorithm and thereby improves the intermediate analysis process; it can effectively extract sample features that are high-dimensional and span a wide range of abundance, is more robust to interference, and better meets application requirements.
For details of the implementation of steps S10 to S13, please refer to the following first to fourth embodiments.
Example one
In this embodiment, the step S10 mainly implements a digital processing process of the sample characteristics.
For step S10, a sample set A is collected containing n samples, comprising expert-labeled non-healthy group samples DG and control group samples CG. The number of non-healthy group samples is n1 and the number of control group samples is n2, where n1 and n2 should be close in value and n = n1 + n2. The features of each sample consist of the basic information of the sample and the absolute abundance (OTU features) of the intestinal flora obtained by 16S sequencing, and each sample contains at most m features.
In step S10, the features of the n samples are arranged into a matrix, each row of which holds the features of one sample, expressed as: number + age + sampling address + non-healthy type + antibiotic type + OTU_1 + OTU_2 + ... + OTU_m, where OTU denotes absolute abundance and m denotes the number of OTUs contained in each sample's features. Note that every sample has m OTU feature values.
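The row layout described above can be sketched as a small record type plus a helper that stacks the OTU vectors into the n x m matrix. This is only an illustrative sketch: the class name `SampleRow`, the helper `abundance_matrix`, the empty-string encoding for "no non-healthy type" and "no antibiotic", and all values are assumptions, not part of the patent.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SampleRow:
    number: str            # sample identifier
    age: int
    sampling_address: str
    non_healthy_type: str  # assumed encoding: "" for a control sample
    antibiotic_type: str   # assumed encoding: "" when no antibiotic is recorded
    otu: List[int]         # absolute abundances OTU_1 ... OTU_m


def abundance_matrix(rows):
    """Stack the OTU vectors into an n x m matrix (one row per sample)."""
    m = len(rows[0].otu)
    if any(len(r.otu) != m for r in rows):
        raise ValueError("every sample must carry the same m OTU values")
    return [list(r.otu) for r in rows]


# Illustrative values only.
rows = [
    SampleRow("S001", 54, "site-1", "type-x", "", [120, 0, 35]),
    SampleRow("S002", 47, "site-2", "", "", [80, 15, 5]),
]
matrix = abundance_matrix(rows)
print(len(matrix), len(matrix[0]))  # 2 3
```

Keeping the basic information separate from the OTU vector makes it straightforward to run the later abundance steps on the numeric matrix alone.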
During processing, samples with a non-health history and an antibiotic administration history are filtered out.
Example two
In this embodiment, the step S11 mainly implements format conversion and filtering functions of the digital feature.
Step S11 comprises a format conversion step and a low-abundance feature screening step.
For the format conversion step, the relative abundance of a flora is the ratio of the abundance of that flora in a given sample to the sum of the abundances of all flora in that sample. In step S11, the absolute abundance is therefore converted into relative abundance as follows:
Let R_ij denote the absolute abundance of the j-th flora feature in the i-th sample, where the non-healthy group DG and the control group CG contain n1 and n2 samples respectively and each sample has m features. The relative abundance r_ij of the j-th feature in the i-th sample is then:
$$r_{ij} = \frac{R_{ij}}{\sum_{l=1}^{m} R_{il}}$$
Based on the above calculation, the digitized features of each sample are converted from absolute abundance to relative abundance.
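The conversion defined above (each abundance divided by the sum of all abundances in the same sample) can be sketched in a few lines. The function name `to_relative_abundance` and the handling of an all-zero sample are assumptions made for this illustration.

```python
def to_relative_abundance(counts):
    """Convert one sample's absolute OTU abundances to relative abundances."""
    total = sum(counts)
    if total == 0:
        # Assumption: a sample with no counts is returned as all zeros.
        return [0.0 for _ in counts]
    return [c / total for c in counts]


sample = [50, 30, 20, 0]           # illustrative absolute abundances
relative = to_relative_abundance(sample)
print(relative)                    # [0.5, 0.3, 0.2, 0.0]
```

After conversion the values of each sample sum to 1, which is what makes samples with different sequencing depths comparable.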
For the low-abundance feature screening step, the occurrence frequency of each flora is calculated first. In step S11, if, for a given abundance feature in the matrix (composed of all samples and features), the number of samples among the n samples in which its abundance exceeds 0.01% is S, the occurrence frequency of that flora is the ratio S/n.
The specific screening is as follows: in step S11, for each intestinal type data set B_i in the intestinal type set AB, low-abundance features are screened out as interference data, where a low-abundance feature is defined as one with an occurrence frequency of less than 10% and a relative abundance of less than 0.01%. That is, OTU features that occur in the samples at low frequency and with relative abundance close to zero are removed.
After the treatment, the abundance characteristics of the intestinal flora are represented by relative abundance instead of absolute abundance, and the number of the characteristics is reduced from m to p, wherein p < m.
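A minimal sketch of this screening is given below. It follows one reading of the rule: a feature is dropped when the share of samples in which its relative abundance exceeds 0.01% (the occurrence frequency S/n) is below 10%. The function names and the toy matrix are assumptions for illustration; the toy rows are not normalized to sum to 1.

```python
ABUNDANCE_CUTOFF = 0.0001   # 0.01% expressed as a fraction
FREQUENCY_CUTOFF = 0.10     # 10% of the samples


def occurrence_frequency(matrix, j):
    """S/n: share of samples whose abundance of feature j exceeds the cutoff."""
    s = sum(1 for row in matrix if row[j] > ABUNDANCE_CUTOFF)
    return s / len(matrix)


def drop_low_abundance(matrix):
    """Return the reduced matrix and the indices of the p features kept."""
    m = len(matrix[0])
    keep = [j for j in range(m)
            if occurrence_frequency(matrix, j) >= FREQUENCY_CUTOFF]
    return [[row[j] for j in keep] for row in matrix], keep


# Toy data, 20 samples x 3 features: feature 1 exceeds the cutoff in 1 sample
# (5%), feature 2 in 2 samples (10%), feature 0 in all samples.
matrix = [[0.9, 0.001 if i == 0 else 0.0, 0.002 if i < 2 else 0.0]
          for i in range(20)]
filtered, kept = drop_low_abundance(matrix)
print(kept)  # [0, 2]
```

Here m = 3 features are reduced to p = 2, mirroring the reduction from m to p described above.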
Example three
In this embodiment, the step S12 is a sample similarity clustering process.
Step S12 involves a sample similarity clustering procedure: all samples in sample set A are classified according to the similarity of their OTU features, samples with high similarity are placed in the same class, and the samples are thus divided into k intestinal types, each containing several samples.
Further, in step S12, given a sample x: [x_1, x_2, ..., x_i, ..., x_m] and a sample y: [y_1, y_2, ..., y_i, ..., y_m], the similarity sim of sample x and sample y is calculated using the following formulas:
[Formula images GDA0004061934540000081 and GDA0004061934540000082: similarity formulas]
Since the sample set A contains a large number of samples with the same structure as x or y, a data clustering method can be adopted: the similarity between every two samples is calculated and the samples are divided into several groups, with highly similar samples placed in the same group. If the samples may be divided into at most k groups, A can be expressed as {B_1, B_2, ..., B_k}, where each group contains several samples that are highly similar to one another and samples of different groups have lower similarity.
Given a new sample s_n, to determine which group it belongs to, the similarity between s_n and the samples in the different groups is calculated; the sample q most similar to s_n is found, and s_n is assigned to the group to which q belongs.
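The assignment of a new sample to the group of its most similar sample can be sketched as follows. The similarity formula itself is not reproduced in this text, so cosine similarity is used purely as a stand-in assumption; the names `cosine_similarity` and `assign_group` and the toy groups are likewise hypothetical.

```python
import math


def cosine_similarity(x, y):
    """Stand-in similarity (assumption): cosine of the angle between x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    if nx == 0 or ny == 0:
        return 0.0
    return dot / (nx * ny)


def assign_group(new_sample, groups):
    """groups maps a group label to its list of sample vectors."""
    best_label, best_sim = None, -1.0
    for label, samples in groups.items():
        for q in samples:
            sim = cosine_similarity(new_sample, q)
            if sim > best_sim:
                best_label, best_sim = label, sim
    return best_label


groups = {
    "B1": [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1]],
    "B2": [[0.1, 0.1, 0.8]],
}
s_n = [0.2, 0.1, 0.7]
print(assign_group(s_n, groups))  # B2
```

Any other similarity measure can be swapped in without changing the assignment logic, since only the ranking of the pairwise scores matters.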
Only the q features with the highest abundance in the sample may be considered (e.g., q = 10) when calculating similarity and classifying. The classified intestinal type set is recorded as AB: {B_1, B_2, ..., B_k}. The non-healthy group (DG) samples in AB are then separated from the control group (CG) samples, so that the non-healthy group sample set is recorded as DG: {D_1, D_2, ..., D_k} and the control group sample set as CG: {C_1, C_2, ..., C_k}.
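The two bookkeeping steps in this paragraph, restricting to the q most abundant features and splitting each intestinal type into its D_i and C_i parts, might look like the sketch below. Ranking features by their mean abundance over the whole sample set is an assumption (the text does not say how "highest abundance" is aggregated), and the function names and labels "DG"/"CG" as strings are hypothetical.

```python
def top_q_features(matrix, q):
    """Indices of the q features with the highest mean abundance (assumed ranking)."""
    n, m = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n for j in range(m)]
    ranked = sorted(range(m), key=lambda j: means[j], reverse=True)
    return sorted(ranked[:q])


def split_by_label(group_of, labels, k):
    """Return (DG, CG): per intestinal type, the sample indices of each group.

    group_of[i] is the intestinal type index (0..k-1) of sample i,
    labels[i] is "DG" or "CG".
    """
    dg = [[] for _ in range(k)]
    cg = [[] for _ in range(k)]
    for i, (g, label) in enumerate(zip(group_of, labels)):
        (dg if label == "DG" else cg)[g].append(i)
    return dg, cg


matrix = [
    [0.5, 0.1, 0.4, 0.0],
    [0.6, 0.0, 0.3, 0.1],
    [0.4, 0.2, 0.4, 0.0],
]
print(top_q_features(matrix, 2))   # [0, 2]

dg, cg = split_by_label([0, 1, 0, 1], ["DG", "CG", "CG", "DG"], k=2)
print(dg, cg)                      # [[0], [3]] [[2], [1]]
```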
Example four
In this embodiment, the step S13 mainly involves a process of screening for differential characteristics based on the intestinal type.
For step S13, the screening of difference features based on intestinal type proceeds as follows:
the samples are recombined by intestinal type into the pairs D_1|C_1, D_2|C_2, ..., D_i|C_i, ..., D_k|C_k. D_i and C_i (1 ≤ i ≤ k) are paired because they belong to the same intestinal type, so that samples of similar intestinal type are put together for difference feature screening.
For the non-healthy group D_i and the control group C_i, the flora features are compared: the difference in value of every feature between the two groups of samples is calculated and the differences are sorted, which yields the difference feature sequence. The difference features corresponding to the k intestinal types are finally obtained and recorded as DF: {F_1, F_2, ..., F_i, ..., F_k}.
By analyzing the relationship between the difference value of the difference feature sequence and the sample label category, t features (t ≦ q) with obvious advantages can be selected for machine learning modeling analysis of k groups of samples.
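A hedged sketch of the difference feature screening for one intestinal type follows. The text does not state how the "value difference" is aggregated over samples, so the difference of per-group mean relative abundance, ranked by absolute value, is assumed here; `mean_profile` and `difference_features` are hypothetical names.

```python
def mean_profile(samples):
    """Mean relative abundance of each feature over a group of samples."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]


def difference_features(d_i, c_i, t):
    """Top-t (feature index, difference) pairs for one intestinal type.

    Assumption: difference = mean in D_i minus mean in C_i,
    ranked by absolute value, largest first.
    """
    mean_d, mean_c = mean_profile(d_i), mean_profile(c_i)
    diffs = [(j, mean_d[j] - mean_c[j]) for j in range(len(mean_d))]
    diffs.sort(key=lambda item: abs(item[1]), reverse=True)
    return diffs[:t]


d_i = [[0.6, 0.3, 0.1], [0.4, 0.5, 0.1]]   # non-healthy samples of type i
c_i = [[0.2, 0.4, 0.4], [0.2, 0.2, 0.6]]   # control samples of type i
ranked = difference_features(d_i, c_i, t=2)
print([j for j, _ in ranked])  # [2, 0]
```

Running this once per pair D_i|C_i gives one ranked list F_i per intestinal type, which together form DF.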
Based on the methods of the first to fourth embodiments, the invention screens the differential microecological bacteria of the human intestine. On this basis, the invention further relates to a machine learning model for predicting the health status of the intestine, specifically an intestinal health assessment method which, referring to fig. 2, includes:
step S20, constructing k machine learning models from the k groups of intestinal type difference features obtained by the intestinal microecological differential bacteria classification processing method;
step S21, inputting a group of sample data with known absolute abundance, converting the absolute abundance of the sample data into relative abundance, and predicting the unhealthy probability of the sample data by using the k machine learning models obtained in the step S20.
For the step S20, the process of constructing k machine learning models includes:
First, there are k intestinal types; the difference feature of the i-th type is F_i, which contains p features and si samples. The table corresponding to F_i is as follows:
Sample number  Feature 1  ...  Feature j  ...  Feature t
Sample 1       Val_11     ...  Val_1j     ...  Val_1t
...            ...        ...  ...        ...  ...
Sample j       Val_j1     ...  Val_jj     ...  Val_jt
...            ...        ...  ...        ...  ...
Sample si      Val_si1    ...  Val_sij    ...  Val_sit
In the above table, the original sample contains m features, of which p remain after filtering (p ≤ m); the q features with the highest abundance are selected from the p features to calculate similarity (q ≤ p); finally, t features are obtained by difference analysis (t ≤ q).
The step S20 further includes:
step S200, inputting the k groups of intestinal type difference features DF: {F_1, F_2, ..., F_i, ..., F_k}, where the difference feature of the i-th intestinal type is F_i, the number of features remaining after screening is p, and the number of samples is si;
step S201, for a data set containing si samples and p features: selecting sm samples from the si samples by random sampling, selecting t features using the difference feature screening method, building a decision tree on the selected samples using the t features, and repeating the sampling step k times to generate k decision tree models dcTree: {tree_1, tree_2, ..., tree_k}.
That is, for a data set containing si samples and p features, sm samples are selected from the si samples by random sampling, t features are selected using the difference feature screening method, and a decision tree is built on the selected samples using these features. The sampling step is repeated k times to generate k decision trees, which together form a decision tree model. New unclassified data can be judged by the k decision trees one by one, giving k judgment results; the best judgment result is finally selected as the class to which the new sample belongs and is output.
In practical applications, the above model is not limited to the decision tree, but is also applicable to other machine learning models such as linear regression, K-nearest neighbor or support vector machine methods.
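The repeated random sampling and tree building can be sketched with a deliberately simplified learner. In this sketch a depth-1 decision tree (a stump) stands in for a full decision tree, sampling is done with replacement, and the share of trees voting "non-healthy" is used as the probability; all three choices, and every function name, are assumptions made for illustration rather than details given in the text.

```python
import random


def train_stump(samples, labels):
    """Depth-1 decision tree: the (feature, threshold, direction) with fewest errors."""
    best = None
    for j in range(len(samples[0])):
        for thr in sorted({row[j] for row in samples}):
            for sign in (1, -1):
                # Predict non-healthy (1) when sign * (value - thr) > 0.
                preds = [1 if sign * (row[j] - thr) > 0 else 0 for row in samples]
                errors = sum(p != y for p, y in zip(preds, labels))
                if best is None or errors < best[0]:
                    best = (errors, j, thr, sign)
    _, j, thr, sign = best
    return lambda row: 1 if sign * (row[j] - thr) > 0 else 0


def build_models(samples, labels, n_trees, sm, seed=0):
    """Repeat the random sampling n_trees times, training one stump per draw."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(samples)) for _ in range(sm)]  # with replacement
        trees.append(train_stump([samples[i] for i in idx],
                                 [labels[i] for i in idx]))
    return trees


def non_health_probability(trees, row):
    """Assumed aggregation: fraction of trees that vote non-healthy."""
    return sum(tree(row) for tree in trees) / len(trees)


# Toy data for one intestinal type: t = 2 difference features, label 1 = non-healthy.
samples = [[0.1, 0.5], [0.2, 0.4], [0.8, 0.5], [0.9, 0.4]]
labels = [0, 0, 1, 1]
stump = train_stump(samples, labels)
print(stump([0.85, 0.5]), stump([0.15, 0.5]))  # 1 0

trees = build_models(samples, labels, n_trees=5, sm=4)
probability = non_health_probability(trees, [0.85, 0.5])
```

In practice one such ensemble would be trained per intestinal type, each on the t difference features of that type.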
For step S21, please refer to fig. 2, there are two ways to realize the non-health probability prediction:
in a first way, the prediction process of the non-health probability may be:
step S210, converting the absolute abundance of the flora in the input sample data into relative abundance, and denoting the input sample as x: [R_1, R_2, ..., R_i, ..., R_m];
step S211, comparing the input sample x with the samples in sample set A one by one to find the sample S with the highest similarity;
step S212, determining the intestinal type B_i to which sample S belongs;
step S213, predicting the input sample x using the decision tree corresponding to intestinal type B_i.
In the above steps, the intestinal type to which S belongs is looked up; when S belongs to intestinal type k (0 < k < NC + 1), the machine learning model of intestinal type k is used to predict the unhealthy probability of the sample.
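Steps S211 to S213 amount to a nearest-sample lookup followed by a call to the model of the matching intestinal type. In the sketch below the similarity function is passed in as a parameter; the one used in the example (one minus half the L1 distance between two relative abundance vectors) and the constant-valued models are stand-in assumptions, as are all names.

```python
def predict_by_nearest_type(x, reference, models, similarity):
    """Route x to the model of the intestinal type of its most similar sample.

    reference: list of (relative abundance vector, intestinal type index)
    models:    one callable per intestinal type, returning a probability
    """
    _, type_index = max(reference, key=lambda item: similarity(x, item[0]))
    return type_index, models[type_index](x)


def l1_similarity(x, y):
    """Stand-in similarity (assumption): 1 minus half the L1 distance."""
    return 1.0 - 0.5 * sum(abs(a - b) for a, b in zip(x, y))


reference = [([0.8, 0.1, 0.1], 0), ([0.1, 0.1, 0.8], 1)]
models = [lambda v: 0.2, lambda v: 0.9]   # hypothetical per-type models
x = [0.2, 0.1, 0.7]
print(predict_by_nearest_type(x, reference, models, l1_similarity))  # (1, 0.9)
```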
In a second way, the prediction process of the non-health probability may be:
and predicting the input sample x by using k machine learning models respectively to obtain k unhealthy probabilities, and determining and outputting the highest unhealthy probability.
In this approach no similarity comparison is performed: the sample x is input directly into the k machine learning models to obtain k predicted non-health probabilities, and the highest of these is selected and output as the non-health probability of the sample.
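The second way reduces to evaluating every model and keeping the maximum. A minimal sketch, with hypothetical constant-valued models standing in for the k trained models:

```python
def predict_max_probability(x, models):
    """Evaluate all models on x and return (highest probability, model index, all)."""
    probabilities = [model(x) for model in models]
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return probabilities[best], best, probabilities


# Hypothetical stand-ins for the k = 3 per-type models.
models = [lambda v: 0.35, lambda v: 0.80, lambda v: 0.10]
highest, which, all_probabilities = predict_max_probability([0.2, 0.1, 0.7], models)
print(highest, which)  # 0.8 1
```

Compared with the first way, this skips the similarity search at the cost of running all k models for every input.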
It should be noted that the invention relates to both an intestinal microecological differential bacteria classification processing method and an intestinal health assessment method. In practice the two can be used in combination or applied separately as needed; the invention does not limit the specific manner of application, and both methods fall within the protection scope of the invention.
In order to more clearly describe the technical solution of the present invention, the present invention provides the following specific detailed examples.
Example one
The embodiment relates to an evaluation method for identifying intestinal health based on human intestinal flora conditions, which comprises the following steps:
Step S1, preparing the samples of group A and preprocessing them: unqualified samples are deleted, and samples with an unhealthy history and antibiotic-taking history are filtered out. Then, flora features whose relative abundance is less than 0.01% in more than 90% of the samples are screened out.
Step S2, calculating the similarity between every two samples in group A according to the similarity formulas (formula images GDA0004061934540000111 and GDA0004061934540000112), dividing samples with higher similarity into one group, and classifying the samples into 5 intestinal types, namely {B_1, B_2, B_3, B_4, B_5}.
Each intestinal type data set B_i is first converted to relative abundance according to the formula (formula image GDA0004061934540000121).
Step S3, the samples of the 5 intestinal types are divided into the non-healthy group DG: {D_1, D_2, D_3, D_4, D_5} and the control group CG: {C_1, C_2, C_3, C_4, C_5}, and the relative abundance differences between the two groups of samples are calculated and ranked. The differential flora of the 5 intestinal type data sets is obtained and recorded as DF: {F_1, F_2, F_3, F_4, F_5}.
Step S4, for each intestinal type data set, only the differential flora features DF: {F_1, F_2, F_3, F_4, F_5} are retained. Then s_m samples are selected by random sampling and k decision trees are constructed from them to form a decision tree model. The 5 intestinal types thus yield 5 decision tree models in total, recorded as dcTree: {Tree1, Tree2, Tree3, Tree4, Tree5}.
Step S5, for a new input sample to be tested, the absolute abundance is first converted into relative abundance according to the formula
Figure GDA0004061934540000122
The sample is then compared one by one with the samples in the 5 intestinal-type sets and the similarity is calculated, giving the sample s with the highest similarity to the input sample. If s belongs to intestinal type B_i, the new sample is likewise classified as intestinal type B_i, and the decision tree model corresponding to B_i is used to predict the input sample.
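The conversion formula used in steps S2 and S5 survives here only as a figure reference. As a minimal sketch, assuming it is the usual definition (each abundance R_ij divided by the total abundance of sample i), the conversion could look as follows. The function name `to_relative_abundance` and the handling of all-zero samples are assumptions made for illustration, not part of the patent text.

```python
def to_relative_abundance(counts):
    """Row-normalise absolute abundances so that each sample sums to 1.

    counts: list of samples, each a list of absolute abundances (one per OTU).
    Assumption: a sample whose total is zero is returned as all zeros.
    """
    relative = []
    for row in counts:
        total = sum(row)
        if total == 0:
            relative.append([0.0 for _ in row])
        else:
            relative.append([value / total for value in row])
    return relative


if __name__ == "__main__":
    # Toy counts, invented for illustration only.
    print(to_relative_abundance([[2, 6, 2], [10, 0, 30]]))
```

Normalising per sample makes samples with different sequencing depths comparable before any similarity is computed.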
The procedure for the specific test data is as follows:
1. Sample specification and pretreatment; see Fig. 1 for the sample processing.
A sample set A containing both non-healthy and control samples is given. The features of each sample are the relative abundance values of the flora in the intestinal environment obtained from 16S sequencing. Each sample is therefore equivalent to a row vector in a matrix, composed as: number + sampling address + non-healthy type + antibiotic type + OTU_1 + OTU_2 + ... + OTU_Len, where each OTU value is the relative abundance of a certain flora in the sample, and each sample has 1000 OTU feature values.
If the relative abundance of a certain flora feature OTU_j in sample set A is less than 0.01% in more than 90% of the samples, the feature OTU_j is filtered out; this removes a large number of features whose relative abundance is close to 0. After filtering, 300 OTU feature values remain, as in the following table:
index     group    otu1         otu5  otu6          otu9
sample1   Control  0.012576539  0     0.000103199   0
sample2   Control  0            0     0.0000903276  0
sample3   Control  0.007804594  0     0             0
sample4   Control  0.001026461  0     0.000306605   0
sample5   Control  0.000414224  0     0             0
sample6   Case     0            0     0.000216911   0
sample7   Case     0.001686275  0     0.002457516   0
sample8   Case     0            0     0.0000467978  0.000103107
sample9   Control  0.000212123  0     0             0.00007152154
sample10  Control  0.002755471  0     0             0
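The filtering rule described before the table can be sketched as follows. This is an illustration under stated assumptions rather than the patented implementation: relative abundances are taken to be fractions (so 0.01% is 1e-4), a feature is kept when it exceeds that cutoff in at least 10% of the samples, and the function name `filter_low_abundance` and the toy matrix are hypothetical.

```python
def filter_low_abundance(rel_abund, abundance_cutoff=1e-4, min_frequency=0.10):
    """Drop features that exceed the abundance cutoff in too few samples.

    rel_abund: list of samples, each a list of relative abundances (fractions).
    Returns (kept_column_indices, filtered_matrix).
    """
    n_samples = len(rel_abund)
    n_features = len(rel_abund[0])
    kept = []
    for j in range(n_features):
        # Number of samples in which feature j is above the 0.01% cutoff.
        present = sum(1 for row in rel_abund if row[j] > abundance_cutoff)
        # Occurrence frequency S/n; features below min_frequency are removed.
        if present / n_samples >= min_frequency:
            kept.append(j)
    filtered = [[row[j] for j in kept] for row in rel_abund]
    return kept, filtered


if __name__ == "__main__":
    # Toy data, invented for illustration: the middle feature never
    # exceeds 0.01% and is therefore filtered out.
    toy = [
        [0.70, 0.00000, 0.30000],
        [0.60, 0.00005, 0.39995],
        [0.55, 0.00000, 0.45000],
        [0.80, 0.00002, 0.19998],
    ]
    kept, filtered = filter_low_abundance(toy)
    print(kept)
```

On the real data the same rule is what reduces the 1000 OTU features to the 300 that remain.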
2. Grouping samples: the similarity between every two samples is calculated and the values are compared; samples with close similarity are put together to form an intestinal type. All samples are divided into 5 intestinal types, denoted {B_1, B_2, ..., B_5}, and each intestinal type contains a certain proportion of non-healthy and control samples. The similarity values are as follows:
         Sample1   Sample2   Sample3   Sample4   Sample5   Sample6   Sample7
Sample1  0         0.182662  0.459110  0.487837  0.648875  0.608471  0.679105
Sample2  0.182662  0         0.376313  0.390729  0.561831  0.520588  0.593454
Sample3  0.459110  0.376313  0         0.328543  0.475267  0.458186  0.512884
Sample4  0.487837  0.390729  0.328543  0         0.210214  0.202143  0.315686
Sample5  0.648875  0.561831  0.475267  0.210214  0         0.098052  0.239273
Sample6  0.608471  0.520588  0.458186  0.202143  0.098052  0         0.252005
Sample7  0.679105  0.593454  0.512884  0.315686  0.239273  0.252005  0
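The text does not spell out how the pairwise values are turned into groups, and the similarity formula itself is only available as a figure. The sketch below is therefore an assumption-laden illustration: it treats the tabulated values as distances (the diagonal is 0, so smaller appears to mean more alike) and merges any two samples whose value is below a hypothetical threshold of 0.25, using a simple single-linkage rule. Neither the threshold nor the linkage rule comes from the patent.

```python
def group_by_threshold(dist, threshold):
    """Single-linkage grouping: samples closer than `threshold` share a group."""
    n = len(dist)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the group representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] < threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[max(ri, rj)] = min(ri, rj)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Pairwise values for Sample1..Sample7, copied from the table above.
PAIRWISE = [
    [0, 0.182662, 0.459110, 0.487837, 0.648875, 0.608471, 0.679105],
    [0.182662, 0, 0.376313, 0.390729, 0.561831, 0.520588, 0.593454],
    [0.459110, 0.376313, 0, 0.328543, 0.475267, 0.458186, 0.512884],
    [0.487837, 0.390729, 0.328543, 0, 0.210214, 0.202143, 0.315686],
    [0.648875, 0.561831, 0.475267, 0.210214, 0, 0.098052, 0.239273],
    [0.608471, 0.520588, 0.458186, 0.202143, 0.098052, 0, 0.252005],
    [0.679105, 0.593454, 0.512884, 0.315686, 0.239273, 0.252005, 0],
]

if __name__ == "__main__":
    # Indices are zero-based: 0 is Sample1, 6 is Sample7.
    print(group_by_threshold(PAIRWISE, threshold=0.25))
    # prints [[0, 1], [2], [3, 4, 5, 6]]
```

With this threshold the seven samples shown fall into three groups; on the full sample set the patent reports 5 intestinal types, which would correspond to a different cut.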
3. Screening differential bacteria: for an intestinal-type data set, an evaluation parameter dR_{DC,j} is defined for the difference in relative abundance of feature j between the non-healthy group DG and the control group CG, with the corresponding formula: dR_{DC,j} = log_2(AR_{D,j} / AR_{C,j}), where AR_{D,j} and AR_{C,j} respectively represent the sum of the relative abundances of feature j in the non-healthy samples and in the control samples. The features are sorted by the absolute value of the calculated result, and the differential bacteria are selected. 20, 18, 19, 18 and 19 differential bacteria are selected from the 5 intestinal types, respectively.
Figure GDA0004061934540000131
Figure GDA0004061934540000141
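The log-ratio ranking given by the dR formula above can be sketched as follows. The formula itself is from the text; the small pseudocount that guards against a zero sum, the function name `rank_differential_features`, and the toy data are assumptions added for illustration.

```python
import math


def rank_differential_features(case_rows, control_rows, top_n, pseudo=1e-9):
    """Rank features by |log2(AR_D / AR_C)| and return the top_n (index, dR) pairs.

    case_rows / control_rows: relative-abundance rows of the non-healthy and
    control samples of one intestinal type.
    pseudo: assumed pseudocount that avoids division by zero and log2(0).
    """
    n_features = len(case_rows[0])
    scores = []
    for j in range(n_features):
        ar_d = sum(row[j] for row in case_rows)     # AR_{D,j}
        ar_c = sum(row[j] for row in control_rows)  # AR_{C,j}
        d_r = math.log2((ar_d + pseudo) / (ar_c + pseudo))
        scores.append((j, d_r))
    # Sort by the absolute value of dR, largest difference first.
    scores.sort(key=lambda item: abs(item[1]), reverse=True)
    return scores[:top_n]


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    case = [[0.4, 0.1, 0.5], [0.4, 0.1, 0.5]]
    control = [[0.1, 0.1, 0.8], [0.1, 0.1, 0.8]]
    for index, d_r in rank_differential_features(case, control, top_n=2):
        print(index, round(d_r, 3))
```

A positive dR means the feature is enriched in the non-healthy group and a negative dR means it is depleted, which is why the ranking uses the absolute value.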
4. Constructing a decision tree model.
Samples are selected by random sampling from the intestinal-type data set B_j. For example, when j = 1, the samples of intestinal type B_1 are used as the training and validation sets, with the 20 differential flora as features. The corresponding table is as follows:
Sample number  Otu15    ...  Otu109   ...  Otu163   Group
...            ...      ...  ...      ...  ...      ...
Sample15       Val_j1   ...  Val_jj   ...  Val_jp   Case
...            ...      ...  ...      ...  ...      ...
Sample1000     Val_si1  ...  Val_sij  ...  Val_sip  Control
First, an initial decision tree model is built; the model is then trained with the data in the table above, and the trained decision tree model can be used to judge whether a sample belongs to the non-healthy group or the healthy group. Because there are 5 intestinal-type data sets, 5 decision tree models are obtained after training, denoted Tree1, Tree2, Tree3, Tree4 and Tree5.
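The text does not name the tree-learning algorithm, the tree depth, or whether sampling is done with replacement. The sketch below is therefore only a simplified stand-in: each "tree" is a depth-1 decision stump, samples are drawn without replacement, and the fraction of stumps voting "Case" is reported as a non-healthy probability. All function names and the toy data are hypothetical.

```python
import random


def train_stump(rows, labels):
    """Fit a depth-1 tree: the (feature, threshold) split with the fewest errors."""
    best = None
    for j in range(len(rows[0])):
        for threshold in sorted({row[j] for row in rows}):
            for above in ("Case", "Control"):
                below = "Control" if above == "Case" else "Case"
                errors = sum(
                    1
                    for row, label in zip(rows, labels)
                    if (above if row[j] > threshold else below) != label
                )
                if best is None or errors < best[0]:
                    best = (errors, j, threshold, above, below)
    return best[1:]


def predict_stump(stump, row):
    j, threshold, above, below = stump
    return above if row[j] > threshold else below


def train_bagged_stumps(rows, labels, n_trees, sample_size, seed=0):
    """Build n_trees stumps, each on sample_size randomly selected samples."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_trees):
        picks = rng.sample(range(len(rows)), sample_size)
        stumps.append(
            train_stump([rows[i] for i in picks], [labels[i] for i in picks])
        )
    return stumps


def non_healthy_probability(stumps, row):
    """Fraction of stumps that vote 'Case' for the given sample."""
    votes = [predict_stump(stump, row) for stump in stumps]
    return votes.count("Case") / len(votes)


if __name__ == "__main__":
    # Toy data, invented for illustration: two differential features per sample.
    rows = [[0.9, 0.1], [0.8, 0.3], [0.7, 0.2], [0.2, 0.2], [0.1, 0.1], [0.3, 0.3]]
    labels = ["Case", "Case", "Case", "Control", "Control", "Control"]
    model = train_bagged_stumps(rows, labels, n_trees=5, sample_size=4)
    print(non_healthy_probability(model, [0.85, 0.2]))
```

Training one such model per intestinal type on that type's differential features would give the five models Tree1 to Tree5; a production version would more likely use full decision trees from an established library.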
5. Non-healthy calculation.
A new input sample to be tested contains 300 OTU features.
First, according to the formula
Figure GDA0004061934540000142
the absolute abundance is converted into relative abundance. The sample is then compared one by one with the samples in the 5 intestinal-type sets and the similarity is calculated, giving the sample s with the highest similarity to the input sample. If s belongs to intestinal type B_i, the new sample is likewise classified as intestinal type B_i, and the decision tree model corresponding to B_i is used to predict the input sample. Here, assuming that s belongs to B_3, the Tree3 decision tree model is used to judge the new input sample and finally determine whether the input belongs to the "non-healthy" or the "healthy" type.
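The routing step described above (find the most similar reference sample, take its intestinal type, then apply that type's model) can be sketched as follows. Because the patent's similarity formula is only available as a figure, the Bray-Curtis dissimilarity is used here purely as a stand-in; the function names, the reference data and the placeholder models are all hypothetical.

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity (stand-in measure; smaller means more alike)."""
    numerator = sum(abs(a - b) for a, b in zip(x, y))
    denominator = sum(a + b for a, b in zip(x, y))
    return numerator / denominator if denominator else 0.0


def route_and_predict(new_sample, reference, models, distance=bray_curtis):
    """Pick the model of the intestinal type of the nearest reference sample.

    reference: list of (relative_abundance_vector, intestinal_type) pairs.
    models: dict mapping intestinal_type to a callable that classifies a sample.
    Returns (intestinal_type, prediction).
    """
    _, nearest_type = min(
        reference, key=lambda item: distance(new_sample, item[0])
    )
    return nearest_type, models[nearest_type](new_sample)


if __name__ == "__main__":
    # Toy reference set and placeholder models, invented for illustration.
    reference = [([0.7, 0.2, 0.1], 1), ([0.1, 0.2, 0.7], 3)]
    models = {
        1: lambda sample: "healthy",
        3: lambda sample: "non-healthy",
    }
    print(route_and_predict([0.15, 0.25, 0.60], reference, models))
```

Passing the distance as a parameter keeps the routing logic independent of whichever similarity measure is actually used.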
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents or improvements made within the technical scope of the present invention should be included in the scope of the present invention.

Claims (10)

1. A method for classifying intestinal microecological differential bacteria is characterized by comprising the following steps:
step S10, collecting a sample set A, wherein the sample set A comprises n samples at least consisting of a non-healthy group sample DG and a control group sample CG, and the sample characteristics of each sample consist of basic information of the sample and the absolute abundance of intestinal flora obtained after 16S sequencing;
step S11, converting the numerical features of each sample from absolute abundance into relative abundance, and screening out the absolute abundance features whose occurrence frequency in the samples is less than a preset value and whose relative abundance is close to zero;
step S12, dividing all samples in the sample set A into k groups of intestinal types according to the similarity of absolute abundance, the classified intestinal-type set being denoted AB: {B_1, B_2, ..., B_k}; then dividing the samples in the intestinal-type set AB into non-healthy group samples DG and control group samples CG, the non-healthy group sample set being denoted DG: {D_1, D_2, ..., D_k} and the control group sample set being denoted CG: {C_1, C_2, ..., C_k};
step S13, calculating and ranking the relative abundance differences between the non-healthy group sample set DG and the control group sample set CG to obtain the difference features corresponding to each of the k groups of intestinal types, DF: {F_1, F_2, ..., F_i, ..., F_k}.
2. The intestinal micro-ecological differential bacteria classification processing method according to claim 1, wherein in step S10, the features of n samples are firstly arranged into a matrix, each row of the matrix is composed of a sample feature, and the sample features are expressed as: number + age + sampling address + non-healthy type + antibiotic type + OTU_1 + OTU_2 + ... + OTU_m, where OTU represents absolute abundance and m represents the number of OTUs contained in each sample feature.
3. The intestinal differential micro-ecological bacteria classification processing method according to claim 2, characterized in that in step S11, the absolute abundance is converted into the relative abundance according to the following steps:
letting the abundance of a certain flora in the sample be R_ij, the relative abundance of the j-th feature in the i-th sample is:
Figure FDA0003150324900000011
4. The intestinal differential micro-ecological bacteria classification processing method of claim 3, wherein in step S11, if the number of samples, among the n samples in the matrix, in which a certain absolute abundance feature has an abundance of more than 0.01% is S, the occurrence frequency of that flora is the ratio S/n.
5. The intestinal micro-ecological differential bacteria classification processing method of claim 4, characterized in that in step S11, interference data of low-abundance features are screened out of each intestinal-type data set B_i in the intestinal-type set AB, wherein a low-abundance feature is defined as one with an occurrence frequency of less than 10% and a relative abundance value of less than 0.01%.
6. The method according to claim 1, wherein in step S12, for sample x: [x_1, x_2, ..., x_i, ..., x_m] and sample y: [y_1, y_2, ..., y_i, ..., y_m], the similarity sim of sample x and sample y is calculated using the following formulas:
Figure FDA0003150324900000021
Figure FDA0003150324900000022
7. A method for assessing intestinal health, comprising:
step S20, constructing k machine learning models according to k groups of intestinal type difference characteristics obtained by the intestinal microecological difference bacterium classification processing method of claim 1;
step S21, inputting a group of sample data with known absolute abundance, converting the absolute abundance of the sample data into relative abundance, and predicting the unhealthy probability of the sample data by using the k machine learning models obtained in the step S20.
8. The method for evaluating intestinal health according to claim 7, wherein in the step S20, the process of constructing k machine learning models includes:
step S200, inputting the k groups of intestinal-type difference features DF: {F_1, F_2, ..., F_i, ..., F_k}, and, for the difference features F_i of the i-th group of intestinal types, letting the number of features remaining after screening be p and the number of samples be s_i;
step S201, for a data set containing s_i samples and p features:
selecting s_m samples from the s_i samples by random sampling, selecting t features by using a differential feature screening method, establishing a decision tree from the t features for the selected samples, and repeating the sampling step k times to generate k decision tree models dcTree: {tree_1, tree_2, ..., tree_k}.
9. The method for evaluating intestinal health according to claim 7, wherein the step S21 of predicting the non-health probability comprises:
step S210, converting the absolute abundance of the flora in the input sample data into a relative abundance, and setting the input sample as x: [R_1, R_2, ..., R_i, ..., R_m];
step S211, comparing the input sample x with the samples in the sample set A one by one, and finding the sample s with the highest similarity;
step S212, determining the intestinal type B_i to which the sample s belongs;
step S213, predicting the input sample x by using the decision tree corresponding to the intestinal type B_i.
10. The method for evaluating intestinal health according to claim 7, wherein the step S21 of predicting the non-health probability comprises:
predicting the input sample x with each of the k machine learning models to obtain k unhealthy probabilities, and determining and outputting the highest unhealthy probability.
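As an informal illustration of the alternative in claim 10 (not part of the claims), the sketch below applies every model to the input sample and reports the largest probability. The function name `highest_unhealthy_probability` and the placeholder models are hypothetical; each model is assumed to be a callable that returns a probability between 0 and 1.

```python
def highest_unhealthy_probability(sample, models):
    """Apply every model to the sample and return (model_index, highest_probability)."""
    probabilities = [model(sample) for model in models]
    best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
    return best_index, probabilities[best_index]


if __name__ == "__main__":
    # Placeholder models, invented for illustration only.
    models = [lambda s: 0.2, lambda s: 0.7, lambda s: 0.4]
    print(highest_unhealthy_probability([0.1, 0.9], models))
```

Compared with routing the sample to a single intestinal type, this variant needs no similarity search but evaluates all k models for every input.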
CN202110764854.7A 2021-07-06 2021-07-06 Intestinal microecological differential bacteria classification processing method and intestinal health assessment method Active CN113486954B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110764854.7A CN113486954B (en) 2021-07-06 2021-07-06 Intestinal microecological differential bacteria classification processing method and intestinal health assessment method

Publications (2)

Publication Number Publication Date
CN113486954A CN113486954A (en) 2021-10-08
CN113486954B true CN113486954B (en) 2023-04-07

Family

ID=77941519

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110764854.7A Active CN113486954B (en) 2021-07-06 2021-07-06 Intestinal microecological differential bacteria classification processing method and intestinal health assessment method

Country Status (1)

Country Link
CN (1) CN113486954B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016008954A1 (en) * 2014-07-15 2016-01-21 Institut National De La Recherche Agronomique Gut bacterial species in hepatic diseases
CN110730665A (en) * 2017-04-07 2020-01-24 儿童医院医疗中心 Treatment of inflammatory bowel disease with 2' -fucosyllactose compounds

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA3062868A1 (en) * 2016-08-01 2018-02-08 Scaled Microbiomics, Llc Systems and methods for altering microbiome to reduce disease risk and manifestations of disease
EP3634434A4 (en) * 2017-05-12 2021-06-09 The Regents of The University of California Treating and detecting dysbiosis
SG11202002500SA (en) * 2017-11-06 2020-04-29 Psomagen Inc Control processes for microorganism-related characterization processes
KR20210091119A (en) * 2018-08-17 2021-07-21 베단타 바이오사이언시즈, 인크. How to reduce intestinal microbiome and restore microbiome
CN112992351B (en) * 2021-03-09 2024-03-08 广西爱生生命科技有限公司 Feature expression method and evaluation method for human intestinal health state

Also Published As

Publication number Publication date
CN113486954A (en) 2021-10-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant