CN113486954A - Intestinal micro-ecological differential bacteria classification processing method and intestinal health assessment method - Google Patents

Publication number
CN113486954A
Authority
CN
China
Prior art keywords
sample
intestinal
samples
abundance
difference
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110764854.7A
Other languages
Chinese (zh)
Other versions
CN113486954B (en)
Inventor
陈晓春
王小军
覃涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi Aisheng Life Technology Co ltd
Original Assignee
Guangxi Aisheng Life Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi Aisheng Life Technology Co ltd
Priority to CN202110764854.7A
Publication of CN113486954A
Application granted
Publication of CN113486954B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention discloses a classification processing method for intestinal microecological differential bacteria and an intestinal health assessment method. The classification processing method comprises the following steps: step S10, collecting a sample set A comprising n samples, which consist at least of non-healthy group samples DG and control group samples CG; step S11, converting the digitized features of each sample from absolute abundance to relative abundance; step S12, dividing all samples in the sample set A into k groups of intestinal types according to the similarity of absolute abundance, recording the classified intestinal type set as AB, then separating the non-healthy group samples from the control group samples within the intestinal type set AB, recording the non-healthy group sample set as DG and the control group sample set as CG; and step S13, calculating the relative abundance differences between the non-healthy group sample set DG and the control group sample set CG and ranking them to obtain the difference features DF corresponding to each of the k groups of intestinal types. The invention can effectively extract sample features that have high dimensionality and a wide range of abundance variation, and it is also more robust to interference.

Description

Intestinal micro-ecological differential bacteria classification processing method and intestinal health assessment method
Technical Field
The invention relates to an intestinal microbial data analysis and processing method, in particular to an intestinal microecological differential bacterium classification and processing method and an intestinal health assessment method.
Background
In recent years, the gut microbiota has been found not only to play a role in cardiovascular and metabolic diseases, but also to be an important environmental factor in several tumors, including colorectal, liver, and breast cancers. With the development of next-generation gene sequencing, other genomic technologies and metabolomics, the role of the intestinal microbiota in the occurrence and development of tumors has attracted attention, and a growing number of animal and clinical studies show that the intestinal microbiota can serve as a marker and a therapeutic target for the screening and prognosis prediction of non-healthy conditions affecting intestinal health, liver cancer and the like.
On the other hand, the structure of the intestinal microbiota changes with age. In particular, age-related changes in dietary structure can alter the biodiversity of the intestinal flora and determine the relative abundance of specific flora, leading to flora imbalance and negative effects on host physiology. The microbiota profile of aging individuals is not dominated by a particular group of species but is characterized by reduced diversity of the intestinal flora. For example, in patients with a high frailty score, fecal samples show a significantly reduced number of lactic acid bacteria and reduced proportions of Prevotella and of Bacteroides/Prevotella, together with an increase in Enterobacteria and Ruminococcus; frailty is significantly negatively correlated with intestinal flora diversity.
The data used for intestinal microecology analysis are usually obtained by high-throughput sequencing and are described by a matrix. Each row of the matrix corresponds to one sample and holds the abundances of the different flora contained in that sample; the rows of all samples together form the matrix. The sum of the flora abundances in each sample is stable, so if the abundance of one component decreases, the abundance of one or more other components may increase.
Most existing references focus on building an overall processing system and outputting analysis results, and their content is relatively homogeneous. Specifically:
patent document CN105046094B constructs a processing system for analyzing and storing intestinal flora data, including a dynamic database for obtaining the latest detection parameters;
patent document CN107506582A constructs a health risk prediction system based on intestinal microorganisms and focuses on overall module construction and the format design of the evaluation results;
patent document CN108841974A evaluates maturity by comparing the compositional similarity of infant and maternal intestinal microbes;
patent document CN110144415A discloses a method for predicting the health and immunity of introduced cows based on intestinal flora, using conventional abundance analysis and diversity analysis;
patent document CN111161794A evaluates the intestinal flora of a target object to obtain specific intestinal flora information;
patent documents CN111462819A and CN112151118A both focus on software automation of the intestinal flora data analysis process; the analysis mainly covers abundance data, analysis of probiotics and pathogenic bacteria, diversity analysis, and interpretation in combination with disease-related databases.
In practice, there are many methods for analyzing intestinal microecological data. Generally, a non-healthy group and a healthy group are constructed and compared, differential bacteria are obtained through comparative analysis, and supervised learning is then applied to build a model that can predict non-healthy status in humans. The difficulty lies in effectively extracting sample features that have high dimensionality and a wide range of abundance variation.
Disclosure of Invention
The technical problem to be solved by the invention is to overcome the deficiencies of the prior art by providing an intestinal microecological differential bacteria classification processing method and an intestinal health assessment method that address the wide distribution range and strong interference of intestinal flora data, obtain key features based on a clustering classification algorithm, and thereby improve the intermediate analysis process.
In order to solve the technical problems, the invention adopts the following technical scheme.
A method for classifying intestinal microecological differential bacteria comprises the following steps: step S10, collecting a sample set A, wherein the sample set A comprises n samples consisting at least of non-healthy group samples DG and control group samples CG, and the sample features of each sample consist of the basic information of the sample and the absolute abundance of intestinal flora obtained after 16S sequencing; step S11, converting the digitized features of each sample from absolute abundance to relative abundance, and screening out the abundance features whose occurrence frequency is less than a preset value and whose relative abundance is close to zero; step S12, dividing all samples in the sample set A into k groups of intestinal types according to the similarity of absolute abundance, recording the classified intestinal type set as AB: {B_1, B_2, ..., B_k}, and separating the non-healthy group samples from the control group samples within the intestinal type set AB, recording the non-healthy group sample set as DG: {D_1, D_2, ..., D_k} and the control group sample set as CG: {C_1, C_2, ..., C_k}; step S13, calculating the relative abundance differences between the non-healthy group sample set DG and the control group sample set CG and ranking them to obtain the difference features corresponding to each of the k groups of intestinal types, DF: {F_1, F_2, ..., F_i, ..., F_k}.
Preferably, in step S10, the features of the n samples are arranged into a matrix in which each row holds the features of one sample, expressed as: number + age + sampling address + non-healthy type + antibiotic type + OTU_1 + OTU_2 + ... + OTU_m, where OTU represents absolute abundance and m represents the number of OTUs contained in each sample feature.
Preferably, in step S11, if the number of samples, among the n samples in the matrix, in which a given abundance feature exceeds 0.01% is S, the occurrence frequency of that flora is the ratio S/n.
Preferably, in step S11, low-abundance interference features are screened out of each intestinal type data set B_i in the intestinal type set AB, wherein a low-abundance feature is defined as one with an occurrence frequency of less than 10% and a relative abundance value of less than 0.01%.
A method for assessing gut health, comprising: step S20, constructing k machine learning models according to k groups of intestinal type difference characteristics obtained by the intestinal microecological difference bacterium classification processing method of claim 1; step S21, inputting a group of sample data with known absolute abundance, converting the absolute abundance of the sample data into relative abundance, and predicting the unhealthy probability of the sample data by using the k machine learning models obtained in the step S20.
Preferably, in step S20, the process of constructing the k machine learning models includes: step S200, inputting the k groups of intestinal type difference features DF: {F_1, F_2, ..., F_i, ..., F_k}, and, for the difference feature F_i of the i-th group of intestinal types, letting the number of features remaining after screening be p and the number of samples be si; step S201, for a data set containing si samples and p features: selecting sm samples from the si samples through random sampling, selecting t features by the difference feature screening method, building a decision tree on the selected samples using the t features, and repeating the sampling step k times to generate k decision tree models dcTree: {tree_1, tree_2, ..., tree_k}.
Preferably, in step S21, the prediction process of the unhealthy probability includes: step S210, converting the absolute abundance of the flora in the input sample data into relative abundance, and denoting the input sample as x: [R_1, R_2, ..., R_i, ..., R_m]; step S211, comparing the input sample x with the samples in the sample set A one by one to find the sample S with the highest similarity; step S212, determining the intestinal type B_i to which the sample S belongs; step S213, predicting the input sample x using the decision tree corresponding to the intestinal type B_i.
Preferably, in step S21, the prediction process of the unhealthy probability includes: and predicting the input sample x by using k machine learning models respectively to obtain k unhealthy probabilities, and determining and outputting the highest unhealthy probability.
In the intestinal micro-ecological difference bacterium classification processing method and the intestinal health assessment method, the steps of sample characteristic digitization processing, format conversion and filtration of digitized characteristics, sample similarity clustering and difference characteristic screening are sequentially carried out on the collected sample set A.
Drawings
FIG. 1 is a flow chart of the intestinal micro-ecological difference bacteria classification processing method of the present invention;
fig. 2 is a flow chart of the intestinal health assessment method according to the present invention.
Detailed Description
The invention is described in more detail below with reference to the figures and examples.
The invention discloses a classification processing method of intestinal microecological difference bacteria, please refer to fig. 1, which comprises the following steps:
step S10, collecting a sample set A, wherein the sample set A comprises n samples at least consisting of a non-healthy group sample DG and a control group sample CG, and the sample characteristics of each sample consist of the basic information of the sample and the absolute abundance of intestinal flora obtained after 16S sequencing;
step S11, converting the digitized features of each sample from absolute abundance to relative abundance, and screening out the absolute abundance features of which the occurrence frequency is less than a preset value and the relative abundance is close to zero;
step S12, dividing all samples in the sample set A into k groups of intestinal types according to the similarity of absolute abundance, recording the classified intestinal type set as AB: {B_1, B_2, ..., B_k}, and separating the non-healthy group samples from the control group samples within the intestinal type set AB, recording the non-healthy group sample set as DG: {D_1, D_2, ..., D_k} and the control group sample set as CG: {C_1, C_2, ..., C_k};
step S13, calculating the relative abundance differences between the non-healthy group sample set DG and the control group sample set CG and ranking them to obtain the difference features corresponding to each of the k groups of intestinal types, DF: {F_1, F_2, ..., F_i, ..., F_k}.
In the above method, the collected sample set A undergoes, in sequence, digitization of sample features, format conversion and filtering of the digitized features, sample similarity clustering, and difference feature screening. Compared with the prior art, the method obtains key features based on a clustering classification algorithm and thereby improves the intermediate analysis process; it can effectively extract sample features with high dimensionality and a wide range of abundance variation, is more robust to interference, and better meets application requirements.
For detailed implementation of steps S10 to S13, please refer to the following first to fourth embodiments.
Example one
In this embodiment, the step S10 mainly implements a process of digitizing the sample features.
For step S10, a data set containing n samples is collected, comprising expert-labeled non-healthy group samples DG and control group samples CG. The number of non-healthy group samples is n1 and the number of control group samples is n2; note that n1 and n2 should be close in value, and n = n1 + n2. The sample features consist of the basic information of the sample and the absolute abundance (OTU features) of intestinal flora obtained after 16S sequencing, and each sample comprises at most m features.
In step S10, the features of the n samples are arranged into a matrix in which each row holds the features of one sample, expressed as: number + age + sampling address + non-healthy type + antibiotic type + OTU_1 + OTU_2 + ... + OTU_m, where OTU represents absolute abundance and m represents the number of OTUs contained in each sample feature. Note that each sample has m OTU feature values.
During processing, samples with a non-health history and an antibiotic intake history are filtered out.
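The row layout and the history filter described above can be sketched as follows. This is only an illustration under assumptions: the field names, the `non_health_history` flag and the sample values are hypothetical, and the sketch treats either kind of history as grounds for exclusion, which is one possible reading of the text.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Sample:
    number: str
    age: int
    sampling_address: str
    non_healthy_type: Optional[str]      # None marks a control-group (CG) sample
    antibiotic_type: Optional[str]       # None means no antibiotic intake recorded
    otu: List[int]                       # absolute abundances OTU_1 ... OTU_m
    non_health_history: bool = False     # hypothetical flag, not a field named in the text

def filter_samples(samples: List[Sample]) -> List[Sample]:
    """Drop samples with a non-health history or a recorded antibiotic intake."""
    return [s for s in samples
            if not s.non_health_history and s.antibiotic_type is None]

def to_matrix(samples: List[Sample]) -> List[List[int]]:
    """Build the abundance matrix: one row of OTU values per sample."""
    return [list(s.otu) for s in samples]

# Made-up records, for illustration only.
samples = [
    Sample("S001", 45, "site-1", "DG", None, [120, 0, 30]),
    Sample("S002", 52, "site-1", None, "amoxicillin", [80, 10, 10]),
    Sample("S003", 39, "site-2", None, None, [60, 25, 15], non_health_history=True),
    Sample("S004", 41, "site-2", None, None, [70, 20, 10]),
]
kept = filter_samples(samples)
matrix = to_matrix(kept)
```

Only the OTU columns enter the matrix here; the basic information stays on the `Sample` record so that group labels remain available for the later DG/CG split.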
Example two
In this embodiment, the step S11 mainly implements format conversion and filtering functions of the digital feature.
For the step S11, the method further comprises a format conversion step and a low abundance feature screening step:
For the format conversion step, the relative abundance of a flora is the ratio of the abundance of that flora in a given sample to the sum of the abundances of all flora in that sample. Therefore, in step S11, the absolute abundance is converted to relative abundance as follows:
Let the abundance of a given flora in a sample be R_ij. The non-healthy group DG and the control group CG contain n1 and n2 samples respectively, each sample has m features, and the relative abundance of the j-th feature in the i-th sample is:
$$\text{relative abundance}_{ij} = \frac{R_{ij}}{\sum_{l=1}^{m} R_{il}}$$
based on the above calculation, the digitized features of each sample are converted from absolute abundance to relative abundance.
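A minimal sketch of this conversion, assuming the samples are held as rows of a NumPy array. The function name and the handling of all-zero rows are assumptions made for the illustration, not part of the described method.

```python
import numpy as np

def to_relative_abundance(counts):
    """Divide each row of absolute abundances by that row's total."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Assumption: a sample whose total is zero is returned as all zeros
    # instead of raising a division-by-zero warning.
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)

absolute = np.array([[30, 10, 60],
                     [5, 0, 15]])
relative = to_relative_abundance(absolute)
```

After the conversion every non-empty row sums to 1, which makes samples with different sequencing depths comparable.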
For the low-abundance feature screening step, the occurrence frequency of each flora is calculated first: in step S11, if the number of samples, among the n samples in the matrix (composed of all samples and features), in which a given abundance feature exceeds 0.01% is S, the occurrence frequency of that flora is the ratio S/n.
The specific screening is as follows: in step S11, low-abundance interference features are screened out of each intestinal type data set B_i in the intestinal type set AB, wherein a low-abundance feature is defined as one with an occurrence frequency of less than 10% and a relative abundance value of less than 0.01%. That is, OTU features that occur rarely in the samples and have near-zero relative abundance are screened out.
After the treatment, the abundance characteristic of the intestinal flora is represented by relative abundance instead of absolute abundance, and the characteristic number is reduced from m to p, wherein p < m.
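The screening rule can be sketched as below. The text does not say which "relative abundance value" is compared with 0.01%; this sketch assumes the mean over samples, and the function and parameter names are hypothetical.

```python
import numpy as np

def filter_low_abundance(rel, abundance_cutoff=1e-4, frequency_cutoff=0.10):
    """Remove features that are both rare and near zero.

    rel is an (n_samples, m_features) matrix of relative abundances;
    1e-4 corresponds to 0.01% and 0.10 to a 10% occurrence frequency.
    """
    rel = np.asarray(rel, dtype=float)
    # Occurrence frequency S/n: share of samples in which the feature exceeds 0.01%.
    frequency = (rel > abundance_cutoff).mean(axis=0)
    # Assumption: the "relative abundance value" is read as the mean over samples.
    mean_abundance = rel.mean(axis=0)
    low = (frequency < frequency_cutoff) & (mean_abundance < abundance_cutoff)
    keep = np.flatnonzero(~low)
    return rel[:, keep], keep

# Twenty identical illustrative samples; the third feature never exceeds 0.01%.
rel = np.tile([0.6, 0.39995, 0.00005], (20, 1))
filtered, kept = filter_low_abundance(rel)
```

Returning the kept column indices alongside the reduced matrix lets the same p features be selected later from a new input sample.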
Example three
In this embodiment, the step S12 is a sample similarity clustering process.
Step S12 involves a sample similarity clustering procedure: all samples in the sample set A are grouped according to the similarity of their OTU features, with highly similar samples placed in the same group, yielding k intestinal types, each of which contains a number of samples.
Further, in step S12, given a sample x: [x_1, x_2, ..., x_i, ..., x_m] and a sample y: [y_1, y_2, ..., y_i, ..., y_m], the similarity sim of sample x and sample y is calculated using the following formula:
Figure BDA0003150324910000081
Since the sample set A contains a large number of samples with the same structure as x and y, a data clustering method can be used: the similarity between every pair of samples is calculated, the samples are divided into groups, and highly similar samples are placed in the same group. If the samples may be divided into at most k groups, A can be expressed as {B_1, B_2, ..., B_k}, where the samples within each group are highly similar to one another and samples from different groups are less similar.
Given a new sample s_n, to determine which group it belongs to, its similarity to the samples in the different groups is calculated, the most similar sample q is found, and s_n is assigned to the group containing q.
The q features with the highest abundance in the samples may be selected (for example, q = 10) before the similarity and classification are calculated. The classified intestinal type set is denoted AB: {B_1, B_2, ..., B_k}. The non-healthy group (DG) samples in AB are then separated from the control group (CG) samples, so that the non-healthy group sample set is denoted DG: {D_1, D_2, ..., D_k} and the control group sample set is denoted CG: {C_1, C_2, ..., C_k}.
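The assignment of a new sample to the group of its most similar reference sample can be sketched as follows. The similarity formula of the method is given only as an image, so this sketch substitutes one minus the Bray-Curtis dissimilarity as a stand-in; ranking the q features by mean abundance across the reference samples is likewise an assumption, and all names and values are hypothetical.

```python
import numpy as np

def bray_curtis_similarity(x, y):
    """Stand-in similarity: 1 - sum|x - y| / sum(x + y)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    denom = (x + y).sum()
    return 1.0 if denom == 0 else 1.0 - np.abs(x - y).sum() / denom

def top_q_features(rel, q=10):
    """Indices of the q features with the highest mean relative abundance."""
    mean = np.asarray(rel, dtype=float).mean(axis=0)
    return np.argsort(mean)[::-1][:q]

def assign_group(new_sample, reference, groups, q=10):
    """Return the group label of the most similar reference sample and its index."""
    reference = np.asarray(reference, dtype=float)
    new_sample = np.asarray(new_sample, dtype=float)
    cols = top_q_features(reference, q)
    sims = [bray_curtis_similarity(new_sample[cols], row[cols]) for row in reference]
    best = int(np.argmax(sims))
    return groups[best], best

# Illustrative relative abundances with two visibly different profiles.
reference = np.array([[0.70, 0.20, 0.10], [0.68, 0.22, 0.10],
                      [0.10, 0.20, 0.70], [0.12, 0.18, 0.70]])
groups = ["B1", "B1", "B2", "B2"]
group, nearest = assign_group([0.11, 0.20, 0.69], reference, groups, q=3)
```

Restricting the comparison to the q most abundant features keeps the many near-zero features from dominating the similarity.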
Example four
In this embodiment, the step S13 mainly relates to a process of screening for difference characteristics based on the intestinal type.
For said step S13, in the screening process based on the difference characteristics of the intestinal type:
the sample combinations obtained after regrouping by intestinal type are: D_1|C_1, D_2|C_2, ..., D_i|C_i, ..., D_k|C_k. Note that D_i and C_i (1 ≤ i ≤ k) are paired because they belong to the same intestinal type. In this way, samples of similar intestinal types are put together for difference feature analysis.
The flora features of the non-healthy group D_i and the control group C_i are compared: the differences in the values of all features between the two groups of samples are calculated and sorted, giving a ranked sequence of difference features. Finally, the difference features corresponding to the k groups of intestinal types are obtained and denoted DF: {F_1, F_2, ..., F_i, ..., F_k}.
By analyzing the relationship between the difference values in the difference feature sequence and the sample class labels, t features (t ≤ q) with clear advantages can be selected for machine learning modeling of the k groups of samples.
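The ranking of difference features within one intestinal type can be sketched as below. The text does not specify how the difference is computed, so the difference of group means, ordered by absolute value, is an assumption; the function name and data are hypothetical.

```python
import numpy as np

def rank_difference_features(d_i, c_i, t):
    """Rank features by the gap between non-healthy (D_i) and control (C_i) samples.

    d_i and c_i are (samples, features) matrices of relative abundances
    restricted to the same intestinal type.
    """
    d_i, c_i = np.asarray(d_i, dtype=float), np.asarray(c_i, dtype=float)
    # Assumption: the "value difference" is the difference of the group means.
    diff = d_i.mean(axis=0) - c_i.mean(axis=0)
    order = np.argsort(-np.abs(diff))   # largest absolute difference first
    top = order[:t]
    return top, diff[top]

d_1 = [[0.5, 0.3, 0.2], [0.6, 0.2, 0.2]]   # illustrative non-healthy samples
c_1 = [[0.2, 0.3, 0.5], [0.2, 0.4, 0.4]]   # illustrative control samples
features, differences = rank_difference_features(d_1, c_1, t=2)
```

Running this once per pair D_i|C_i gives the k feature lists F_1 to F_k.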
Based on the methods described in the first to fourth embodiments, the invention realizes the screening of differential microecological bacteria in the human intestinal tract. On this basis, the invention further relates to a machine learning model for predicting the health status of the intestinal tract, specifically an intestinal health assessment method which, referring to fig. 2, includes:
step S20, constructing k machine learning models according to k groups of intestinal type difference characteristics obtained by the intestinal microecological difference bacterium classification processing method of claim 1;
step S21, inputting a group of sample data with known absolute abundance, converting the absolute abundance of the sample data into relative abundance, and predicting the unhealthy probability of the sample data by using the k machine learning models obtained in the step S20.
For the step S20, the process of constructing k machine learning models includes:
first, there are k groups of intestinal types, and the difference feature of the i-th group is F_i, which contains p features and si samples. The table corresponding to F_i is as follows:
| Sample number | Feature 1 | ... | Feature j | ... | Feature t |
|---|---|---|---|---|---|
| Sample 1 | Val_11 | ... | Val_1j | ... | Val_1t |
| ... | ... | ... | ... | ... | ... |
| Sample j | Val_j1 | ... | Val_jj | ... | Val_jt |
| ... | ... | ... | ... | ... | ... |
| Sample si | Val_si1 | ... | Val_sij | ... | Val_sit |
In the above table, the original sample contains m features, of which p remain after filtering (p ≤ m); the q most abundant of the p features are selected to calculate similarity (q ≤ p); and finally t features are obtained by difference analysis (t ≤ q).
The step S20 further includes:
step S200, inputting the k groups of intestinal type difference features DF: {F_1, F_2, ..., F_i, ..., F_k}, and, for the difference feature F_i of the i-th group of intestinal types, letting the number of features remaining after screening be p and the number of samples be si;
step S201, for a data set containing si samples and p features:
selecting sm samples from the si samples through random sampling, selecting t features by the difference feature screening method, building a decision tree on the selected samples using the t features, and repeating the sampling step k times to generate k decision tree models dcTree: {tree_1, tree_2, ..., tree_k}.
That is, for a data set containing si samples and p features: selecting sm samples from the si samples through random sampling, selecting t features by using a difference feature screening method, and establishing a decision tree for the selected samples by using the features. Repeating the sampling step k times to generate k decision trees to form a decision tree model. And for new unclassified data, the k decision tree models established above can be used for judging one by one, so that k judgment results can be obtained, and finally, the best judgment result is selected as the class to which the new sample belongs, and the result is output.
In practical applications, the above model is not limited to the decision tree, but is also applicable to other machine learning models such as linear regression, K-nearest neighbor or support vector machine methods.
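The repeated-sampling construction for one intestinal type can be sketched as follows. To stay short, the sketch uses depth-one trees (decision stumps) as a stand-in for full decision trees, samples with replacement, and reports the share of trees voting "non-healthy" as the probability; all three choices, and every name and value below, are assumptions for illustration.

```python
import numpy as np

def fit_stump(X, y):
    """Fit a depth-one tree: the (feature, threshold, polarity) with the fewest errors."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            above = (X[:, j] > thr).astype(int)
            for polarity in (0, 1):          # polarity 1 flips the predicted labels
                pred = above if polarity == 0 else 1 - above
                err = np.mean(pred != y)
                if best is None or err < best[0]:
                    best = (err, j, thr, polarity)
    return best[1:]

def predict_stump(stump, X):
    j, thr, polarity = stump
    above = (X[:, j] > thr).astype(int)
    return above if polarity == 0 else 1 - above

def fit_bagged_stumps(X, y, n_trees, sm, seed=0):
    """Repeat n_trees times: draw sm samples at random and fit one tree on them."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=int)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=sm)   # sampling with replacement (assumption)
        trees.append(fit_stump(X[idx], y[idx]))
    return trees

def predict_non_healthy_probability(trees, X):
    """Share of trees that vote 'non-healthy' (label 1) for each row of X."""
    X = np.asarray(X, dtype=float)
    votes = np.stack([predict_stump(tree, X) for tree in trees])
    return votes.mean(axis=0)

# Illustrative data: rows are samples, columns are the t selected features.
X = np.array([[0.60, 0.10], [0.50, 0.20], [0.70, 0.10],
              [0.10, 0.20], [0.05, 0.10], [0.08, 0.15]])
y = np.array([1, 1, 1, 0, 0, 0])          # 1 = non-healthy, 0 = control
trees = fit_bagged_stumps(X, y, n_trees=25, sm=6)
probs = predict_non_healthy_probability(trees, [[0.90, 0.10], [0.00, 0.10]])
```

Building one such ensemble per intestinal type yields the k models used in step S21.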
For the step S21, please refer to fig. 2, there are two ways to realize the non-health probability prediction:
in a first way, the prediction process of the non-health probability may be:
step S210, converting the absolute abundance of the flora in the input sample data into relative abundance, and denoting the input sample as x: [R_1, R_2, ..., R_i, ..., R_m];
step S211, comparing the input sample x with the samples in the sample set A one by one to find the sample S with the highest similarity;
step S212, determining the intestinal type B_i to which the sample S belongs;
step S213, predicting the input sample x using the decision tree corresponding to the intestinal type B_i.
In the above steps, the intestinal type to which S belongs is determined; when S belongs to intestinal type k (0 < k < NC + 1), the machine learning model of the k-th intestinal type is used to predict the unhealthy probability of the sample.
In a second way, the prediction process of the non-health probability may be:
and predicting the input sample x by using k machine learning models respectively to obtain k unhealthy probabilities, and determining and outputting the highest unhealthy probability.
In this way, no similarity comparison is performed: the sample x is input directly into the k machine learning models to obtain k predicted unhealthy probabilities, and the highest of them is output as the unhealthy probability of the sample.
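The two prediction routes can be contrasted in a short sketch. The models are represented here as plain callables returning a probability, and the similarity is the overlap of two relative-abundance vectors; both are stand-ins chosen for illustration, and all names and numbers are hypothetical.

```python
def overlap_similarity(a, b):
    """Stand-in similarity: shared mass of two relative-abundance vectors."""
    return sum(min(p, q) for p, q in zip(a, b))

def predict_by_nearest_type(x, reference, reference_types, models, similarity):
    """First way: use only the model of the most similar sample's intestinal type."""
    best = max(range(len(reference)), key=lambda i: similarity(x, reference[i]))
    gut_type = reference_types[best]
    return gut_type, models[gut_type](x)

def predict_by_max_probability(x, models):
    """Second way: run all k models and output the highest probability."""
    return max(model(x) for model in models.values())

reference = [[0.70, 0.20, 0.10], [0.10, 0.20, 0.70]]
reference_types = ["B1", "B2"]
# Constant dummy models, one per intestinal type.
models = {"B1": lambda x: 0.2, "B2": lambda x: 0.9}

x = [0.65, 0.20, 0.15]
gut_type, p_first = predict_by_nearest_type(
    x, reference, reference_types, models, overlap_similarity)
p_second = predict_by_max_probability(x, models)
```

With these dummy models the first way returns 0.2 for type B1 while the second returns 0.9, which shows that the two routes can disagree for the same input.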
It should be noted that the invention covers both an intestinal microecological differential bacteria classification processing method and an intestinal health assessment method. In practical applications the two can be used in combination or applied separately as needed; the invention does not limit the specific manner of application, and both methods fall within the protection scope of the invention.
In order to more clearly describe the technical solution of the present invention, the present invention provides the following specific detailed examples.
Example one
The embodiment relates to an evaluation method for identifying intestinal health based on human intestinal flora conditions, which comprises the following steps:
step S1, preparing the sample set A and preprocessing the samples: unqualified samples are removed, and samples with a non-health history and an antibiotic intake history are filtered out. Then, the flora features whose relative abundance is less than 0.01% in more than 90% of the samples are screened out.
Step S2, according to the formula
Figure BDA0003150324910000111
And
Figure BDA0003150324910000112
calculating the similarity between every two samples in the sample set A, placing samples with higher similarity in one group, and classifying the samples into 5 groups of intestinal types, namely {B_1, B_2, B_3, B_4, B_5}.
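Each intestinal type data set B_i is first converted to relative abundance according to the formula
$$\text{relative abundance}_{ij} = \frac{R_{ij}}{\sum_{l=1}^{m} R_{il}}$$
where R_ij is the absolute abundance of the j-th flora in the i-th sample and m is the number of features.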
In step S3, the samples of the 5 intestinal types are divided into non-healthy groups DG: {D_1, D_2, D_3, D_4, D_5} and control groups CG: {C_1, C_2, C_3, C_4, C_5}, and the relative abundance differences between the two groups of samples are calculated and ranked. The differential flora of the 5 intestinal type data sets are obtained and denoted DF: {F_1, F_2, F_3, F_4, F_5}.
Step S4, only the differential flora features DF: {F_1, F_2, F_3, F_4, F_5} are retained for each intestinal type data set. Then sm samples are selected by random sampling and k decision trees are constructed from them to form a decision tree model. The 5 intestinal types thus yield 5 decision tree models, denoted dcTree: {tree1, tree2, tree3, tree4, tree5}.
Step S5, for a new input sample to be tested, the absolute abundance is first converted into relative abundance according to the formula
$$\text{relative abundance}_{ij} = \frac{R_{ij}}{\sum_{l=1}^{m} R_{il}}$$
where R_ij is the absolute abundance of the j-th flora in the i-th sample and m is the number of features.
The sample is then compared one by one with the samples in the 5 intestinal type sets and the similarity is calculated to find the sample s most similar to the input sample. If s belongs to intestinal type B_i, the new sample is classified as intestinal type B_i, and the decision tree model corresponding to B_i is used to predict the input sample.
The procedure for the specific test data is as follows:
1. Sample description and preprocessing; see Fig. 1 for the sample processing.
A sample set A containing both non-healthy and control classes is given. The features of each sample are the relative abundance values of the flora in the intestinal environment obtained after 16S sequencing. Each sample is therefore equivalent to a row vector in a matrix, composed of: number + sampling address + non-healthy type + antibiotic type + OTU1 + OTU2 + ... + OTULen, where OTU is the relative abundance of a certain flora in the sample. Each sample has 1000 OTU features.
If the relative abundance of a flora feature OTUj is less than 0.01% in 90% of the samples in A, the feature OTUj is filtered out; this removes a large number of features whose relative abundance is close to 0. After filtering, 300 OTU features remain, as in the following table:
index group otu1 otu5 otu6 otu9
sample1 Control 0.012576539 0 0.000103199 0
sample2 Control 0 0 0.0000903276 0
sample3 Control 0.007804594 0 0 0
sample4 Control 0.001026461 0 0.000306605 0
sample5 Control 0.000414224 0 0 0
sample6 Case 0 0 0.000216911 0
sample7 Case 0.001686275 0 0.002457516 0
sample8 Case 0 0 0.0000467978 0.000103107
sample9 Control 0.000212123 0 0 0.00007152154
sample10 Control 0.002755471 0 0 0
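The low-abundance filter described above could be sketched as follows. This is a minimal sketch under stated assumptions: abundances are stored as fractions (so 0.01% is 0.0001), a feature is kept when it exceeds the cutoff in at least 10% of the samples, and the function name, parameter names, and the small data set are hypothetical rather than taken from the patent.

```python
def filter_low_abundance(matrix, names, abundance_cutoff=0.0001, min_frequency=0.10):
    """Drop features that exceed `abundance_cutoff` in fewer than
    `min_frequency` of the samples.

    matrix: list of samples, each a list of relative abundances (fractions).
    names:  feature names, one per column.
    """
    n = len(matrix)
    kept = []
    for j in range(len(names)):
        # s = number of samples in which feature j is above the cutoff
        s = sum(1 for row in matrix if row[j] > abundance_cutoff)
        if s / n >= min_frequency:
            kept.append(j)
    filtered = [[row[j] for j in kept] for row in matrix]
    return filtered, [names[j] for j in kept]


# Hypothetical data: otu_b is always zero, otu_c never exceeds 0.01%.
names = ["otu_a", "otu_b", "otu_c"]
matrix = [
    [0.0120, 0.0, 0.00005],
    [0.0000, 0.0, 0.00002],
    [0.0078, 0.0, 0.00000],
    [0.0010, 0.0, 0.00009],
]
filtered, kept_names = filter_low_abundance(matrix, names)
print(kept_names)
```

On the full data set the same rule would reduce the 1000 OTU columns to the retained subset; the table above shows only an excerpt of that result.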
2. Grouping samples: the similarity of every two samples is calculated, and samples whose similarity values are close are placed together to form an intestinal type. All samples are divided into 5 intestinal types, denoted {B1, B2, ..., B5}, and each intestinal type contains a certain proportion of non-healthy and control samples. The similarity values are as follows:
Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7
Sample1 0 0.182662 0.459110 0.487837 0.648875 0.608471 0.679105
Sample2 0.182662 0 0.376313 0.390729 0.561831 0.520588 0.593454
Sample3 0.459110 0.376313 0 0.328543 0.475267 0.458186 0.512884
Sample4 0.487837 0.390729 0.328543 0 0.210214 0.202143 0.315686
Sample5 0.648875 0.561831 0.475267 0.210214 0 0.098052 0.239273
Sample6 0.608471 0.520588 0.458186 0.202143 0.098052 0 0.252005
Sample7 0.679105 0.593454 0.512884 0.315686 0.239273 0.252005 0
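To illustrate how such a pairwise table can be turned into groups, the sketch below clusters the seven samples shown above. The text does not name a clustering algorithm, so average-linkage agglomerative clustering is used here purely as an assumed stand-in. Because the table has zeros on its diagonal, its values are treated as distances (smaller means more alike). The number of groups is set to 3 only because the excerpt has seven samples; the embodiment uses 5 groups on the full sample set.

```python
def average_linkage(dist, k):
    """Merge the two closest clusters (by mean pairwise distance) until k remain."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                d = sum(dist[i][j] for i, j in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]  # b > a, so index a is unaffected
    return clusters


# Pairwise values copied from the table above (Sample1 .. Sample7).
dist = [
    [0, 0.182662, 0.459110, 0.487837, 0.648875, 0.608471, 0.679105],
    [0.182662, 0, 0.376313, 0.390729, 0.561831, 0.520588, 0.593454],
    [0.459110, 0.376313, 0, 0.328543, 0.475267, 0.458186, 0.512884],
    [0.487837, 0.390729, 0.328543, 0, 0.210214, 0.202143, 0.315686],
    [0.648875, 0.561831, 0.475267, 0.210214, 0, 0.098052, 0.239273],
    [0.608471, 0.520588, 0.458186, 0.202143, 0.098052, 0, 0.252005],
    [0.679105, 0.593454, 0.512884, 0.315686, 0.239273, 0.252005, 0],
]
groups = average_linkage(dist, k=3)
for g in groups:
    print(["Sample%d" % (i + 1) for i in g])
```

With these values, Sample1 and Sample2 end up together, Sample3 stays alone, and Sample4 to Sample7 form the third group. A different linkage rule or a different number of groups could give a different partition.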
3. Screening differential bacteria: for an intestinal-type data set Bj, define the parameter dR_{DC,j} for evaluating the relative abundance difference between the non-healthy group DG and the control group CG, with the formula: dR_{DC,j} = log2(AR_{D,j} / AR_{C,j}), where AR_{D,j} and AR_{C,j} represent the sum of the relative abundance of feature j in the non-healthy samples and in the control samples, respectively. The results are sorted by absolute value and the differential bacteria are selected. 20, 18, 19, 18 and 19 differential bacteria are selected from the 5 intestinal types, respectively.
B1      B2      B3      B4      B5
Otu15   Otu49   Otu48   Otu12   Otu22
Otu23   Otu61   Otu60   Otu15   Otu26
Otu40   Otu76   Otu75   Otu75   Otu34
Otu41   Otu89   Otu88   Otu67   Otu55
Otu48   Otu90   Otu89   Otu35   Otu89
Otu60   Otu101  Otu100  Otu99   Otu27
Otu75   Otu105  Otu105  Otu100  Otu101
Otu88   Otu47   Otu49   Otu46   Otu47
Otu89   Otu61   Otu61   Otu25   Otu69
Otu100  Otu76   Otu76   Otu71   Otu37
Otu105  Otu83   Otu83   Otu81   Otu82
Otu109  Otu80   Otu82   Otu83   Otu81
Otu111  Otu105  Otu101  Otu101  Otu109
Otu120  Otu107  Otu102  Otu103  Otu36
Otu124  Otu37   Otu47   Otu7    Otu43
Otu136  Otu39   Otu64   Otu17   Otu75
Otu142  Otu42   Otu77   Otu77   Otu33
Otu150  Otu69   Otu85   Otu93   Otu16
Otu152  -       Otu15   -       Otu24
Otu163  -       -       -       -
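The ranking by dR_{DC,j} = log2(AR_{D,j} / AR_{C,j}) could be sketched as follows. The text does not say how a zero sum in either group is handled, so the small pseudocount `pseudo` is an assumption added to avoid division by zero; the function name, feature names and data are hypothetical.

```python
import math


def log2_ratio_ranking(case_rows, control_rows, names, top_n, pseudo=1e-9):
    """Rank features by |log2(sum in non-healthy group / sum in control group)|."""
    scores = []
    for j, name in enumerate(names):
        ar_d = sum(row[j] for row in case_rows)      # AR_{D,j}
        ar_c = sum(row[j] for row in control_rows)   # AR_{C,j}
        # pseudo is an assumed guard against zero sums, not part of the source
        d_r = math.log2((ar_d + pseudo) / (ar_c + pseudo))
        scores.append((name, d_r))
    scores.sort(key=lambda item: abs(item[1]), reverse=True)
    return scores[:top_n]


names = ["otu_a", "otu_b", "otu_c"]
case_rows = [[0.04, 0.01, 0.01], [0.04, 0.01, 0.01]]      # non-healthy samples
control_rows = [[0.01, 0.01, 0.08], [0.01, 0.01, 0.08]]   # control samples
top = log2_ratio_ranking(case_rows, control_rows, names, top_n=2)
print(top)
```

Here `otu_c` is depleted about eightfold in the non-healthy group (score near -3) and `otu_a` is enriched about fourfold (score near 2), so both rank ahead of the unchanged `otu_b`. Sorting by absolute value keeps both enriched and depleted flora as candidates.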
4. Constructing the decision tree model.
Samples are selected by random sampling from the intestinal-type data set Bj. For example, when j is 1, the samples of intestinal type B1 are used as the training and validation sets, with the 20 differential flora as features. The corresponding table is as follows:
Sample No. Otu15 ... Otu109 ... Otu163 Group
... ... ... ... ... ... ...
Sample15 Valj1 ... Valjj ... Valjp Case
... ... ... ... ... ... ...
Sample1000 Valsi1 ... Valsij ... Valsip Control
First, an initial decision tree model is constructed; the model is then trained with the data in the table, and the trained decision tree model can be used to judge whether a sample belongs to the unhealthy group or the healthy group. Because there are 5 intestinal-type data sets, 5 decision tree models are obtained after training, denoted tree1, tree2, tree3, tree4 and tree5.
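For one intestinal type, building several trees from randomly drawn samples might look like the sketch below. It is not the patented implementation: the splitting criterion (Gini impurity), the depth limit, sampling without replacement, and averaging the trees' outputs are all assumptions, and the per-tree selection of t features is omitted for brevity. Label 1 stands for Case and 0 for Control; the data are hypothetical.

```python
import random


def gini(labels):
    p = sum(labels) / len(labels)  # fraction of Case samples
    return 2.0 * p * (1.0 - p)


def build_tree(rows, labels, depth=0, max_depth=3):
    if len(set(labels)) == 1 or depth == max_depth:
        return sum(labels) / len(labels)  # leaf: estimated probability of Case
    best = None
    for j in range(len(rows[0])):
        values = sorted({row[j] for row in rows})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0  # candidate split between neighbouring values
            left = [y for row, y in zip(rows, labels) if row[j] <= thr]
            right = [y for row, y in zip(rows, labels) if row[j] > thr]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, j, thr)
    if best is None:  # every feature is constant, so no split is possible
        return sum(labels) / len(labels)
    _, j, thr = best
    li = [i for i, row in enumerate(rows) if row[j] <= thr]
    ri = [i for i, row in enumerate(rows) if row[j] > thr]
    return {
        "feature": j,
        "threshold": thr,
        "left": build_tree([rows[i] for i in li], [labels[i] for i in li], depth + 1, max_depth),
        "right": build_tree([rows[i] for i in ri], [labels[i] for i in ri], depth + 1, max_depth),
    }


def predict_tree(tree, row):
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree


def train_forest(rows, labels, n_trees, sample_size, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = rng.sample(range(len(rows)), sample_size)  # sm samples, no replacement
        forest.append(build_tree([rows[i] for i in idx], [labels[i] for i in idx]))
    return forest


def predict_forest(forest, row):
    return sum(predict_tree(t, row) for t in forest) / len(forest)


# Hypothetical data: feature 0 separates the groups, feature 1 is noise.
rows = [[0.020, 0.3], [0.022, 0.1], [0.025, 0.4], [0.030, 0.2],
        [0.001, 0.2], [0.0015, 0.4], [0.002, 0.1], [0.003, 0.3]]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
forest = train_forest(rows, labels, n_trees=5, sample_size=6)
print(predict_forest(forest, [0.021, 0.5]), predict_forest(forest, [0.002, 0.5]))
```

Repeating `train_forest` once per intestinal type, each time with that type's own differential features, would give the five models tree1 to tree5.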
5. Non-healthy calculation.
A new input sample to be tested contains 300 OTU features.
Its absolute abundance is first converted into relative abundance according to the formula
Figure BDA0003150324910000141
The sample is then compared one by one with the samples in the 5 intestinal-type sets and the similarity is calculated, yielding the sample s with the highest similarity to the input sample. If s belongs to intestinal type Bi, the new sample is assigned to intestinal type Bi and is predicted with the decision tree model corresponding to Bi. Here it is assumed that s belongs to B3, so the new input sample is judged with the tree3 decision tree model, which finally determines whether the input belongs to the "unhealthy" or the "healthy" type.
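The routing of a new sample to the model of its nearest intestinal type can be sketched as follows. The similarity formula is not reproduced in this text, so Bray-Curtis dissimilarity is used here only as an assumed stand-in; the reference samples, the type indices and the two toy models are hypothetical.

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity: 0 for identical profiles, 1 for disjoint ones."""
    num = sum(abs(a - b) for a, b in zip(x, y))
    den = sum(a + b for a, b in zip(x, y))
    return num / den if den else 0.0


def predict_by_nearest_type(new_sample, references, models):
    """references: list of (relative_abundance_vector, type_index) pairs.
    models: dict mapping type_index to a callable that returns a label."""
    _, type_index = min(references, key=lambda ref: bray_curtis(new_sample, ref[0]))
    return type_index, models[type_index](new_sample)


references = [([0.70, 0.20, 0.10], 1), ([0.10, 0.20, 0.70], 3)]
models = {
    1: lambda s: "healthy",                                # stand-in for tree1
    3: lambda s: "unhealthy" if s[2] > 0.5 else "healthy",  # stand-in for tree3
}
new_sample = [0.15, 0.25, 0.60]  # already converted to relative abundance
result = predict_by_nearest_type(new_sample, references, models)
print(result)
```

The new sample lies closest to the reference of type 3, so only the type 3 model is consulted, mirroring the tree3 case described above.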
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents or improvements made within the technical scope of the present invention should be included in the scope of the present invention.

Claims (10)

1. A method for classifying intestinal microecological differential bacteria is characterized by comprising the following steps:
step S10, collecting a sample set A, wherein the sample set A comprises n samples at least consisting of a non-healthy group sample DG and a control group sample CG, and the sample characteristics of each sample consist of the basic information of the sample and the absolute abundance of intestinal flora obtained after 16S sequencing;
step S11, converting the digitized features of each sample from absolute abundance to relative abundance, and screening out the absolute abundance features of which the occurrence frequency is less than a preset value and the relative abundance is close to zero;
step S12, dividing all samples in the sample set A into k groups of intestinal types according to the similarity of absolute abundance, and recording the classified intestinal-type set as AB: {B1, B2, ..., Bk}; separating the non-healthy group samples DG from the control group samples CG in the intestinal-type set AB, and recording the non-healthy group sample set as DG: {D1, D2, ..., Dk} and the control group sample set as CG: {C1, C2, ..., Ck};
step S13, calculating and sorting the relative abundance difference between the non-healthy group sample set DG and the control group sample set CG to obtain the difference features corresponding to the k groups of intestinal types, DF: {F1, F2, ..., Fi, ..., Fk}.
2. The method for classifying intestinal micro-ecological difference bacteria according to claim 1, wherein in step S10, the features of the n samples are arranged into a matrix, each row of which consists of the features of one sample, expressed as: number + age + sampling address + non-healthy type + antibiotic type + OTU1 + OTU2 + ... + OTUm, where OTU represents absolute abundance and m represents the number of OTUs contained in each sample feature.
3. The method for classifying intestinal micro-ecological difference bacteria according to claim 2, wherein in step S11, the absolute abundance is converted into the relative abundance according to the following steps:
letting the abundance of a certain flora in the sample be Rij, the relative abundance of the jth feature in the ith sample is:
Figure FDA0003150324900000011
4. The method for classifying intestinal micro-ecological difference bacteria according to claim 3, wherein in step S11, if the number of samples with an absolute abundance characteristic of more than 0.01% among the n samples in the matrix is S, the frequency of occurrence of the flora is the ratio S/n.
5. The method for classifying intestinal micro-ecological difference bacteria according to claim 4, wherein in step S11, for each intestinal-type data set Bi in the intestinal-type set AB, low-abundance feature interference data are screened out, wherein a low-abundance feature is defined as: a frequency of occurrence of less than 10% and a relative abundance value of less than 0.01%.
6. The method for classifying intestinal micro-ecological difference bacteria according to claim 1, wherein in step S12, given a sample x: [x1, x2, ..., xi, ..., xm] and a sample y: [y1, y2, ..., yi, ..., ym], the similarity sim of sample x and sample y is calculated using the following formulas:
Figure FDA0003150324900000021
Figure FDA0003150324900000022
7. A method for assessing gut health, comprising:
step S20, constructing k machine learning models according to k groups of intestinal type difference characteristics obtained by the intestinal microecological difference bacterium classification processing method of claim 1;
step S21, inputting a group of sample data with known absolute abundance, converting the absolute abundance of the sample data into relative abundance, and predicting the unhealthy probability of the sample data by using the k machine learning models obtained in the step S20.
8. The method for assessing intestinal health of claim 7, wherein the step S20 of constructing k machine learning models comprises:
step S200, inputting the k groups of intestinal-type difference features DF: {F1, F2, ..., Fi, ..., Fk}, and letting the difference feature of the i-th group of intestinal types be Fi, the number of features remaining after screening be p, and the number of samples be si;
step S201, for a data set containing si samples and p features:
selecting sm samples from the si samples by random sampling, selecting t features by using the difference feature screening method, building a decision tree from the t features for the selected samples, and repeating the sampling step k times to generate k decision tree models dcTree: {tree1, tree2, ..., treek}.
9. The method for assessing intestinal health of claim 7, wherein the step S21 is performed by predicting the probability of non-health including:
step S210, converting the absolute abundance of the flora in the input sample data into relative abundance, and setting the input sample as x: [R1, R2, ..., Ri, ..., Rm];
Step S211, comparing the input sample x with the samples in the sample set A one by one, and finding out a sample S with the highest similarity;
step S212, determining the intestinal type Bi to which the sample S belongs;
step S213, predicting the input sample x by using the decision tree corresponding to the intestinal type Bi.
10. The method for assessing intestinal health of claim 7, wherein the step S21 is performed by predicting the probability of non-health including:
predicting the input sample x with each of the k machine learning models to obtain k unhealthy probabilities, and determining and outputting the highest unhealthy probability.
CN202110764854.7A 2021-07-06 2021-07-06 Intestinal microecological differential bacteria classification processing method and intestinal health assessment method Active CN113486954B (en)


Publications (2)

Publication Number Publication Date
CN113486954A true CN113486954A (en) 2021-10-08
CN113486954B CN113486954B (en) 2023-04-07

Family

ID=77941519


Country Status (1)

Country Link
CN (1) CN113486954B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016008954A1 (en) * 2014-07-15 2016-01-21 Institut National De La Recherche Agronomique Gut bacterial species in hepatic diseases
US20190256583A1 (en) * 2016-08-01 2019-08-22 Scaled Microbiomics, Llc Systems and methods for altering microbiome to reduce disease risk and manifestations of disease
CN110730665A (en) * 2017-04-07 2020-01-24 儿童医院医疗中心 Treatment of inflammatory bowel disease with 2' -fucosyllactose compounds
CN111315898A (en) * 2017-11-06 2020-06-19 普梭梅根公司 Control process for a microorganism-related characterization process
US20200332344A1 (en) * 2017-05-12 2020-10-22 The Regents Of The University Of California Treating and detecting dysbiosis
CN112888447A (en) * 2018-08-17 2021-06-01 韦丹塔生物科学股份有限公司 Methods for reducing dysbacteriosis and restoring microbial flora
CN112992351A (en) * 2021-03-09 2021-06-18 广西爱生生命科技有限公司 Feature expression method and evaluation method for human intestinal health state


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KRASNY L等: ""Identification of bacteria using mass spectrometry techniques"", 《INT J MASS SPECTROM》 *
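Zhao Min et al.: "Study on prediction models of intestinal flora characteristics in different age groups of a healthy population", 《Academic Journal of Chinese PLA Medical School》 *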



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant