CN112151117A - Dynamic observation device based on time series metagenome data and detection method thereof - Google Patents

Dynamic observation device based on time series metagenome data and detection method thereof

Info

Publication number
CN112151117A
Authority
CN
China
Prior art keywords
species
time
sequence
translation
correlation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010801019.1A
Other languages
Chinese (zh)
Other versions
CN112151117B (en)
Inventor
邓煜盛
韩丽娟
周勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kmbgi Gene Tech Co ltd
Original Assignee
Kmbgi Gene Tech Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kmbgi Gene Tech Co ltd filed Critical Kmbgi Gene Tech Co ltd
Priority to CN202010801019.1A priority Critical patent/CN112151117B/en
Publication of CN112151117A publication Critical patent/CN112151117A/en
Application granted granted Critical
Publication of CN112151117B publication Critical patent/CN112151117B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B45/00 ICT specially adapted for bioinformatics-related data visualisation, e.g. displaying of maps or networks
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Public Health (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Software Systems (AREA)
  • Bioethics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Primary Health Care (AREA)
  • Pathology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a dynamic observation device based on time series metagenome data and a detection method thereof. After preprocessing, the similarity, correlation and translation time of the strains are calculated to obtain the flora interaction relationship. The flora is then clustered and reduced in dimension to obtain cluster categories and low-dimensional coordinates, and a flora interaction network diagram is drawn by combining these with the interaction relationships in order to identify key strains. Intestinal flora interaction network detection is thereby realized, providing help and reference for personalized intervention on the intestinal flora, improvement of human health and disease intervention.

Description

Dynamic observation device based on time series metagenome data and detection method thereof
Technical Field
The invention relates to the technical field of biological detection, in particular to a dynamic observation device based on time series metagenome data and a detection method thereof.
Background
With advances in second-generation sequencing technology and falling costs, metagenomic research has developed rapidly. A growing number of studies have found that the intestinal flora is associated with a variety of human health states, including digestive diseases (colorectal cancer, irritable bowel syndrome, etc.), metabolic diseases (obesity, type II diabetes, etc.) and psychiatric diseases (depression, etc.). Although evidence that the intestinal flora is related to human health keeps accumulating, many factors affect the intestinal flora, so research results are poorly reproducible and the level of evidence is low. To improve this situation, more and more longitudinal studies of the intestinal flora are being designed.
The intestinal flora acts as an ecological whole, and an effect is rarely exerted directly by a single bacterium. Exploring the flora interaction relationship and identifying key species in the interaction network is therefore particularly important. However, existing methods of constructing interaction networks still have shortcomings: 1. Most studies consider only the relationship among the flora at a single time point. Even in longitudinal studies, the interaction network is built from the abundance information of a single time point and does not make full use of the information from multiple time points in the time series. Because flora abundance fluctuates readily, relying on a single time point easily produces false correlations. 2. Although a small proportion of studies use information from multiple time points, they do not consider the similarity of the sequences after translation (time shifting) or the order in which the sequences change. Curve correlation mainly reflects the similarity of the curves; the similarity of the sequences after translation also needs to be considered. 3. Most existing visualizations of interaction networks aim only to show the interaction relationships and do not consider the relationship of the flora as a whole or the distances between strains; that is, the distance between two strains in the graph is chosen for convenience of display and is not the real distance between them.
Disclosure of Invention
In order to overcome the defects of the prior art, the first purpose of the invention is to provide a dynamic observation device based on time series metagenome data, which can observe the dynamic change of an intestinal flora interaction network. The second purpose of the invention is to provide a detection method of the dynamic observation device based on time series metagenome data, which addresses the false interactions that may arise in the prior art and further shows the order in which strain changes occur.
One of the purposes of the invention is realized by adopting the following technical scheme:
a dynamic observation device based on time series metagenome data comprises:
an acquisition device for collecting human intestinal stool samples at different time points; a sequencing device for carrying out gene extraction and sequencing on the collected intestinal flora samples to obtain intestinal flora information; a preprocessing analysis device for analyzing the relative abundance information of the intestinal flora to obtain the flora to be analyzed; a data analysis device for carrying out strain similarity, strain correlation, strain interaction relationship, flora clustering and dimensionality reduction, flora interaction network construction and key strain identification on the flora to be analyzed; a storage device; and a display.
The acquisition device, the sequencing device, the preprocessing analysis device and the data analysis device are connected in sequence from front to back; the storage device is connected with the acquisition device, the preprocessing analysis device and the data analysis device respectively; the display is connected with the data analysis device. The display is used for displaying information processed in the data analysis device and for displaying a visualized user interface.
As samples from additional time points are added, the flora interaction network changes, and the user can dynamically observe the changes and compare them with previous results.
The second purpose of the invention is realized by adopting the following technical scheme:
the detection method of the dynamic observation device based on the time series metagenome data comprises the following steps:
1) sample acquisition: acquiring human intestinal stool samples of the same individual at different time points and the corresponding individual basic information; carrying out gene extraction and sequencing on the intestinal flora of each sample, obtaining the intestinal flora information corresponding to each sample by reference genome alignment and annotation, and obtaining the relative abundance information of the intestinal flora;
2) preprocessing intestinal flora information: according to the relative abundance information of the intestinal flora obtained in step 1), carrying out species filtering to remove low-occurrence and low-abundance species; after filtering, carrying out normalization, converting the data into wide format, and then standardizing each species; finally, filtering out low-fluctuation species to obtain the candidate flora to be analyzed;
3) calculating strain similarity: grouping the candidate flora to be analyzed obtained in step 2) in pairs, obtaining the time series of the two strains in each pair, and stretching the two sequences by Dynamic Time Warping (DTW) analysis to obtain new stretched sequences; then carrying out Pearson correlation analysis on the new sequences to obtain the correlation coefficient and the corresponding P value of the two new sequences;
4) calculating strain correlation and translation time: taking one strain sequence X of the two new sequences obtained in step 3) as the reference, translating the other strain sequence Y from negative to positive (from left to right) by one unit time interval at a time, and filling the positions left empty by the translation with the tail value; carrying out correlation analysis with the Temporal Correlation Coefficient (CORT) to obtain a correlation coefficient after each translation; then taking the Y sequence as the reference and translating the X sequence to calculate the correlation coefficients in the same way; selecting the coefficient with the largest absolute value among all the correlation coefficients as the degree of correlation, wherein the sign of that coefficient is the direction of correlation, the corresponding translation position is the translation time of the two sequences, and the sign of the translation time indicates the order of the two sequences;
5) calculating the flora interaction relationship: forming a flora interaction relationship matrix from the correlation coefficient of the two strains obtained in step 3) and the correlation coefficient and translation time between the two strains obtained in step 4); combining the P value obtained in step 3) and the correlation coefficient obtained in step 4), setting threshold values for the P value and the correlation coefficient, and screening out the similar and correlated strain pairs that meet the thresholds to obtain the final flora interaction relationship network;
6) flora clustering and dimensionality reduction: calculating the distance between every two strains using DTW (dynamic time warping) based on the Euclidean distance to form a strain distance matrix; clustering the flora from the strain distance matrix by hierarchical clustering, and selecting the number of clusters according to the distances; reducing the dimension of the strain distance matrix by principal coordinates analysis to obtain a dimensionality reduction plot and the coordinates of each strain after dimensionality reduction;
7) visualizing the flora interaction network and identifying key strains: drawing the strains on a two-dimensional graph at the positions obtained after dimensionality reduction in step 6), labeling the strains according to the clustering result, and finally drawing the interaction relationship network among the strains according to the strength of association, the direction of association and the order of change; and identifying key strains according to the interaction relationship network.
Further, in step 1), the number of time points is at least 4. The metagenomic sequencing platform and procedure can vary, but relative abundance information for the species must be obtained.
Still further, in step 2), the relative abundances of all species at a given taxonomic level sum to 100% in each sample. Low-occurrence species are strains appearing in 5% of samples, and low-abundance species are those whose 90th-percentile relative abundance across all samples is less than 1. After filtering, normalization is carried out and the relative abundance matrix of the intestinal flora is converted into wide format, in which each row represents one species, each column is a time point, and the columns are ordered chronologically. Each species is then standardized, and low-fluctuation species, i.e. species with a standard deviation of 0, are removed.
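The preprocessing described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the implementation used by the invention: the function names, the parameter names (`max_rare_fraction`, `min_p90`), the rule that a species is "present" when its abundance is above 0, the reading of the 5% rule as "present in at most 5% of samples", the linear-interpolation percentile and the use of the population standard deviation are all choices made for the example.

```python
from statistics import mean, pstdev


def percentile(values, q):
    """Percentile (q in 0..100) with linear interpolation between ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    low = int(pos)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (ordered[high] - ordered[low]) * (pos - low)


def preprocess(abundance, max_rare_fraction=0.05, min_p90=1.0):
    """abundance maps species -> relative abundances (%), one per time point, in time order.

    Returns the standardized wide-format table (species -> z-scores per time point).
    """
    n_samples = len(next(iter(abundance.values())))

    # 1) Drop low-occurrence and low-abundance species (thresholds are assumptions).
    kept = {}
    for species, values in abundance.items():
        present = sum(1 for v in values if v > 0) / n_samples
        if present <= max_rare_fraction:
            continue
        if percentile(values, 90) < min_p90:
            continue
        kept[species] = values

    # 2) Re-normalize every sample (column) so the kept species sum to 100.
    totals = [sum(values[t] for values in kept.values()) for t in range(n_samples)]
    normalized = {
        species: [100.0 * v / totals[t] if totals[t] else 0.0
                  for t, v in enumerate(values)]
        for species, values in kept.items()
    }

    # 3) Standardize each species (row) and drop species with standard deviation 0.
    standardized = {}
    for species, values in normalized.items():
        spread = pstdev(values)
        if spread == 0:
            continue
        center = mean(values)
        standardized[species] = [(v - center) / spread for v in values]
    return standardized
```

Each row of the returned table is one species and each position in the row is one time point, which corresponds to the wide format described above.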
Further, in step 3), the time series of the two strains to be analyzed are selected, and DTW (dynamic time warping) is used to obtain the distance and the warping path between the two strain sequences, so that the degree and direction of sequence similarity can be displayed visually. The two strain sequences are stretched along the path, and abundance values are filled in at the stretched positions to obtain the new stretched sequences. The correlation coefficient of the new sequences is the similarity coefficient; its magnitude gives the degree of similarity and its sign the direction.
Further, in step 4), because the correlation is asymmetric, one of the two sequences must serve as the reference while the other is translated. Each translation yields one correlation coefficient, and the coefficient with the largest absolute value among all of them is taken as the degree of correlation between the two sequences. The number of translations = (number of time points in the translated sequence - 3) × 2 + 1, so that the translated sequence always retains 3 original time points. An untranslated sequence is at position 0; leftward translation takes negative values and rightward translation positive values. The translation position of the selected coefficient is the translation time of the two sequences. A translation time of 0 means the reference sequence and the translated sequence have no order; a negative translation time means the reference sequence changes first and the translated sequence later; a positive translation time means the translated sequence changes first and the reference sequence later.
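The counting rule and the sign convention for the translation time can be written out as a short sketch. The function names `shift_positions` and `describe_order` and the `keep` parameter are hypothetical and introduced only for illustration.

```python
def shift_positions(n_points, keep=3):
    """All translation positions for a sequence with n_points time points.

    The translated sequence must retain `keep` original time points, so the
    largest shift in either direction is n_points - keep.
    """
    max_shift = n_points - keep
    if max_shift < 0:
        raise ValueError("the sequence needs at least `keep` time points")
    return list(range(-max_shift, max_shift + 1))


def describe_order(translation_time):
    """Interpret the sign of the translation time as described in step 4)."""
    if translation_time < 0:
        return "reference sequence first, translated sequence later"
    if translation_time > 0:
        return "translated sequence first, reference sequence later"
    return "no order between the two sequences"
```

For 4 time points this gives the positions -1, 0 and 1, i.e. (4 - 3) × 2 + 1 = 3 translations.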
Further, in step 5), the similar and correlated strain pairs have a P value of less than 0.05 and a correlation coefficient of more than 0.7.
Still further, in step 7), category labels and interaction relationships are added to the dimensionality reduction plot: the thickness of the line connecting two strains represents the degree of correlation, and different line types represent positive and negative correlation; a line with an arrow indicates the order of the two strains, and a line without an arrow indicates that there is no translation. Core species are connected according to the interaction relationship network, taking the number of connecting edges into account. Potential key species in the species interaction network are identified from the species that change earliest and from the distances between species and between categories.
Compared with the prior art, the invention has the beneficial effects that:
According to the invention, human fecal samples of the same individual at different times are collected, gene extraction is carried out on each fecal sample, and intestinal flora information is obtained by sequencing. After preprocessing, the similarity, correlation and translation time of the strains are calculated to obtain the flora interaction relationship. The flora is then clustered and reduced in dimension to obtain cluster categories and low-dimensional coordinates, and a flora interaction network diagram is drawn by combining these with the interaction relationships in order to identify key strains. Intestinal flora interaction network detection is thereby realized, providing help and reference for personalized intervention on the intestinal flora, improvement of human health and disease intervention.
Drawings
FIG. 1 is a flow chart of a detection method of a dynamic observation device based on human intestinal tract time series metagenome data according to the present invention;
FIG. 2 is a schematic diagram of the strain similarity calculation procedure in example 1;
FIG. 3 is a schematic diagram of the calculation steps of the species correlation and the shift time in example 1;
FIG. 4 is a graph of the abundance of two species of example 1 over time and the abundance after DTW scaling;
FIG. 5 is a plot of the species clustering of example 1;
FIG. 6 is a strain dimensionality reduction graph of example 1;
FIG. 7 is a diagram of the flora interaction network of example 1.
Detailed Description
The present invention will be further described with reference to the accompanying drawings and the detailed description, and it should be noted that any combination of the embodiments or technical features described below can be used to form a new embodiment without conflict.
Example 1
In this embodiment, the dynamic observation device based on the human intestinal tract time series metagenome data may be a personal computer, or may be a terminal device such as a smart phone, a tablet, or a portable computer. The device at least comprises: a processor, a communication bus, a display, and a memory. The processor needs to have functions of sequencing (corresponding to a sequencing apparatus), pre-processing analysis (corresponding to a pre-processing analysis apparatus), and data analysis (corresponding to a data analysis apparatus).
Wherein the storage means comprises at least one type of readable storage medium including flash memory, hard disk, multi-media card, card-type memory, magnetic disk, optical disk, and the like. The memory may be an internal storage unit of the dynamic observation device in some embodiments. The memory may also be an external storage device of the dynamic observation apparatus in other embodiments. The storage device is used for storing a flora interaction relation detection program based on human intestinal tract time sequence metagenome data and various data, such as data of a user query database, sequencing sequence data, interaction network dynamic change and the like.
The processor may be, in some embodiments, a Central Processing Unit (CPU), controller, microcontroller, microprocessor or other data Processing chip that executes program code stored in memory or processes data.
The communication bus is used to enable connection communication between these components.
Optionally, the dynamic observation device may further include a user interface, the user interface may include a Display (Display), an input unit such as a Keyboard (Keyboard), and the optional user interface may further include a standard wired interface and a wireless interface. Wherein the display is used for displaying information processed in the data analysis device and for displaying a visualized user interface.
As shown in FIG. 1, the detection method of the dynamic observation device based on human intestinal tract time series metagenome data is described in this embodiment using intestinal stool samples collected from one individual at 4 consecutive time points, and includes the following steps:
Step 1) Obtaining the intestinal flora information of each sample and the corresponding human body basic information.
The intestinal flora information of a sample is the sequence information of the intestinal flora obtained by performing DNA extraction, library construction and sequencing on the fecal sample. Preferably, quality control and species annotation are carried out on the sequenced flora sequence information using the Biobakery analysis workflow. Quality control comprises filtering low-quality sequences and removing host contamination sequences to obtain high-quality sequence information. The high-quality sequences are then compared with a marker gene library for species annotation and abundance calculation, giving the relative abundance information of the intestinal flora of each sample. The corresponding human body basic information is obtained through questionnaire or detection.
Step 2) Preprocessing the intestinal flora information.
Genus-level abundance information was used for analysis in this example. A total of 39 genera were detected in the 4 stool samples from this individual. The relative abundances at genus level summed to 100 in each sample. Preferably, this example removes low-occurrence species that appear in only one sample, and removes low-abundance species whose 90th-percentile relative abundance is less than 1. Normalization is carried out after filtering. The relative abundance matrix of the intestinal flora is converted into wide format, in which each row represents one species, each column is a time point, and the columns are ordered chronologically. Next, each species is standardized. Low-fluctuation species, i.e. species with a standard deviation of 0, are removed. Finally, 12 candidate species to be analyzed were obtained, as shown in Table 1.
TABLE 1 (image in the original publication: Figure BDA0002627385960000081)
Step 3) Calculating strain similarity.
This embodiment takes g__Faecalibacterium and g__Veillonella as an example; their original fluctuations are shown in the left panel of FIG. 4. Using DTW, the distance between the two is calculated to be 1.59 and the path is [(0,0), (0,1), (1,2), (2,3), (3,3)]; the path values are the corresponding time points of the original data, where 0 represents T1, 1 represents T2, and so on. Filling in the original values according to these time points gives the new sequences: [-0.79238, -0.79238, 1.64897, -0.05004, -0.80655] for g__Faecalibacterium and [-0.27073, -1.08839, 1.63311, -0.27399, -0.27399] for g__Veillonella. The sequences after stretching are shown in the right panel of FIG. 4. A Pearson test on the new sequences gives a correlation coefficient of 0.93 and a P value of 0.02, indicating that the two are strongly and positively correlated. If Pearson correlation analysis is performed directly on the original sequences, the correlation coefficient is -0.36 and the P value is 0.64, i.e. the two are not correlated, as shown in Table 2.
TABLE 2 (images in the original publication: Figure BDA0002627385960000082, Figure BDA0002627385960000091)
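The calculation in step 3) can be retraced with a minimal DTW sketch. Several things here are assumptions rather than details given in the text: the local cost is the absolute difference between two values, the allowed steps are the three standard ones (diagonal, down, right), the P value is not computed, and the unstretched 4-point series are reconstructed from the stretched sequences and the path listed above.

```python
from math import sqrt


def dtw(x, y):
    """Return (distance, path) for two 1-D sequences using classic DTW."""
    n, m = len(x), len(y)
    inf = float("inf")
    acc = [[inf] * m for _ in range(n)]  # accumulated cost matrix
    for i in range(n):
        for j in range(m):
            cost = abs(x[i] - y[j])
            if i == 0 and j == 0:
                best = 0.0
            else:
                best = min(
                    acc[i - 1][j] if i > 0 else inf,
                    acc[i][j - 1] if j > 0 else inf,
                    acc[i - 1][j - 1] if i > 0 and j > 0 else inf,
                )
            acc[i][j] = cost + best
    # Backtrack from the last cell to (0, 0) along the cheapest predecessors.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((acc[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((acc[i - 1][j], i - 1, j))
        if j > 0:
            candidates.append((acc[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return acc[n - 1][m - 1], path


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / sqrt(var_a * var_b)


# Standardized series reconstructed from the stretched sequences and the path.
faecalibacterium = [-0.79238, 1.64897, -0.05004, -0.80655]
veillonella = [-0.27073, -1.08839, 1.63311, -0.27399]

distance, path = dtw(faecalibacterium, veillonella)
stretched_f = [faecalibacterium[i] for i, _ in path]
stretched_v = [veillonella[j] for _, j in path]
r_stretched = pearson(stretched_f, stretched_v)
r_original = pearson(faecalibacterium, veillonella)
```

Under these assumptions the sketch gives a distance of about 1.59, the path [(0,0), (0,1), (1,2), (2,3), (3,3)], a correlation of about 0.93 for the stretched sequences and about -0.36 for the original ones, in agreement with the values reported in this example.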
Step 4) Calculating strain correlation and translation time.
The calculation continues with g__Faecalibacterium and g__Veillonella as the example. First, g__Faecalibacterium is used as the reference sequence and g__Veillonella as the translated sequence; the positions left empty by the translation are filled with the tail value, and CORT is used to calculate the correlation between the two sequences. Since the sequences have 4 time points, there are 3 translation positions in total, i.e. -1, 0 and 1. Following the procedure shown in FIG. 3, the calculation gives {-1: 0.9691388029517358, 0: 0.4929214660039898, 1: 0.07678262148695543}. Exchanging the roles of the two sequences and calculating once more gives {-1: 0.10520904758151502, 0: 0.4929214660039898, 1: 0.9710343850436703}. Comparing the absolute values of all the correlation coefficients, the maximum is 0.9710343850436703, which corresponds to a translation time of 1, i.e. g__Faecalibacterium changes one time point earlier than g__Veillonella.
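The translation scan of step 4) can be sketched as below. The temporal correlation coefficient is taken in its usual first-difference form, CORT(x, y) = Σ Δx·Δy / (√Σ Δx² · √Σ Δy²). The padding rule is an interpretation of "filled with the tail value": positions emptied by a shift are filled with the nearest end value of the sequence. The 4-point input series are the ones implied by the stretched sequences and path of step 3), and all function names are hypothetical.

```python
from math import sqrt


def cort(x, y):
    """Temporal correlation coefficient based on first differences."""
    dx = [b - a for a, b in zip(x, x[1:])]
    dy = [b - a for a, b in zip(y, y[1:])]
    numerator = sum(p * q for p, q in zip(dx, dy))
    denominator = sqrt(sum(p * p for p in dx)) * sqrt(sum(q * q for q in dy))
    return numerator / denominator if denominator else 0.0


def shift(seq, k):
    """Shift left (k < 0) or right (k > 0), padding with the nearest end value."""
    if k < 0:
        return seq[-k:] + [seq[-1]] * (-k)
    if k > 0:
        return [seq[0]] * k + seq[:-k]
    return list(seq)


def scan(reference, moving, keep=3):
    """CORT between the reference and every allowed translation of `moving`."""
    max_shift = len(moving) - keep
    return {k: cort(reference, shift(moving, k))
            for k in range(-max_shift, max_shift + 1)}


faecalibacterium = [-0.79238, 1.64897, -0.05004, -0.80655]
veillonella = [-0.27073, -1.08839, 1.63311, -0.27399]

ref_f = scan(faecalibacterium, veillonella)   # Veillonella is translated
ref_v = scan(veillonella, faecalibacterium)   # Faecalibacterium is translated

# Pick the coefficient with the largest absolute value over both passes.
results = [("Veillonella", k, c) for k, c in ref_f.items()]
results += [("Faecalibacterium", k, c) for k, c in ref_v.items()]
translated, translation_time, coefficient = max(results, key=lambda r: abs(r[2]))
```

With this padding rule the magnitudes agree with the coefficients listed in this example (about 0.969, 0.493 and 0.077 in the first pass; about 0.105, 0.493 and 0.971 in the second). The sketch keeps the sign of each coefficient, so some values come out negative, for instance the untranslated one. The selected result is a translation time of 1 with g__Faecalibacterium as the translated sequence, i.e. g__Faecalibacterium changes first.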
Step 5) Calculating the flora interaction relationship.
The similarity, correlation and translation time of every two strains are calculated from their time series to form the flora interaction relationship matrix. Similar and correlated strain pairs are then screened according to the Pearson test P value and the CORT correlation coefficient to form the final strain interaction relationship network. Preferably, this example retains the strain pairs with a Pearson test P value of less than 0.05 and a CORT correlation coefficient of more than 0.7, as shown in Table 3.
TABLE 3 (images in the original publication: Figure BDA0002627385960000092, Figure BDA0002627385960000101)
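The screening in step 5) amounts to a filter over the pairwise results. The sketch below assumes each pair is stored as a dictionary with the hypothetical keys `p_value`, `cort` and `translation_time`, and assumes that the 0.7 threshold applies to the absolute value of the CORT coefficient. The first record uses the values reported for g__Faecalibacterium and g__Veillonella; the second record is invented purely to show a rejected pair.

```python
def screen_pairs(pairs, p_max=0.05, cort_min=0.7):
    """Keep pairs whose Pearson P value and CORT coefficient meet the thresholds."""
    return [
        pair for pair in pairs
        if pair["p_value"] < p_max and abs(pair["cort"]) > cort_min
    ]


pairs = [
    # Values reported in steps 3) and 4) of this example.
    {"pair": ("g__Faecalibacterium", "g__Veillonella"),
     "p_value": 0.02, "cort": 0.971, "translation_time": 1},
    # Hypothetical pair, for illustration only.
    {"pair": ("species_x", "species_y"),
     "p_value": 0.40, "cort": 0.55, "translation_time": 0},
]

network_edges = screen_pairs(pairs)
```

Only the first pair passes both thresholds and would enter the interaction relationship network.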
Step 6) Flora clustering and dimensionality reduction.
In this embodiment, the distance between every two strains is calculated using DTW based on the Euclidean distance to form a strain distance matrix. The flora is clustered from the strain distance matrix by hierarchical clustering, and the number of clusters is selected according to the distances. The dimension of the strain distance matrix is reduced by principal coordinates analysis (PCoA) to obtain the coordinates of each strain after dimensionality reduction. As shown in Table 4, in this embodiment 4 clusters are selected by hierarchical clustering from the distance matrix calculated by DTW, and PCoA reduces the data to 2 dimensions to give the two-dimensional coordinates of each strain.
TABLE 4 (image in the original publication: Figure BDA0002627385960000102)
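The dimensionality reduction in step 6) can be illustrated with a small classical PCoA sketch that uses only the standard library. It double-centers the squared distance matrix and extracts the leading axes by power iteration with deflation. The function name, the iteration count and the starting vector are assumptions; axes with non-positive eigenvalues, which can occur because DTW distances need not be Euclidean, are set to zero. Hierarchical clustering is not shown.

```python
from math import sqrt


def pcoa(dist, n_axes=2, iterations=500):
    """Classical principal coordinates analysis of a symmetric distance matrix."""
    n = len(dist)
    squared = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in squared]
    grand_mean = sum(row_mean) / n
    # Gower double-centering: B = -1/2 * (D^2 - row means - column means + grand mean).
    b = [[-0.5 * (squared[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * n_axes for _ in range(n)]
    for axis in range(n_axes):
        # Deterministic, irregular starting vector for the power iteration.
        v = [((i + 1) * 0.6180339887 + 0.37 * axis) % 1.0 - 0.5 for i in range(n)]
        for _ in range(iterations):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigenvalue = sum(x * y for x, y in zip(v, bv))
        scale = sqrt(eigenvalue) if eigenvalue > 0 else 0.0
        for i in range(n):
            coords[i][axis] = v[i] * scale
        # Deflate so the next pass converges to the next axis.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigenvalue * v[i] * v[j]
    return coords
```

When the input distances are exactly Euclidean in two dimensions, the distances between the returned coordinates reproduce the input matrix; for DTW distances the plot is an approximation of the strain distances.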
Step 7) Visualizing the flora interaction network and identifying key strains.
The strains are drawn on a two-dimensional graph at their positions after dimensionality reduction and labeled according to the clustering result, and the relationship network between the strains is then drawn according to the strength of association, the direction of association and the order of change. The flora interaction network of this example is shown in FIG. 7. As can be seen from the interaction network diagram, categories 3 and 2 are positively correlated throughout, with category 3 appearing before category 2, while the species within category 3 and within category 2 are all positively correlated and have no translation time. In category 1, g__Faecalibacterium and g__Dialister are positively correlated with, and occur before, g__Ruminococcus and g__Veillonella, respectively. In terms of overall distance, category 1 and category 3 are closer, i.e. category 1 is more similar to category 3, and both are farther from category 2. It can be seen that g__Clostridium, g__Haemophilus and g__Coprobacillus in category 3 and g__Faecalibacterium and g__Dialister in category 1 may be key species of the interaction network. The key flora can be targeted in order to adjust an individual's flora interaction network.
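One simple way to shortlist candidate key strains from the network is to rank strains by how often they change first and by their number of connecting edges. The ranking rule below is an illustrative assumption, not the criterion defined by the invention, and it ignores the distance information. Edges are written as hypothetical `(earlier, later, ordered)` tuples, where `ordered` is False when the translation time is 0, and the strain labels are placeholders.

```python
from collections import Counter


def rank_candidates(edges):
    """Rank strains by number of edges they lead, then by total number of edges."""
    degree = Counter()
    leads = Counter()
    for earlier, later, ordered in edges:
        degree[earlier] += 1
        degree[later] += 1
        if ordered:
            leads[earlier] += 1
    return sorted(degree, key=lambda s: (-leads[s], -degree[s], s))


# Placeholder network, for illustration only.
edges = [
    ("A", "B", True),   # A changes before B
    ("A", "C", True),   # A changes before C
    ("B", "C", False),  # no order between B and C
    ("D", "C", True),   # D changes before C
]
ranking = rank_candidates(edges)
```

Here strain A leads two edges and is ranked first, followed by D, C and B.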
The invention provides an observation device based on human intestinal tract time sequence metagenome data.
The above embodiments are only preferred embodiments of the present invention, and the protection scope of the present invention is not limited thereby, and any insubstantial changes and substitutions made by those skilled in the art based on the present invention are within the protection scope of the present invention.

Claims (8)

1. A dynamic observation device based on time series metagenome data is characterized by comprising:
the collecting device is used for collecting samples of human intestinal excrement at different time points; the sequencing device is used for carrying out gene extraction and sequencing on the collected intestinal flora sample to obtain intestinal flora information; the pretreatment analysis device is used for analyzing the relative abundance information of the intestinal colony to obtain a colony to be analyzed; the data analysis device is used for carrying out strain similarity, strain correlation, strain interaction relation, flora clustering and dimensionality reduction, flora interaction network construction and key strain identification on the colonies to be analyzed; a storage device; a display;
the acquisition device, the sequencing device, the pretreatment analysis device and the data analysis device are sequentially connected from front to back; the storage device is respectively connected with the acquisition device, the preprocessing analysis device and the data analysis device; the display is connected with the data analysis device.
2. A detection method of the dynamic observation device based on time series metagenome data according to claim 1, comprising the following steps:
1) sample acquisition: acquiring human intestinal stool samples of the same individual at different time points, together with the corresponding basic information of the individual; performing gene extraction and sequencing on the intestinal flora of each sample, obtaining the intestinal flora information corresponding to each sample by reference genome alignment and annotation, and obtaining the relative abundance information of the intestinal flora;
2) preprocessing the intestinal flora information: according to the relative abundance information of the intestinal flora obtained in step 1), filtering the species to screen out species with low occurrence frequency and low abundance; after filtering, performing normalization, converting the data into wide format, and then standardizing each species; finally, filtering out the low-fluctuation species to obtain the candidate colonies to be analyzed;
3) calculating strain similarity: grouping the candidate colonies to be analyzed obtained in step 2) in pairs, obtaining the time series of the two strains in each pair, and stretching the two sequences by dynamic time warping (DTW) analysis to obtain new stretched sequences; then performing Pearson correlation analysis on the new sequences to obtain the correlation coefficient and the corresponding P value of the two new sequences;
4) calculating strain correlation and translation time: taking one strain sequence X of the two new sequences obtained in step 3) as the reference, translating the other strain sequence Y from negative to positive in unit time intervals, and filling the missing positions of the sequence with the tail value; performing correlation analysis using the Temporal Correlation Coefficient to obtain a correlation coefficient after each translation; then taking the Y sequence as the reference and translating the X sequence to calculate the correlation coefficients in the same way; selecting, among all the correlation coefficients, the one with the largest absolute value as the degree of correlation, wherein the sign of that correlation coefficient is the correlation direction, the corresponding translation position is the translation time of the two sequences, and the sign of the translation time indicates the order of precedence of the two sequences;
5) calculating the flora interaction relationship: forming a flora interaction relationship matrix from the correlation coefficient of each pair of colonies obtained in step 3) and the correlation coefficient and translation time of each pair of colonies obtained in step 4); combining the P value obtained in step 3) with the correlation coefficient obtained in step 4), setting threshold values for the P value and the correlation coefficient, and screening out the similar and correlated strain pairs that meet the thresholds to obtain the final flora interaction relationship network;
6) flora clustering and dimensionality reduction: calculating the distance between every two strains using the Euclidean-distance-based DTW (dynamic time warping) method to form a strain distance matrix; clustering the flora from the strain distance matrix by hierarchical clustering, and selecting the number of clusters according to the distances; reducing dimensionality from the strain distance matrix by principal coordinates analysis to obtain a dimension-reduced map and the coordinates of each strain after dimensionality reduction;
7) visualizing the flora interaction network and identifying key strains: plotting the strains on a two-dimensional map according to the dimension-reduced strain positions obtained in step 6), annotating the strains according to the clustering results, and finally drawing the interaction relationship network among the strains according to the correlation strength, the correlation direction and the order of precedence; and identifying key strains according to the interaction relationship network.
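As an illustrative, non-limiting sketch of step 6), the clustering and dimensionality reduction could be carried out on a precomputed strain distance matrix as follows. The function names `pcoa` and `cluster`, the use of average linkage, and the use of SciPy are assumptions made for illustration; the claims do not specify a linkage method or a software library.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform


def pcoa(dist, n_axes=2):
    """Principal coordinates analysis of a symmetric distance matrix."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centre the squared distances (Gower's transformation).
    b = -0.5 * centering @ (d ** 2) @ centering
    vals, vecs = np.linalg.eigh(b)
    order = np.argsort(vals)[::-1][:n_axes]      # largest eigenvalues first
    vals, vecs = vals[order], vecs[:, order]
    # Negative eigenvalues (non-Euclidean distances) are clipped to zero.
    return vecs * np.sqrt(np.clip(vals, 0.0, None))


def cluster(dist, n_clusters):
    """Hierarchical clustering; average linkage is an assumed choice."""
    condensed = squareform(np.asarray(dist, dtype=float), checks=False)
    tree = linkage(condensed, method="average")
    return fcluster(tree, t=n_clusters, criterion="maxclust")
```

In use, `dist` would be the strain-by-strain DTW distance matrix; `pcoa(dist)` returns one row of two-dimensional coordinates per strain, and `cluster(dist, k)` returns one cluster label per strain.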
3. The detection method of the dynamic observation device based on time series metagenome data according to claim 2, wherein in step 1), the number of time points is not less than 4.
4. The detection method of the dynamic observation device based on time series metagenome data according to claim 2, wherein in step 2), the relative abundances of all species at each level in each sample sum to 100%; the low-frequency species are strains appearing in 5% of samples, and the low-abundance species are species whose 90th percentile of relative abundance across all samples is less than 1; after filtering, normalization is performed and the relative abundance information matrix of the intestinal flora is converted into wide-format data, in which each row represents one species, each column represents one time point, and the columns are ordered chronologically; each species is then standardized; and low-fluctuation species, namely species with a standard deviation of 0, are removed.
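By way of a non-limiting example, the preprocessing described above might be sketched as follows. The function name `preprocess`, its parameter names, and the reading of "low frequency" as presence in no more than 5% of samples are assumptions for illustration only. Species with zero standard deviation are dropped before the division, since standardizing them would be undefined.

```python
import numpy as np


def preprocess(abundance, min_prevalence=0.05, min_p90=1.0):
    """abundance: rows = species, columns = time points (relative abundance, %).

    Returns the standardized matrix and the original row indices kept.
    """
    x = np.asarray(abundance, dtype=float)
    # Fraction of samples in which each species is detected (assumed meaning
    # of "occurrence frequency").
    prevalence = (x > 0).mean(axis=1)
    p90 = np.percentile(x, 90, axis=1)
    keep = (prevalence > min_prevalence) & (p90 >= min_p90)
    x = x[keep]
    # Re-normalize every sample (column) so the retained species sum to 100%.
    # Assumes every sample still has a non-zero total after filtering.
    x = 100.0 * x / x.sum(axis=0, keepdims=True)
    sd = x.std(axis=1)
    varying = sd > 0                      # remove low-fluctuation species
    x = x[varying]
    z = (x - x.mean(axis=1, keepdims=True)) / sd[varying][:, None]
    return z, np.flatnonzero(keep)[varying]
```

The returned index array makes it possible to map each standardized row back to its species name in the original table.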
5. The detection method of the dynamic observation device based on time series metagenome data according to claim 2, wherein in step 3), the time series of the two species to be analyzed are selected, and the distance and the warping path between the two species sequences are analyzed using DTW; the two strain sequences are stretched according to the path, and the abundance information is filled in at the stretched positions to obtain the new stretched sequences; the correlation coefficient is the similarity coefficient, representing the magnitude and direction of the similarity.
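A minimal sketch of this step, under stated assumptions, is given below: a textbook DTW with absolute difference as the local cost, back-tracking of the warping path, stretching of both sequences along that path, and a Pearson test via `scipy.stats.pearsonr`. The function names `dtw_path` and `stretch_and_correlate` are hypothetical, and the claims do not state which DTW step pattern or library is used.

```python
import numpy as np
from scipy.stats import pearsonr


def dtw_path(a, b):
    """Return the DTW distance and the warping path as (i, j) index pairs."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])          # 1-D Euclidean distance
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Back-track from the end, always moving to the cheapest predecessor.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        _, i, j = min((cost[i - 1, j - 1], i - 1, j - 1),
                      (cost[i - 1, j], i - 1, j),
                      (cost[i, j - 1], i, j - 1))
        path.append((i - 1, j - 1))
    path.reverse()
    return float(cost[n, m]), path


def stretch_and_correlate(a, b):
    """Stretch both series along the DTW path, then run a Pearson test."""
    dist, path = dtw_path(a, b)
    a_new = [a[i] for i, _ in path]               # abundance filled per path step
    b_new = [b[j] for _, j in path]
    r, p = pearsonr(a_new, b_new)
    return dist, a_new, b_new, float(r), float(p)
```

Both stretched sequences have the length of the warping path, so they can be compared point by point even when the original series are out of phase.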
6. The detection method of the dynamic observation device based on time series metagenome data according to claim 2, wherein in step 4), each translation yields one correlation coefficient, and the correlation coefficient with the largest absolute value among all the correlation coefficients is selected as the degree of correlation between the two sequences; the number of translations = (time length of the translated sequence − 3) × 2 + 1, the translated sequence retaining 3 original time points; an untranslated sequence, i.e. the original position, is 0, a leftward translation takes a negative value, and a rightward translation takes a positive value; the translation corresponding to the maximum correlation coefficient is the translation time of the two sequences; a translation time of 0 indicates that the reference sequence and the translated sequence have no order of precedence; a negative translation time indicates that the reference sequence appears first and the translated sequence appears later; and a positive translation time indicates that the translated sequence appears first and the reference sequence appears later.
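The translation scan can be illustrated with the following non-limiting sketch. It assumes that "Temporal Correlation Coefficient" denotes the first-order temporal correlation (the correlation of successive differences, often abbreviated CORT) and that "filling according to the tail value" means repeating the nearest edge value of the translated sequence; both readings, and the names `cort`, `shift` and `best_shift`, are assumptions. Only the translation of Y against a reference X is shown; the reverse direction would call the same function with the arguments swapped.

```python
import numpy as np


def cort(x, y):
    """First-order temporal correlation of two equal-length series."""
    dx, dy = np.diff(x), np.diff(y)
    denom = np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
    return float((dx * dy).sum() / denom) if denom > 0 else 0.0


def shift(y, k):
    """Translate y by k steps (negative = left), padding with the edge value."""
    y = np.asarray(y, dtype=float)
    if k > 0:
        return np.concatenate([np.full(k, y[0]), y[:-k]])
    if k < 0:
        return np.concatenate([y[-k:], np.full(-k, y[-1])])
    return y.copy()


def best_shift(x, y, keep=3):
    """Scan all translations that leave at least `keep` original points of y."""
    max_k = len(y) - keep
    scores = {k: cort(x, shift(y, k)) for k in range(-max_k, max_k + 1)}
    k_best = max(scores, key=lambda k: abs(scores[k]))
    # len(scores) equals (len(y) - keep) * 2 + 1, the number of translations.
    return k_best, scores[k_best], len(scores)
```

For a series of 8 time points this scans 11 translations, from -5 to +5. A negative best translation means the translated series had to be moved left to match the reference, i.e. the reference changes first.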
7. The detection method of the dynamic observation device based on time series metagenome data according to claim 2, wherein in step 5), the similar and correlated strain pairs have a P value of less than 0.05 and a correlation coefficient of greater than 0.7.
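For illustration only, the screening of strain pairs could be expressed as a simple filter. The `PairResult` record and `build_edges` function are hypothetical names, and applying the 0.7 threshold to the absolute value of the correlation coefficient is an assumption based on the sign being treated separately as the correlation direction.

```python
from dataclasses import dataclass


@dataclass
class PairResult:
    a: str            # first strain
    b: str            # second strain
    p_value: float    # P value from the Pearson test on the stretched sequences
    corr: float       # correlation coefficient at the best translation
    shift: int        # translation time; the sign gives the order of precedence


def build_edges(pairs, p_max=0.05, corr_min=0.7):
    """Keep only the pairs that satisfy both thresholds."""
    return [p for p in pairs if p.p_value < p_max and abs(p.corr) > corr_min]
```

Each retained record corresponds to one edge of the interaction network, with `corr` giving its strength and direction and `shift` giving its arrow.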
8. The detection method of the dynamic observation device based on time series metagenome data according to claim 2, wherein in step 7), category labels and interaction relationships are added to the dimension-reduced map: the thickness of the interaction line between every two strains represents the degree of correlation, and different line types represent positive and negative correlation directions; a line with an arrow indicates the order of precedence of the two strains; if there is no translation, a line without an arrow is used; different core species are connected according to the interaction relationship network, with the number of connecting edges comprehensively considered; and potential key species in the species interaction network are identified by monitoring the species that change earlier, as well as the inter-species distances.
CN202010801019.1A 2020-08-11 2020-08-11 Dynamic observation device based on time series metagenome data and detection method thereof Active CN112151117B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010801019.1A CN112151117B (en) 2020-08-11 2020-08-11 Dynamic observation device based on time series metagenome data and detection method thereof

Publications (2)

Publication Number Publication Date
CN112151117A (en) 2020-12-29
CN112151117B (en) 2023-02-03

Family

ID=73887983

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010801019.1A Active CN112151117B (en) 2020-08-11 2020-08-11 Dynamic observation device based on time series metagenome data and detection method thereof

Country Status (1)

Country Link
CN (1) CN112151117B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108078540A (en) * 2016-11-23 2018-05-29 中国科学院昆明动物研究所 Based on human flora's interaction network analysis and evaluation body health and the method to diagnose the illness
KR20190025180A (en) * 2017-08-31 2019-03-11 주식회사 이노아이엔씨 Application method of gut microbiome analysis for animal health monitoring
CN111161794A (en) * 2018-12-30 2020-05-15 深圳碳云智能数字生命健康管理有限公司 Intestinal microorganism sequencing data processing method and device, storage medium and processor
CN111261231A (en) * 2019-12-03 2020-06-09 康美华大基因技术有限公司 Construction method, analysis method and device of intestinal flora metagenome database
CN111415705A (en) * 2020-02-26 2020-07-14 康美华大基因技术有限公司 Method and medium for making related intestinal flora detection report
CN111462819A (en) * 2020-02-26 2020-07-28 康美华大基因技术有限公司 Method for analyzing intestinal microorganism detection data, automatic interpretation system and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
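Hu Haibing et al.: "Study on intestinal flora diversity in patients with coronary heart disease based on high-throughput sequencing technology", Journal of Shanghai Jiao Tong University (Agricultural Science) *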

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113626442A (en) * 2021-08-25 2021-11-09 李成良 High-efficiency biological information data processing method and system
CN113626442B (en) * 2021-08-25 2024-02-27 深圳市前海高新国际医疗管理有限公司 High-efficiency biological information data processing method and system
CN115116542A (en) * 2022-07-04 2022-09-27 厦门大学 Metagenome-based sample specific species interaction network construction method and system
CN114943056A (en) * 2022-07-25 2022-08-26 天津医科大学总医院 Data processing method and device for bacterial interaction relationship in vaginal microecology
CN114943056B (en) * 2022-07-25 2022-10-21 天津医科大学总医院 Data processing method and device for bacterial interaction relationship in vaginal microecology
WO2024077533A1 (en) * 2022-10-12 2024-04-18 深圳华大基因科技服务有限公司 Method and system for constructing dynamic gene regulatory network, and computer device

Also Published As

Publication number Publication date
CN112151117B (en) 2023-02-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant