CN118072825A - Method for identifying microorganisms in soil and analyzing interaction - Google Patents
- Publication number
- CN118072825A (application number CN202410160314.1A)
- Authority
- CN
- China
- Prior art keywords
- sequence
- microorganisms
- microorganism
- network
- analyzing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B25/00—ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
- G16B25/10—Gene or protein expression profiling; Expression-ratio estimation or normalisation
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
Abstract
The invention provides a method for identifying microorganisms in soil and analyzing their interactions. The method comprises: obtaining microbial DNA sequences, establishing a sequence quality scoring function, and preprocessing the sequences; analyzing the biological properties of the microbial DNA sequences in depth, extracting sequence features, and clustering the sequences with a fusion method based on a hybrid metaheuristic algorithm; and constructing a probabilistic graphical model, identifying microorganisms from their DNA sequences, designing an algorithm for constructing a microbial interaction network, and quantitatively analyzing the interactions among microorganisms. The method addresses the problems that the prior art lacks the ability to analyze microbial DNA sequences in depth and cannot comprehensively extract and analyze the multidimensional characteristics of microorganisms. It can reveal the multidimensional characteristics, interactions, and community structure of microorganisms more accurately and comprehensively, and provides an important theoretical basis and experimental data for research in microbial ecology and environmental science.
Description
Technical Field
The invention relates to the field of microorganism identification, in particular to a method for identifying microorganisms and analyzing interaction in soil.
Background
Soil is a complex ecological system in which the microbial community plays a vital role, having profound effects on the health, fertility and ecological function of the soil. Microbial interactions, diversity and functionality are central to soil ecology research. However, due to the wide variety of soil microorganisms and complex interactions, conventional methods of microorganism research often have difficulty in accurately and comprehensively revealing the true status and interactions of microorganisms.
With the development of molecular biology and computational biology, the high-throughput sequencing technology provides a new tool for the research of microbial communities, and can directly acquire a large amount of microbial DNA sequence information from environmental samples under the condition of no culture. However, how to accurately identify microorganisms from these vast amounts of data, analyze their interactions, and reveal their function in the soil remains a great challenge.
Chinese patent application No. CN202110420635.7, published on 2021.11.16, discloses a method for identifying key microorganisms for corn straw carbon assimilation in soil. It aims to solve the technical problem that the prior art cannot identify, at the species level, the key microbial groups that assimilate straw carbon or their succession patterns during straw decomposition. Based on stable isotope probing, that invention takes the separated 13C-labeled microbial DNA, selects a representative sequence of each OTU for taxonomic analysis, and statistically analyzes community composition at each taxonomic level. It then performs alpha-/beta-diversity and co-occurrence network analysis on the corn straw carbon-assimilating bacterial communities to identify the key straw carbon-assimilating microbial communities and their succession patterns. This reveals the microbiological mechanism of straw decomposition in the target soil and supports the development of targeted, efficient corn straw decomposition agents that promote rapid straw decomposition, thereby improving the soil, saving fertilizer, and increasing efficiency.
However, in the course of implementing the technical scheme of the embodiments of the application, the inventors found that the above technology has at least the following technical problems: the prior art lacks the ability to analyze microbial DNA sequences in depth and cannot comprehensively extract and analyze the multidimensional characteristics of microorganisms, so its understanding of their biological properties and functional characteristics is neither deep nor comprehensive; it also lacks an effective method for quantitatively analyzing microbial interactions and does not make full use of multidimensional network analysis to reveal the multidimensional structural and functional characteristics of microbial communities, which limits in-depth understanding of the soil microbial ecosystem.
Disclosure of Invention
The embodiments of the application address the following problems: the prior art lacks the ability to analyze microbial DNA sequences in depth and cannot comprehensively extract and analyze the multidimensional characteristics of microorganisms, so its understanding of their biological properties and functional characteristics is neither deep nor comprehensive; it also lacks an effective method for quantitatively analyzing microbial interactions and does not make full use of multidimensional network analysis to reveal the multidimensional structural and functional characteristics of microbial communities, which limits in-depth understanding of the soil microbial ecosystem. The application provides an advanced method for identifying microorganisms in soil and analyzing their interactions. Through deep learning and multidimensional network analysis, it can reveal the multidimensional characteristics, interactions, and community structure of microorganisms more accurately and comprehensively, and it provides an important theoretical basis and experimental data for research in microbial ecology and environmental science.
The application provides a method for identifying microorganisms in soil and analyzing interaction, which specifically comprises the following technical scheme:
A method for identifying microorganisms in soil and analyzing their interactions comprises the following steps:
S100: acquiring a microorganism DNA sequence, setting a sequence quality scoring function, and performing sequence pretreatment;
S200: analyzing the biological properties of the microbial DNA sequences in depth, extracting sequence features, and clustering the sequences with a fusion method based on a hybrid metaheuristic algorithm;
S300: constructing a probabilistic graphical model, identifying microorganisms from their DNA sequences, designing an algorithm for constructing a microbial interaction network, and quantitatively analyzing the interactions among microorganisms.
Preferably, the step S100 specifically includes:
In the sequence preprocessing stage, a sequence quality scoring function is established that jointly considers the variability, purity, and complexity of a sequence: the variability score measures the degree of variation of the bases in the sequence, the purity score measures whether impurity sequences are present in the sequence, and the complexity score measures how complex the sequence is. After preprocessing, the preprocessed sequences are compared with a reference database, and sequences that match no known sequence are removed.
Preferably, the step S200 specifically includes:
The similarity between sequences is calculated on the basis of an in-depth analysis of the sequence features, with the weights of the different features taken into account. Cosine similarity measures the angle between feature vectors and the weighted Euclidean distance measures the distance between them, so the two are fused to measure the similarity between sequences; the accuracy of the similarity calculation is further improved by optimizing the weights of the different features.
Preferably, the step S200 specifically includes:
In the optimization process, a fusion method based on a hybrid metaheuristic algorithm is adopted. This method introduces a fitness function based on information entropy to measure the quality of a clustering scheme, and a mutation operation based on neighborhood search to enhance the search capability of the algorithm.
Preferably, the step S200 specifically includes:
Initial clustering is carried out according to the similarity: a similarity threshold is established, and if the similarity of two sequences is smaller than the similarity threshold, the two sequences are assigned to the same class. A candidate clustering scheme C' is then randomly selected from the neighborhood of the initial clustering scheme C, the fitness values (that is, the information entropies) of the candidate scheme and the current initial scheme are calculated, and whether to accept the candidate clustering scheme C' is decided according to the simulated annealing criterion.
Preferably, the step S200 specifically includes:
If the fitness value of the candidate clustering scheme C' is higher than that of the initial clustering scheme C, or if the acceptance probability criterion is met, the candidate clustering scheme C' is accepted as the new clustering scheme. The acceptance probability criterion is as follows: if the fitness value of the candidate clustering scheme C' is smaller than that of the initial clustering scheme C, the candidate clustering scheme C' is accepted with the acceptance probability P_accept(C, C').
Preferably, the step S300 specifically includes:
In classification with the probabilistic graphical model, the parent node sets are analyzed in depth on the basis of the interrelationships and dependencies among microorganisms, and a probabilistic graphical model is constructed to describe these interrelationships and dependencies; by learning the structure and parameters of the network, the set of parent nodes of each node can be obtained.
Preferably, the step S300 specifically includes:
A deep learning model is used to learn a high-level representation of the data that can capture complex patterns and structures in the data; the feature vector X_i(t) of each microorganism at each time point t is converted into a new representation Y_i(t), and the model parameters θ_g are learned on a large amount of training data by an optimization algorithm that minimizes the reconstruction error.
Preferably, the step S300 specifically includes:
A dynamic weight calculation method based on evolutionary dynamics is provided: evolutionary game theory and dynamical systems theory are used to analyze the evolutionary dynamics of the microbial abundance data, from which the dynamic interaction weights among microorganisms are calculated. To construct the microbial interaction network, a threshold setting method based on an asymmetric information criterion is introduced: the asymmetric information of the interaction network is calculated under different thresholds, and the threshold with the largest asymmetric information is selected as the optimal threshold. The threshold determines the edges of the network, that is, the interactions between microorganisms; by maximizing the asymmetric information, a network can be constructed that reveals the true interactions between microorganisms;
Based on the constructed microbial interaction network, a network optimization method based on multidimensional analysis is provided to reveal the multidimensional structure and functional characteristics of the microbial community: the network N is projected onto each dimension d to obtain its projection in that dimension, and multidimensional data analysis theory is applied to the projection in each dimension to extract the multidimensional features of the network.
The beneficial effects are that:
The technical schemes provided by the embodiment of the application have at least the following technical effects or advantages:
1. Quality control and filtering are carried out on the original sequence by setting up a sequence quality scoring function, and low-quality and non-target sequences are removed, so that the accuracy and reliability of the sequence are ensured; by comparing with the reference database, sequences which are not matched with any known sequence are removed, and the accuracy of identification and classification is further ensured.
2. By analyzing the biological properties of the microbial DNA sequences in depth, a set of multidimensional features is extracted, so that the biological properties and functional characteristics of the microorganisms can be understood more comprehensively and deeply; the sequences are clustered with a fusion method based on a hybrid metaheuristic algorithm that combines the simulated annealing algorithm with a fitness function based on information entropy, so that microorganisms can be clustered and classified more efficiently and accurately.
3. By constructing a microbial interaction network, the interaction among microorganisms can be quantitatively analyzed, the multidimensional structure and the functional characteristics of a microbial community are revealed, and the method has important significance for deeply understanding the structure and the function of a soil microbial ecological system; the network optimization method based on the multi-dimensional analysis is provided, and the multi-dimensional characteristics of the network can be extracted under different dimensions, and the multi-dimensional network analysis is performed, so that the multi-dimensional structure and the functional characteristics of the microbial community are more comprehensively and deeply disclosed.
4. The technical scheme of the application can effectively address the problems that the prior art lacks the ability to analyze microbial DNA sequences in depth and cannot comprehensively extract and analyze the multidimensional characteristics of microorganisms, so that its understanding of their biological properties and functional characteristics is neither deep nor comprehensive, and that it lacks an effective method for quantitatively analyzing microbial interactions and does not make full use of multidimensional network analysis to reveal the multidimensional structural and functional characteristics of microbial communities, which limits in-depth understanding of the soil microbial ecosystem. As verified, the system or method realizes an advanced method for identifying microorganisms in soil and analyzing their interactions; through deep learning and multidimensional network analysis, it can reveal the multidimensional characteristics, interactions, and community structure of microorganisms more accurately and comprehensively, and it provides an important theoretical basis and experimental data for research in microbial ecology and environmental science.
Drawings
FIG. 1 is a flowchart of a method for identifying and analyzing interactions of microorganisms in soil according to the present application.
Detailed Description
The embodiments of the application address the following problems: the prior art lacks the ability to analyze microbial DNA sequences in depth and cannot comprehensively extract and analyze the multidimensional characteristics of microorganisms, so its understanding of their biological properties and functional characteristics is neither deep nor comprehensive; it also lacks an effective method for quantitatively analyzing microbial interactions and does not make full use of multidimensional network analysis to reveal the multidimensional structural and functional characteristics of microbial communities, which limits in-depth understanding of the soil microbial ecosystem.
The technical scheme in the embodiment of the application aims to solve the problems, and the overall thought is as follows:
A sequence quality scoring function is established to perform quality control and filtering on the raw sequences, removing low-quality and non-target sequences so that the sequences are accurate and reliable. Sequences that match no known sequence are removed by comparison with a reference database, which further safeguards the accuracy of identification and classification. By analyzing the biological properties of the microbial DNA sequences in depth, a set of multidimensional features is extracted, so that the biological properties and functional characteristics of the microorganisms can be understood more comprehensively and deeply. The sequences are clustered with a fusion method based on a hybrid metaheuristic algorithm; by combining the simulated annealing algorithm with a fitness function based on information entropy, microorganisms can be clustered and classified more efficiently and accurately. By constructing a microbial interaction network, the interactions among microorganisms can be analyzed quantitatively and the multidimensional structure and functional characteristics of the microbial community revealed, which is important for understanding the structure and function of the soil microbial ecosystem. A network optimization method based on multidimensional analysis is also provided, which extracts the multidimensional features of the network in different dimensions and performs multidimensional network analysis, so that the multidimensional structure and functional characteristics of the microbial community are revealed more comprehensively and deeply.
In order to better understand the above technical solutions, the following detailed description will refer to the accompanying drawings and specific embodiments.
Referring to fig. 1, the method for identifying and analyzing the interaction of microorganisms in soil according to the present application comprises the following steps:
S100: acquiring a microorganism DNA sequence, setting a sequence quality scoring function, and performing sequence pretreatment;
Soil samples are collected and microbial DNA is extracted from them. The microbial DNA sequences are obtained using high-throughput sequencing technology. After the raw sequences are obtained, sequence preprocessing is performed: the raw sequences undergo quality control and filtering to remove low-quality and non-target sequences.
In the sequence preprocessing stage, a sequence quality scoring function is established, wherein the variability, purity and complexity of the sequence are comprehensively considered by the function, and the specific formula is as follows:
Q(s_i) = η·V(s_i) + θ·P(s_i) + ι·H(s_i)
Where Q is the sequence quality score, s_i is the original sequence, and η, θ, and ι are weight parameters that adjust the contributions of the variability score V, the purity score P, and the complexity score H to the overall score. The variability score V(s_i) measures the degree of variation of the bases in the sequence; the variability score function is defined as:
Where N is the number of base types in the sequence (for DNA sequences N = 4: A, T, C, and G, i.e., adenine, thymine, cytosine, and guanine); n_k is the count of the k-th base in the sequence; L is the length of the sequence; and the mean and the variance σ² of the base counts are also used. By statistically analyzing the base distribution and introducing an exponential term, the variability score function measures the variability of the sequence more accurately. The purity score P(s_i) measures whether impurity sequences are present in the sequence; the purity score function is defined as:
Where M is the number of possible impurity sequences; p_m is the proportion of the m-th impurity sequence in the sequence; and λ is a weight parameter. The purity score function measures sequence purity more fully by combining information-theoretic entropy with the Gini coefficient. The complexity score H(s_i) measures the complexity of the sequence; the complexity score function is defined as:
Where L is the length of the sequence; f_l is the base frequency at the l-th position in the sequence; and φ is a weight parameter.
The variability score, purity score, and complexity score are all normalized, ranging between [0,1], with larger values representing higher variability, purity, and complexity. A quality score threshold is set according to the experimental purposes and the data quality requirements. Only sequences with quality scores above this threshold will be retained, sequences below this threshold will be culled. For example, if the threshold is set to 0.7, then all sequences with quality scores below 0.7 will be rejected. The remaining sequences are length filtered to remove sequences that are too long or too short. By these scores, the quality of the sequences can be more accurately assessed and low quality sequences effectively filtered out in order to ensure accuracy of subsequent alignment and classification.
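The weighted scoring and threshold filtering described above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the three sub-scores are taken as precomputed, already-normalized inputs, and the weight values, the length bounds, and the names `SubScores`, `quality_score`, and `filter_sequences` are assumptions made for this example. Only the 0.7 threshold comes from the text.

```python
from dataclasses import dataclass


@dataclass
class SubScores:
    """Normalized sub-scores in [0, 1], assumed to be computed elsewhere."""
    variability: float  # V(s_i)
    purity: float       # P(s_i)
    complexity: float   # H(s_i)


def quality_score(sub, eta=0.4, theta=0.3, iota=0.3):
    """Q(s_i) = eta*V + theta*P + iota*H; the weight values are illustrative."""
    return eta * sub.variability + theta * sub.purity + iota * sub.complexity


def filter_sequences(scored, threshold=0.7, min_len=200, max_len=500):
    """Keep sequences whose score reaches the threshold and whose length is in range.

    `scored` is a list of (sequence, SubScores) pairs. The length bounds are
    hypothetical; the text only says too-long and too-short sequences are removed.
    """
    kept = []
    for seq, sub in scored:
        if quality_score(sub) < threshold:
            continue  # below the quality threshold: rejected
        if not (min_len <= len(seq) <= max_len):
            continue  # too short or too long: rejected
        kept.append(seq)
    return kept
```

In practice the weights and the threshold would be chosen from the experimental purpose and data quality requirements, as the paragraph above notes.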
After sequence preprocessing, the preprocessed sequences are compared with a reference database, and sequences that match no known sequence are removed. A DNA-sequence-based microorganism identification algorithm then extracts features from the DNA sequences of the microorganisms in the soil sample to identify and classify them accurately, which addresses the problem of accurately assessing soil microbial diversity.
S200: analyzing the biological properties of the microbial DNA sequences in depth, extracting sequence features, and clustering the sequences with a fusion method based on a hybrid metaheuristic algorithm;
By further analyzing the biological properties of the microbial DNA sequences, a set of multidimensional features is extracted, for example the GC content of the sequence, the base frequencies, and the k-mer frequencies. The feature vector is defined as:
F(s_i) = [GC(s_i), Freq_A(s_i), Freq_T(s_i), Freq_C(s_i), Freq_G(s_i), …]
Where F(s_i) is the feature vector of sequence s_i, and GC(s_i), Freq_A(s_i), Freq_T(s_i), Freq_C(s_i), and Freq_G(s_i) are the GC content of s_i and the frequencies of the individual bases. Each feature is derived from biological knowledge; for example, the GC content is calculated by:
GC(s_i) = (Count_G(s_i) + Count_C(s_i)) / Length(s_i)
Where Count_G(s_i) and Count_C(s_i) are the numbers of bases G and C in the sequence s_i, respectively, and Length(s_i) is the length of the sequence s_i.
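As an illustration of the feature extraction above, the following Python sketch computes the GC content, the four base frequencies, and k-mer frequencies for a sequence. The function names, the choice of k = 2, and the handling of empty sequences are assumptions made for this example; the text does not specify how k-mer features are laid out.

```python
from collections import Counter
from itertools import product

BASES = "ATCG"  # same order as Freq_A, Freq_T, Freq_C, Freq_G in the feature vector


def gc_content(seq):
    """(Count_G + Count_C) / Length; returns 0.0 for an empty sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def base_frequencies(seq):
    """Relative frequency of each base, in the order A, T, C, G."""
    seq = seq.upper()
    n = len(seq)
    return [seq.count(b) / n if n else 0.0 for b in BASES]


def kmer_frequencies(seq, k=2):
    """Relative frequency of every possible k-mer over the DNA alphabet."""
    seq = seq.upper()
    total = len(seq) - k + 1  # number of overlapping k-mers
    counts = Counter(seq[i:i + k] for i in range(max(total, 0)))
    kmers = ("".join(p) for p in product(BASES, repeat=k))
    return [counts[m] / total if total > 0 else 0.0 for m in kmers]


def feature_vector(seq, k=2):
    """F(s_i) = [GC, Freq_A, Freq_T, Freq_C, Freq_G, k-mer frequencies...]."""
    return [gc_content(seq)] + base_frequencies(seq) + kmer_frequencies(seq, k)
```

With k = 2 the vector has 1 + 4 + 16 = 21 entries; larger k gives a richer but sparser representation.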
After feature extraction, the similarity between sequences is calculated on the basis of an in-depth analysis of the sequence features, with the weights of the different features taken into account. Cosine similarity measures the angle between feature vectors and the weighted Euclidean distance measures the distance between them, so the two are fused to measure the similarity between sequences more comprehensively. The accuracy of the similarity calculation is further improved by optimizing the weights of the different features. The similarity is therefore calculated as:
Where S(s_i, s_k) is the similarity between sequence s_i and sequence s_k, ω is the weight parameter that balances the cosine similarity against the weighted Euclidean distance, ω_m is the weight parameter of the m-th feature, and Φ_m(s_i) is the m-th feature of sequence s_i.
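One way to fuse the two measures is sketched below in Python. The exact fusion formula is not reproduced here, so the combination used in `fused_similarity` (a convex mix of the cosine similarity and the distance mapped to a similarity by 1 / (1 + d)) is an assumption chosen for illustration, as are all function names and default values.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return dot / (norm_x * norm_y)


def weighted_euclidean(x, y, weights):
    """Euclidean distance with a per-feature weight omega_m."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))


def fused_similarity(x, y, weights, omega=0.5):
    """Assumed fusion: omega * cosine + (1 - omega) * 1 / (1 + distance).

    The 1 / (1 + d) mapping turns a distance into a similarity in (0, 1];
    it is a common choice, not necessarily the one used in the patent.
    """
    cos = cosine_similarity(x, y)
    dist = weighted_euclidean(x, y, weights)
    return omega * cos + (1.0 - omega) / (1.0 + dist)
```

Under this assumed form, identical vectors score 1 and the score falls as the vectors diverge in angle or in weighted distance; ω and the per-feature weights are the quantities the text says are optimized.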
The sequences are then clustered on the basis of the similarity calculation. The dynamic clustering optimization adopts a fusion method based on a hybrid metaheuristic algorithm. The method retains the advantages of the simulated annealing algorithm and introduces a fitness function based on information entropy to measure the quality of a clustering scheme. A mutation operation based on neighborhood search is also introduced to enhance the search capability of the algorithm.
Specifically, initial clustering is performed according to the similarity: a similarity threshold is established, and if the similarity of two sequences is smaller than the similarity threshold, the two sequences are assigned to the same class. A candidate clustering scheme $C'$ is then randomly selected from the neighborhood of the initial clustering scheme $C$, the fitness values (that is, the information entropies) of the candidate scheme and the current scheme are calculated, and the simulated annealing criterion determines whether $C'$ is accepted. If the fitness value of $C'$ is higher than that of $C$, or the acceptance probability criterion is met, $C'$ is accepted as the new clustering scheme. The acceptance probability criterion is as follows: if the fitness value of $C'$ is smaller than that of $C$, $C'$ is accepted with probability $P_{accept}(C, C')$.
The information-entropy fitness function accounts for the distribution of sequences within each cluster through the occurrence frequency of each sequence in the cluster, which yields a more accurate fitness value.
In the fitness function, $E(C)$ represents the information entropy of clustering scheme $C$ and measures its quality; $c$ is one cluster in $C$ and comprises a plurality of sequences $s_i$; and $p(s_i \mid c)$ represents the occurrence probability of sequence $s_i$ in cluster $c$, calculated by dividing the frequency of the sequence in the cluster by the total number of sequences in the cluster. The neighborhood search formula is:
$$C' = \arg\min_{C' \in N(C, O)} E(C')$$
Where $C'$ represents the new clustering scheme obtained after the neighborhood search, $N(C, O)$ represents the neighborhood of clustering scheme $C$ obtained by a series of neighborhood operations $O$, where $O$ comprises combinations of operations such as exchange, insertion and inversion, and $E(C')$ represents the information entropy of the new clustering scheme $C'$.
In the simulated annealing acceptance probability formula, $P_{accept}(C, C')$ represents the probability of accepting a transition from the current clustering scheme $C$ to the new clustering scheme $C'$; $H$ represents the external field strength and $M$ the magnetic susceptibility, two parameters that can be used to regulate the acceptance probability and thereby better control the search process of the algorithm; and $T$ represents the current temperature.
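The clustering search can be sketched as follows. Several choices here are assumptions, because the corresponding formulas are not reproduced in this text: the entropy is taken as E(C) = −Σ_c Σ_s p(s|c)·log p(s|c), lower E is treated as better (following the argmin in the neighborhood search formula), and acceptance uses the standard Metropolis rule exp(−ΔE/T) without the H and M parameters. Only the "insertion" neighborhood operation is shown, and all names are hypothetical.

```python
import math
import random
from collections import Counter


def clustering_entropy(clusters):
    """Assumed form: E(C) = -sum over clusters c and sequences s of p(s|c) * log p(s|c)."""
    total = 0.0
    for cluster in clusters:
        n = len(cluster)
        for count in Counter(cluster).values():
            p = count / n  # frequency of the sequence divided by cluster size
            total -= p * math.log(p)
    return total


def insertion_move(clusters, rng):
    """One 'insertion' operation: move a random item into a different cluster."""
    new = [list(c) for c in clusters]
    sources = [i for i, c in enumerate(new) if len(c) > 1]
    if not sources or len(new) < 2:
        return new  # nothing can be moved without emptying a cluster
    src = rng.choice(sources)
    dst = rng.choice([i for i in range(len(new)) if i != src])
    new[dst].append(new[src].pop(rng.randrange(len(new[src]))))
    return new


def accept_probability(e_current, e_candidate, temperature):
    """Standard Metropolis rule (an assumption): improvements are always accepted."""
    if e_candidate <= e_current:
        return 1.0
    return math.exp(-(e_candidate - e_current) / temperature)


def anneal(clusters, t_start=1.0, t_end=1e-3, cooling=0.95, seed=0):
    """Simulated annealing over clustering schemes; returns (best scheme, best E)."""
    rng = random.Random(seed)
    current = [list(c) for c in clusters]
    e_current = clustering_entropy(current)
    best, e_best = current, e_current
    t = t_start
    while t > t_end:
        candidate = insertion_move(current, rng)
        e_candidate = clustering_entropy(candidate)
        if rng.random() < accept_probability(e_current, e_candidate, t):
            current, e_current = candidate, e_candidate
            if e_current < e_best:
                best, e_best = current, e_current
        t *= cooling  # geometric cooling schedule, chosen for illustration
    return best, e_best
```

Each cluster is a list of sequence labels, so repeated labels model the occurrence frequency that p(s|c) depends on. Exchange and inversion moves could be added next to `insertion_move` in the same way.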
S300: constructing a probabilistic graphical model, identifying microorganisms based on the DNA sequences, designing a microbial interaction network construction algorithm, and quantitatively analyzing the interactions among microorganisms.
Microorganisms are identified and classified on the basis of the clustering. In the probabilistic graphical model classification, the parent node set is obtained from an in-depth analysis of the interrelationships and dependencies among microorganisms, which a probabilistic graphical model can describe more accurately. For example, a Bayesian network can be constructed in which nodes represent categories of microorganisms and edges represent dependencies between microorganisms. By learning the structure and parameters of the network, the parent node set of each node can be obtained. The probabilistic graphical model classification is therefore formulated as:
$$P(R \mid C') = \prod_i P(r_i \mid Pa(r_i))$$
Where $P(R \mid C')$ represents the probability of the set of microorganism classes $R$ given the clustering scheme $C'$, $r_i$ represents the microorganism class of sequence $s_i$, $P(r_i \mid Pa(r_i))$ represents the conditional probability of node $r_i$ given its parent nodes, and $Pa(r_i)$ represents the parent node set of class $r_i$. This realizes accurate identification and classification of microorganisms and provides a new perspective for understanding microbial diversity and richness.
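A Bayesian network factorizes the joint probability of all nodes into the product of each node's conditional probability given its parents. The sketch below evaluates that product for a toy two-node network; the node names, class labels, and probability values are illustrative assumptions, not data from the application.

```python
def joint_probability(assignment, parents, cpts):
    """P(R) = product over nodes of P(r_i | Pa(r_i)) for a discrete Bayesian network.

    assignment: node -> class label
    parents:    node -> tuple of parent nodes
    cpts:       node -> {tuple of parent labels -> {label -> probability}}
    """
    prob = 1.0
    for node, label in assignment.items():
        parent_labels = tuple(assignment[p] for p in parents[node])
        prob *= cpts[node][parent_labels][label]
    return prob


# Hypothetical structure: r2 depends on r1 (illustrative numbers only).
PARENTS = {"r1": (), "r2": ("r1",)}
CPTS = {
    "r1": {(): {"Bacillus": 0.6, "Pseudomonas": 0.4}},
    "r2": {
        ("Bacillus",): {"Bacillus": 0.7, "Pseudomonas": 0.3},
        ("Pseudomonas",): {"Bacillus": 0.2, "Pseudomonas": 0.8},
    },
}

p = joint_probability({"r1": "Bacillus", "r2": "Pseudomonas"}, PARENTS, CPTS)
print(round(p, 2))  # 0.6 * 0.3
```

In practice the structure (`PARENTS`) and the conditional probability tables (`CPTS`) would be learned from data, as the description states; here they are fixed by hand.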
A microbial interaction network construction algorithm is designed to quantitatively analyze the interactions among microorganisms. The input microbial feature data are first preprocessed by a deep generative model, which converts the microbial abundance data into a new representation so as to reveal potential nonlinear structure in the data. Specifically, a deep learning model learns a high-level representation of the data that can capture complex patterns and structures. The feature vector $X_i(t)$ of each microorganism at each time point $t$ is converted into a new representation $Y_i(t)$, and the model parameters $\theta_g$ are learned on a large amount of training data by an optimization algorithm that minimizes the reconstruction error. The specific formula is:
$$Y_i(t) = \mathrm{DeepGen}(X_i(t); \theta_g)$$
Where DeepGen denotes the deep generative model.
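The idea of learning parameters by minimizing reconstruction error can be illustrated with a much simpler stand-in than a deep generative model. The sketch below trains a single-unit, tied-weight linear autoencoder by gradient descent in plain Python; it is not DeepGen, and the data values, learning rate, and step count are assumptions chosen for the example.

```python
def encode(x, w):
    """Stand-in for Y = DeepGen(X; theta_g): one linear unit, y = w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def reconstruction_loss(data, w):
    """Sum over samples of ||x - y * w||^2, with y = encode(x, w)."""
    loss = 0.0
    for x in data:
        y = encode(x, w)
        loss += sum((xi - y * wi) ** 2 for xi, wi in zip(x, w))
    return loss


def train(data, w, lr=0.001, steps=500):
    """Plain gradient descent on the reconstruction loss."""
    w = list(w)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x in data:
            y = encode(x, w)
            residual = [xi - y * wi for xi, wi in zip(x, w)]
            rw = sum(ri * wi for ri, wi in zip(residual, w))
            for j in range(len(w)):
                # d/dw_j ||x - (w.x) w||^2 = -2 * ((r.w) * x_j + y * r_j)
                grad[j] += -2.0 * (rw * x[j] + y * residual[j])
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


# Hypothetical abundance vectors for three time points (illustrative values).
data = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9]]
w0 = [0.5, 0.5]
w = train(data, w0)
print(reconstruction_loss(data, w0), reconstruction_loss(data, w))
```

A deep model would replace `encode` with a nonlinear multi-layer network, but the training objective has the same shape: adjust the parameters until the decoded output reproduces the input.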
Based on evolutionary dynamics, a dynamic weight calculation method is provided: the evolutionary dynamics of the microbial abundance data are analyzed using evolutionary game theory and dynamical systems theory, and the dynamic interaction weights among microorganisms are calculated from them. The weight calculation can be expressed by the following formula:
$$W_{ij}(t) = \mathrm{sigmoid}\left(\beta \cdot ED(Y_i(t), Y_j(t), S_i, S_j) + \gamma\right)$$
Where sigmoid is the activation function, $\beta$ is a weight adjustment parameter, $S_i$ and $S_j$ represent the evolution strategy parameters of microorganisms $i$ and $j$, respectively, $ED$ represents the evolutionary dynamics model describing the evolution of a microbial population, and $\gamma$ is a bias term. The formula was obtained through research and repeated experiments on evolutionary dynamics and reflects the interaction strength between microorganisms more accurately.
To construct the microbial interaction network, a threshold setting method based on an asymmetric information criterion is introduced. The method calculates the asymmetric information of the interaction network under different thresholds and selects the threshold that maximizes it as the optimal threshold. This process can be expressed by the following formula:
$$\varepsilon^* = \arg\max_{\varepsilon} \left(Af(\varepsilon) + \lambda \cdot ND(\varepsilon)\right)$$
Where $\varepsilon^*$ represents the optimal threshold for constructing the microbial interaction network, $\varepsilon$ is a threshold, $Af(\varepsilon)$ is the asymmetric information of the interaction network at threshold $\varepsilon$, $ND(\varepsilon)$ is the density of the network at threshold $\varepsilon$, and $\lambda$ is a weight parameter that balances the asymmetric information of the network against its density. The threshold determines the edges in the network, that is, the interactions between microorganisms. By maximizing the asymmetric information, a network can be constructed that reveals the true interactions between microorganisms.
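The threshold selection can be sketched as a scan over candidate thresholds. The text does not define the asymmetric information Af, so this example substitutes a hypothetical measure (the share of retained edges that have no reciprocal edge); the density is the usual directed-graph density, and the weight matrix and candidate list are illustrative.

```python
def build_edges(weights, eps):
    """Keep the directed edge (i, j) when W_ij >= eps; self-loops are ignored."""
    n = len(weights)
    return {(i, j) for i in range(n) for j in range(n)
            if i != j and weights[i][j] >= eps}


def density(edges, n):
    """ND: number of edges divided by the n * (n - 1) possible directed edges."""
    return len(edges) / (n * (n - 1)) if n > 1 else 0.0


def asymmetry(edges):
    """Hypothetical stand-in for Af: fraction of edges without a reverse edge."""
    if not edges:
        return 0.0
    return sum((j, i) not in edges for (i, j) in edges) / len(edges)


def best_threshold(weights, candidates, lam=0.5):
    """epsilon* = argmax over candidates of Af(eps) + lam * ND(eps)."""
    n = len(weights)

    def score(eps):
        edges = build_edges(weights, eps)
        return asymmetry(edges) + lam * density(edges, n)

    return max(candidates, key=score)


# Illustrative interaction weights W_ij for three microorganisms.
W = [
    [0.0, 0.9, 0.2],
    [0.3, 0.0, 0.8],
    [0.7, 0.1, 0.0],
]
print(best_threshold(W, [0.1, 0.5, 0.95]))  # 0.5
```

With these numbers the lowest threshold keeps every edge (fully reciprocal, so no asymmetry), the highest keeps none, and the middle one keeps a one-directional cycle, which scores highest.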
Based on the constructed microbial interaction network, a network optimization method based on multidimensional analysis is provided to reveal the multidimensional structure and functional characteristics of the microbial community. The multidimensional features of the network are extracted in different dimensions and a multidimensional network analysis is performed. Specifically, the network $N$ is projected onto each dimension $d$, giving the projection $N_d$ of the network in that dimension. Multidimensional data analysis theory is then applied to the network projection in each dimension to extract the multidimensional features of the network.
This process is expressed by a formula in which MDA is the multidimensional analysis result, used to reveal the multidimensional structure and functional characteristics of the microbial community; $\phi$ is the weight parameter of each dimension, which can be learned by an optimization algorithm; and DA is the analysis result of the network projection on dimension $d$. The microbial interaction network can therefore be used to study the structure and function of a microbial community in depth, and provides a theoretical basis and experimental data for microbial ecology and environmental science.
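Since the formula itself is not reproduced in this text, the sketch below assumes the simplest combination consistent with the symbol definitions: a weighted sum MDA = Σ_d φ_d · DA(N_d). The per-dimension analysis DA is replaced by a hypothetical stand-in (edge density of the projection), and the dimension names and weights are illustrative.

```python
def projection_density(adj):
    """Hypothetical stand-in for DA(N_d): density of a directed, loop-free projection."""
    n = len(adj)
    if n < 2:
        return 0.0
    edges = sum(1 for i in range(n) for j in range(n) if i != j and adj[i][j])
    return edges / (n * (n - 1))


def multidimensional_analysis(projections, phi, da=projection_density):
    """Assumed form: MDA = sum over dimensions d of phi_d * DA(N_d)."""
    if projections.keys() != phi.keys():
        raise ValueError("projections and weights must cover the same dimensions")
    return sum(phi[d] * da(projections[d]) for d in projections)


# Illustrative projections of a two-node network onto two hypothetical dimensions.
projections = {
    "carbon": [[0, 1], [1, 0]],
    "nitrogen": [[0, 1], [0, 0]],
}
phi = {"carbon": 0.5, "nitrogen": 0.5}
print(multidimensional_analysis(projections, phi))  # 0.5 * 1.0 + 0.5 * 0.5 = 0.75
```

Any other per-projection statistic (modularity, centrality distribution, and so on) can be passed as `da` without changing the aggregation step.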
In conclusion, the method for identifying microorganisms and analyzing interaction in soil is completed.
The technical scheme provided by the embodiment of the application at least has the following technical effects or advantages:
1. Quality control and filtering are carried out on the original sequence by setting up a sequence quality scoring function, and low-quality and non-target sequences are removed, so that the accuracy and reliability of the sequence are ensured; the sequences which are not matched with any known sequence are removed through comparison with a reference database, so that the accuracy of identification and classification is further ensured;
2. By analyzing the biological properties of the microbial DNA sequences in depth, a set of multidimensional features is extracted, so that the biological properties and functional characteristics of the microorganisms can be understood more comprehensively; the sequences are clustered by a fusion method based on a hybrid metaheuristic algorithm, and combining the simulated annealing algorithm with a fitness function based on information entropy allows microorganisms to be clustered and classified more efficiently and accurately;
3. By constructing a microbial interaction network, the interactions among microorganisms can be quantitatively analyzed and the multidimensional structure and functional characteristics of the microbial community revealed, which is of significance for understanding the structure and function of the soil microbial ecosystem; the network optimization method based on multidimensional analysis extracts the multidimensional features of the network in different dimensions and performs a multidimensional network analysis, thereby revealing the multidimensional structure and functional characteristics of the microbial community more comprehensively.
Effect investigation:
The technical scheme of the application addresses two shortcomings of the prior art. First, the prior art lacks the capability to analyze microbial DNA sequences in depth and cannot comprehensively extract and analyze their multidimensional features, so its understanding of the biological properties and functional characteristics of microorganisms is neither deep nor comprehensive. Second, the prior art lacks an effective method for quantitatively analyzing microbial interactions and does not make full use of multidimensional network analysis to reveal the multidimensional structure and functional characteristics of microbial communities, which limits in-depth understanding of the soil microbial ecosystem. Verification shows that the method identifies and analyzes microorganisms in soil and, through deep learning and multidimensional network analysis, reveals the multidimensional features, interactions and community structure of the microorganisms more accurately and comprehensively, providing a theoretical basis and experimental data for research in microbial ecology and environmental science.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.
Claims (9)
1. A method for identifying and analyzing interactions of microorganisms in soil, comprising the steps of:
S100: acquiring microbial DNA sequences, establishing a sequence quality scoring function, and performing sequence preprocessing;
S200: analyzing the biological properties of the microbial DNA sequences in depth, extracting sequence features, and clustering the sequences by a fusion method based on a hybrid metaheuristic algorithm;
S300: constructing a probabilistic graphical model, identifying microorganisms based on the DNA sequences, designing a microbial interaction network construction algorithm, and quantitatively analyzing the interactions among microorganisms.
2. The method for identifying and analyzing interactions among microorganisms in soil according to claim 1, wherein said step S100 comprises:
In the sequence preprocessing stage, a sequence quality scoring function is established that comprehensively considers the variability, purity and complexity of the sequence, wherein the variability score measures the degree of base variation in the sequence, the purity score measures whether impurity sequences are present in the sequence, and the complexity score measures the complexity of the sequence; after sequence preprocessing, the preprocessed sequences are compared with a reference database, and sequences that do not match any known sequence are removed.
3. The method for identifying and analyzing interactions among microorganisms in soil according to claim 1, wherein the step S200 specifically comprises:
Calculating the similarity between sequences based on in-depth analysis of the sequence features and comprehensive consideration of the weights of the different features; cosine similarity measures the angle between feature vectors and the weighted Euclidean distance measures the distance between feature vectors, and the two are fused to measure the similarity between sequences; the accuracy of the similarity calculation is further improved by optimizing the weights of the different features.
4. A method for identifying and analyzing interactions of microorganisms in soil according to claim 3, wherein said step S200 comprises:
In the optimization process, a fusion method based on a hybrid metaheuristic algorithm is adopted, which introduces a fitness function based on information entropy to measure the quality of a clustering scheme; a mutation operation based on neighborhood search is also introduced to enhance the search capability of the algorithm.
5. The method for identifying and analyzing interactions among microorganisms in soil according to claim 4, wherein said step S200 comprises:
Initial clustering is performed according to the similarity: a similarity threshold is established, and if the similarity of two sequences is smaller than the similarity threshold, the two sequences are assigned to the same class; a candidate clustering scheme $C'$ is randomly selected from the neighborhood of the initial clustering scheme $C$, the fitness values (that is, the information entropies) of the candidate scheme and the current scheme are calculated, and whether to accept the candidate clustering scheme $C'$ is determined according to the simulated annealing criterion.
6. The method for identifying and analyzing interactions among microorganisms in soil according to claim 5, wherein said step S200 comprises:
If the fitness value of the candidate clustering scheme $C'$ is higher than that of the initial clustering scheme $C$, or the acceptance probability criterion is met, the candidate clustering scheme $C'$ is accepted as the new clustering scheme; the acceptance probability criterion is as follows: if the fitness value of the candidate clustering scheme $C'$ is smaller than that of the initial clustering scheme $C$, the candidate clustering scheme $C'$ is accepted with probability $P_{accept}(C, C')$.
7. The method for identifying and analyzing interactions among microorganisms in soil according to claim 1, wherein said step S300 comprises:
In the probabilistic graphical model classification, the parent node set is obtained from an in-depth analysis of the interrelationships and dependencies among microorganisms, and a probabilistic graphical model is constructed to describe those interrelationships and dependencies; by learning the structure and parameters of the network, the parent node set of each node can be obtained.
8. The method for identifying and analyzing interactions among microorganisms in soil according to claim 7, wherein said step S300 comprises:
Learning a high-level representation of the data by using a deep learning model, the representation being capable of capturing complex patterns and structures in the data; the feature vector $X_i(t)$ of each microorganism at each time point $t$ is converted into a new representation $Y_i(t)$, and the model parameters $\theta_g$ are learned on a large amount of training data by an optimization algorithm to minimize the reconstruction error.
9. The method for identifying and analyzing interactions among microorganisms in soil according to claim 8, wherein said step S300 comprises:
Based on evolutionary dynamics, a dynamic weight calculation method is provided, in which the evolutionary dynamics of the microbial abundance data are analyzed using evolutionary game theory and dynamical systems theory, so that the dynamic interaction weights among microorganisms are calculated; to construct the microbial interaction network, a threshold setting method based on an asymmetric information criterion is introduced, the asymmetric information of the interaction network under different thresholds is calculated, and the threshold with the largest asymmetric information is selected as the optimal threshold; the threshold is used to determine the edges in the network, that is, the interactions between microorganisms; by maximizing the asymmetric information, a network can be constructed that reveals the true interactions between microorganisms;
Based on the constructed microbial interaction network, a network optimization method based on multidimensional analysis is provided to reveal the multidimensional structure and functional characteristics of the microbial community; the multidimensional features of the network are extracted in different dimensions and a multidimensional network analysis is performed, in which the network $N$ is projected onto each dimension $d$ to obtain the projection of the network in that dimension, and multidimensional data analysis theory is applied to the network projection in each dimension to extract the multidimensional features of the network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410160314.1A CN118072825A (en) | 2024-02-04 | 2024-02-04 | Method for identifying microorganisms in soil and analyzing interaction |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118072825A (en) | 2024-05-24 |
Family
ID=91098394
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410160314.1A Pending CN118072825A (en) | 2024-02-04 | 2024-02-04 | Method for identifying microorganisms in soil and analyzing interaction |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118072825A (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103544429A (en) * | 2012-07-12 | 2014-01-29 | 中国银联股份有限公司 | Anomaly detection device and method for security information interaction |
CN112272849A (en) * | 2018-03-22 | 2021-01-26 | 密歇根大学董事会 | Methods and apparatus for analyzing chromatin interaction data |
CN109508385A (en) * | 2018-11-06 | 2019-03-22 | 云南大学 | A kind of character relation analysis method in web page news data based on Bayesian network |
CN110533096A (en) * | 2019-08-27 | 2019-12-03 | 大连大学 | The DNA of multiverse algorithm based on K-means cluster stores Encoding Optimization |
KR20220062839A (en) * | 2020-11-09 | 2022-05-17 | 두에이아이(주) | Method for determining fetal fraction in maternal sample based on artificial intelligence |
CN115798600A (en) * | 2023-02-03 | 2023-03-14 | 北京灵迅医药科技有限公司 | Genome data analysis method, apparatus, device and storage medium |
CN117422881A (en) * | 2023-11-15 | 2024-01-19 | 山东衡昊信息技术有限公司 | Intelligent centralized security check graph judging system |
Non-Patent Citations (1)
Title |
---|
李莉;: "基于16SrRNA基因高通量测序聚类算法综述", 长春师范大学学报, no. 02, 20 February 2020 (2020-02-20) * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||