CN118072825A - Method for identifying microorganisms in soil and analyzing interaction

Info

Publication number: CN118072825A
Application number: CN202410160314.1A
Authority: CN
Inventors: 江晓
Original assignee: Shandong Henghao Information Technology Co ltd
Current assignee: Beijing Huaqing Kechuang Technology Service Co ltd
Priority date: 2024-02-04
Filing date: 2024-02-04
Publication date: 2024-05-24
Anticipated expiration: 2044-02-04
Also published as: CN118072825B

Abstract

The invention provides a method for identifying microorganisms in soil and analyzing their interactions. The method comprises: obtaining microbial DNA sequences, establishing a sequence quality scoring function, and preprocessing the sequences; analyzing the biological properties of the microbial DNA sequences in depth, extracting sequence features, and clustering the sequences with a fusion method based on a hybrid metaheuristic algorithm; and constructing a probabilistic graphical model, identifying microorganisms based on their DNA sequences, designing an algorithm for constructing a microbial interaction network, and quantitatively analyzing the interactions among microorganisms. The method addresses the problems that the prior art lacks the ability to analyze microbial DNA sequences in depth and cannot comprehensively extract and analyze the multidimensional characteristics of microorganisms. It can reveal the multidimensional characteristics, interactions and community structure of microorganisms more accurately and comprehensively, and provides an important theoretical basis and experimental data for research in microbial ecology and environmental science.

Description

A method for identifying microorganisms in soil and analyzing their interactions

Technical Field

The present invention relates to the field of microorganism identification, and in particular to a method for identifying microorganisms in soil and analyzing their interactions.

Background Art

Soil is a complex ecosystem in which microbial communities play a vital role and have a profound impact on soil health, fertility and ecological function. Microbial interactions, diversity and functionality are core topics of soil ecology research. However, because soil microorganisms are highly diverse and their interactions are complex, traditional microbiological research methods often fail to reveal the true state and interactions of microorganisms accurately and comprehensively.

With the development of molecular biology and computational biology, high-throughput sequencing has provided new tools for studying microbial communities: large amounts of microbial DNA sequence information can be obtained directly from environmental samples without cultivation. However, accurately identifying microorganisms from these massive data sets, analyzing their interactions, and revealing their functions in soil remains a major challenge.

Chinese patent application No. CN202110420635.7, published on 2021-11-16, discloses a method for identifying the key microorganisms for corn straw carbon assimilation in soil. It aims to solve the technical problem that the prior art cannot identify, at the species level, the key microbial groups for straw carbon assimilation or their succession patterns during straw decomposition. That invention is based on stable isotope probing: starting from the isolated 13C-labeled microbial DNA, it selects the representative sequence of each OTU for taxonomic analysis, statistically analyzes the community composition at each taxonomic level, and then performs α-/β-diversity and co-occurrence network analysis of the corn straw carbon-assimilating bacterial community. This identifies the key microbial groups for straw carbon assimilation and their community succession patterns, and thereby reveals the microbiological mechanism of straw decomposition in the target soil. On this basis, targeted and efficient corn straw decomposition agents can be developed to promote rapid straw decomposition, thereby improving the soil, saving fertilizer and increasing efficiency.

However, in the course of implementing the technical solutions of the embodiments of the present application, the inventor found that the above technology has at least the following technical problems: the prior art lacks the ability to analyze microbial DNA sequences in depth and cannot comprehensively extract and analyze the multidimensional characteristics of microorganisms, so that the biological properties and functional characteristics of microorganisms are not understood deeply or comprehensively enough; and it lacks effective methods for quantitatively analyzing microbial interactions and does not make full use of multidimensional network analysis to reveal the multidimensional structure and functional characteristics of microbial communities, which limits the understanding of soil microbial ecosystems.

Summary of the Invention

By providing a method for identifying microorganisms in soil and analyzing their interactions, the embodiments of the present application address the following problems: the prior art lacks the ability to analyze microbial DNA sequences in depth and cannot comprehensively extract and analyze the multidimensional characteristics of microorganisms, so that the biological properties and functional characteristics of microorganisms are not understood deeply or comprehensively enough; and the prior art lacks effective methods for quantitatively analyzing microbial interactions and does not make full use of multidimensional network analysis to reveal the multidimensional structure and functional characteristics of microbial communities, which limits the understanding of soil microbial ecosystems. The resulting method uses deep learning and multidimensional network analysis to reveal the multidimensional characteristics, interactions and community structure of microorganisms more accurately and comprehensively, and provides an important theoretical basis and experimental data for research in microbial ecology and environmental science.

The present application provides a method for identifying microorganisms in soil and analyzing their interactions, which specifically includes the following technical solutions:

A method for identifying microorganisms in soil and analyzing their interactions comprises the following steps:

S100: Obtain microbial DNA sequences, establish a sequence quality scoring function, and perform sequence preprocessing;

S200: Analyze the biological properties of the microbial DNA sequences in depth, extract sequence features, and cluster the sequences using a fusion method based on a hybrid metaheuristic algorithm;

S300: Construct a probabilistic graphical model, identify microorganisms based on their DNA sequences, design an algorithm for constructing a microbial interaction network, and quantitatively analyze the interactions between microorganisms.

Preferably, step S100 specifically includes:

In the sequence preprocessing stage, a sequence quality scoring function is established that jointly considers the variability, purity and complexity of a sequence. The variability score measures the degree of variation of the bases in the sequence, the purity score measures whether impurity sequences are present in the sequence, and the complexity score measures the complexity of the sequence. After preprocessing, the preprocessed sequences are aligned against a reference database, and sequences that do not match any known sequence are removed.

Preferably, step S200 specifically includes:

Based on an in-depth analysis of the sequence features and a combined consideration of the weights of the different features, the similarity between sequences is calculated. Since cosine similarity measures the angle between feature vectors and weighted Euclidean distance measures the distance between them, the two are fused to measure the similarity between sequences. The weights of the different features are optimized to further improve the accuracy of the similarity calculation.

In the optimization process, a fusion method based on a hybrid metaheuristic algorithm is adopted. This method introduces a fitness function based on information entropy to measure the quality of a clustering scheme, and a mutation operation based on neighborhood search to strengthen the search capability of the algorithm.

Initial clustering is performed according to similarity: a similarity threshold is set, and if the similarity between two sequences is less than the similarity threshold, they are assigned to the same cluster. A candidate clustering scheme C′ is randomly selected from the neighborhood of the initial clustering scheme C, the fitness values (that is, the information entropies) of the candidate scheme and the current initial scheme are calculated, and the simulated annealing criterion decides whether to accept the candidate clustering scheme C′.

If the fitness value of the candidate clustering scheme C′ is higher than that of the initial clustering scheme C, or the acceptance probability criterion is met, the candidate clustering scheme C′ is accepted as the new clustering scheme. The acceptance probability criterion is as follows: if the fitness value of the candidate clustering scheme C′ is smaller than that of the initial clustering scheme C, the candidate clustering scheme C′ is accepted with probability Paccept(C,C′).

Preferably, step S300 specifically includes:

In classification with the probabilistic graphical model, the parent node sets are obtained from an in-depth analysis of the relationships and dependencies among microorganisms: a probabilistic graphical model is constructed to describe these relationships and dependencies, and the parent node set of each node is obtained by learning the structure and parameters of the network.

A deep learning model is used to learn a high-level representation of the data that captures its complex patterns and structures. The feature vector X_i(t) of each microorganism at each time point t is converted into a new representation Y_i(t), and the model parameters θ_g are learned from a large amount of training data by an optimization algorithm so as to minimize the reconstruction error.

Preferably, step S300 specifically includes:

Based on evolutionary dynamics, a dynamic weight calculation method is proposed. Evolutionary game theory and dynamical systems theory are used to analyze the evolutionary dynamics of the microbial abundance data and thereby calculate the dynamic interaction weights between microorganisms. To construct the microbial interaction network, a threshold-setting method based on an asymmetric information criterion is introduced. This method calculates the asymmetric information of the interaction network under different thresholds and selects the threshold with the largest asymmetric information as the optimal threshold. The threshold determines the edges of the network, that is, the interactions between microorganisms. By maximizing the asymmetric information, a network that reveals the true interactions between microorganisms can be constructed.

Based on the constructed microbial interaction network, a network optimization method based on multidimensional analysis is proposed to reveal the multidimensional structure and functional characteristics of the microbial community. The multidimensional characteristics of the network are extracted under different dimensions and a multidimensional network analysis is performed: the network N is projected onto each dimension d to obtain its projection in that dimension, and multidimensional data analysis theory is applied to each projection to extract the multidimensional characteristics of the network.

Beneficial effects:

The technical solutions provided in the embodiments of the present application have at least the following technical effects or advantages:

1. By establishing a sequence quality scoring function, the raw sequences are quality-controlled and filtered to remove low-quality and non-target sequences, which ensures the accuracy and reliability of the sequences. By aligning against a reference database and removing sequences that do not match any known sequence, the accuracy of identification and classification is further ensured.

2. By analyzing the biological properties of microbial DNA sequences in depth and extracting a set of multidimensional features, the biological properties and functional characteristics of microorganisms can be understood more comprehensively and deeply. By clustering the sequences with a fusion method based on a hybrid metaheuristic algorithm, which combines a simulated annealing algorithm with a fitness function based on information entropy, microorganisms can be clustered and classified more efficiently and accurately.

3. By constructing a microbial interaction network, the interactions between microorganisms can be analyzed quantitatively and the multidimensional structure and functional characteristics of the microbial community can be revealed, which is of great significance for understanding the structure and function of the soil microbial ecosystem. A network optimization method based on multidimensional analysis is proposed, which extracts the multidimensional characteristics of the network under different dimensions and performs multidimensional network analysis, thereby revealing the multidimensional structure and functional characteristics of the microbial community more comprehensively and deeply.

4. The technical solution of the present application effectively addresses the problems that the prior art lacks the ability to analyze microbial DNA sequences in depth and cannot comprehensively extract and analyze the multidimensional characteristics of microorganisms, so that the biological properties and functional characteristics of microorganisms are not understood deeply or comprehensively enough; and that it lacks effective methods for quantitatively analyzing microbial interactions and does not make full use of multidimensional network analysis to reveal the multidimensional structure and functional characteristics of microbial communities, which limits the understanding of soil microbial ecosystems. In addition, the above system or method has undergone a series of effect investigations and verification. The resulting method uses deep learning and multidimensional network analysis to reveal the multidimensional characteristics, interactions and community structure of microorganisms more accurately and comprehensively, and provides an important theoretical basis and experimental data for research in microbial ecology and environmental science.

Brief Description of the Drawings

FIG. 1 is a flow chart of the method for identifying microorganisms in soil and analyzing their interactions described in the present application.

Detailed Description

By providing a method for identifying microorganisms in soil and analyzing their interactions, the embodiments of the present application address the following problems: the prior art lacks the ability to analyze microbial DNA sequences in depth and cannot comprehensively extract and analyze the multidimensional characteristics of microorganisms, so that the biological properties and functional characteristics of microorganisms are not understood deeply or comprehensively enough; and the prior art lacks effective methods for quantitatively analyzing microbial interactions and does not make full use of multidimensional network analysis to reveal the multidimensional structure and functional characteristics of microbial communities, which limits the understanding of soil microbial ecosystems.

To solve the above problems, the general idea of the technical solution in the embodiments of the present application is as follows:

By establishing a sequence quality scoring function, the raw sequences are quality-controlled and filtered to remove low-quality and non-target sequences, which ensures the accuracy and reliability of the sequences. By aligning against a reference database and removing sequences that do not match any known sequence, the accuracy of identification and classification is further ensured. By analyzing the biological properties of microbial DNA sequences in depth and extracting a set of multidimensional features, the biological properties and functional characteristics of microorganisms can be understood more comprehensively and deeply. The sequences are clustered with a fusion method based on a hybrid metaheuristic algorithm, which combines a simulated annealing algorithm with a fitness function based on information entropy, so that microorganisms can be clustered and classified more efficiently and accurately. By constructing a microbial interaction network, the interactions between microorganisms can be analyzed quantitatively and the multidimensional structure and functional characteristics of the microbial community can be revealed, which is of great significance for understanding the structure and function of the soil microbial ecosystem. A network optimization method based on multidimensional analysis is proposed, which extracts the multidimensional characteristics of the network under different dimensions and performs multidimensional network analysis, thereby revealing the multidimensional structure and functional characteristics of the microbial community more comprehensively and deeply.

For a better understanding of the above technical solution, it is described in detail below with reference to the accompanying drawings and specific embodiments.

Referring to FIG. 1, the method for identifying microorganisms in soil and analyzing their interactions described in the present application comprises the following steps:

Soil samples are collected and microbial DNA is extracted from them. Microbial DNA sequences are obtained using high-throughput sequencing. After the raw sequences are obtained, sequence preprocessing is performed: the raw sequences are quality-controlled and filtered to remove low-quality and non-target sequences.

In the sequence preprocessing stage, a sequence quality scoring function is established that jointly considers the variability, purity and complexity of a sequence. The formula is:

Q(s_i) = η·V(s_i) + θ·P(s_i) + ι·H(s_i)

Here, Q is the sequence quality score, s_i is a raw sequence, and η, θ and ι are weight parameters that adjust the contributions of the variability score V, the purity score P and the complexity score H to the total score. The variability score V(s_i) measures the degree of variation of the bases in the sequence. The variability score function is defined as:

Here, N is the number of base types in the sequence; for DNA sequences, N = 4 (A, T, C, G, that is, adenine, thymine, cytosine and guanine). n_k is the number of bases of the k-th type in the sequence, L is the length of the sequence, and the remaining two quantities are the mean of the base counts and their variance σ². The variability score function analyzes the base distribution statistically and introduces an exponential term to measure the variability of the sequence more precisely. The purity score P(s_i) measures whether impurity sequences are present in the sequence. The purity score function is defined as:

Here, M is the number of possible impurity sequences, p_m is the proportion of the m-th type of impurity sequence in the sequence, and λ is a weight parameter. The purity score function combines the entropy of information theory with the Gini coefficient to measure the purity of the sequence more comprehensively. The complexity score H(s_i) measures the complexity of the sequence. The complexity score function is defined as:

Here, L is the length of the sequence, f_l is the base frequency at the l-th position in the sequence, and φ is a weight parameter.

The variability score, purity score and complexity score are all normalized to the range [0,1]; a larger value indicates higher variability, purity or complexity. A quality score threshold is set according to the experimental purpose and the data quality requirements. Only sequences whose quality score is above this threshold are retained, and sequences below it are removed. For example, if the threshold is set to 0.7, all sequences with a quality score below 0.7 are removed. The remaining sequences are then filtered by length to remove sequences that are too long or too short. These scores allow the quality of the sequences to be evaluated more accurately and low-quality sequences to be filtered out effectively, so as to ensure the accuracy of the subsequent alignment and classification.
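The scoring and filtering steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the component scores V, P and H are assumed to be computed elsewhere and already normalized to [0,1], and the weight values, the length bounds and the names `ScoredSequence`, `quality_score` and `filter_sequences` are hypothetical. Only the 0.7 threshold comes from the example in the text.

```python
from dataclasses import dataclass


@dataclass
class ScoredSequence:
    seq: str
    variability: float  # V(s_i), assumed precomputed and normalized to [0, 1]
    purity: float       # P(s_i), assumed precomputed and normalized to [0, 1]
    complexity: float   # H(s_i), assumed precomputed and normalized to [0, 1]


def quality_score(s, eta=0.3, theta=0.4, iota=0.3):
    """Q(s_i) = eta*V(s_i) + theta*P(s_i) + iota*H(s_i); weights are illustrative."""
    return eta * s.variability + theta * s.purity + iota * s.complexity


def filter_sequences(seqs, threshold=0.7, min_len=4, max_len=1000):
    """Drop sequences scoring below the threshold, then apply a length filter.

    The length bounds are placeholder values chosen for this sketch.
    """
    kept = []
    for s in seqs:
        if quality_score(s) < threshold:
            continue  # low-quality sequence
        if not (min_len <= len(s.seq) <= max_len):
            continue  # too short or too long
        kept.append(s)
    return kept
```

With equal component scores of 0.9 a sequence passes the 0.7 threshold, while component scores of 0.5 or a length outside the bounds cause it to be removed.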

After preprocessing, the preprocessed sequences are aligned against a reference database, and sequences that do not match any known sequence are removed. A DNA-sequence-based microorganism identification algorithm is then applied: features are extracted from the DNA sequences of the microorganisms in the soil sample to identify and classify them accurately, which addresses the problem of accurately assessing soil microbial diversity.

A set of multidimensional features is extracted by analyzing the biological properties of the microbial DNA sequences in depth, for example the GC content, base frequencies and k-mer frequencies of a sequence. The feature vector is defined as:

F(s_i) = [GC(s_i), Freq_A(s_i), Freq_T(s_i), Freq_C(s_i), Freq_G(s_i), …]

Here, F(s_i) is the feature vector of sequence s_i, and GC(s_i), Freq_A(s_i), Freq_T(s_i), Freq_C(s_i) and Freq_G(s_i) are the GC content of sequence s_i and the frequencies of its individual bases. Each feature is derived from biological knowledge. For example, the GC content is calculated as:

GC(s_i) = (Count_G(s_i) + Count_C(s_i)) / Length(s_i)

Here, Count_G(s_i) and Count_C(s_i) are the numbers of G and C bases in sequence s_i, and Length(s_i) is the length of sequence s_i.
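A minimal sketch of the feature extraction described above is given below. The order of the first five entries follows the feature vector F(s_i) in the text; the choice of k = 2 for the k-mer frequencies and the function names are assumptions made for illustration.

```python
from collections import Counter
from itertools import product

BASES = "ATCG"  # same order as Freq_A, Freq_T, Freq_C, Freq_G in F(s_i)


def gc_content(seq):
    """GC(s_i) = (Count_G + Count_C) / Length."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def base_frequencies(seq):
    """Relative frequency of each base, in the order A, T, C, G."""
    seq = seq.upper()
    return [seq.count(b) / len(seq) for b in BASES]


def kmer_frequencies(seq, k=2):
    """Relative frequency of every possible k-mer over the alphabet A, T, C, G."""
    seq = seq.upper()
    total = len(seq) - k + 1
    if total <= 0:
        raise ValueError("sequence is shorter than k")
    counts = Counter(seq[i:i + k] for i in range(total))
    return [counts["".join(p)] / total for p in product(BASES, repeat=k)]


def feature_vector(seq, k=2):
    """F(s_i) = [GC, Freq_A, Freq_T, Freq_C, Freq_G, k-mer frequencies...]."""
    return [gc_content(seq)] + base_frequencies(seq) + kmer_frequencies(seq, k)
```

For k = 2 the resulting vector has 1 + 4 + 16 = 21 entries; larger k values grow the vector as 4^k.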

After feature extraction, the similarity between sequences is calculated based on an in-depth analysis of the sequence features and a combined consideration of the weights of the different features. Cosine similarity measures the angle between feature vectors, while weighted Euclidean distance measures the distance between them, so fusing the two measures the similarity between sequences more comprehensively. The weights of the different features are optimized to further improve the accuracy of the similarity calculation. The similarity is therefore calculated as:

Here, S(s_i,s_k) is the similarity between sequence s_i and sequence s_k, ω is the weight parameter balancing the cosine similarity and the weighted Euclidean distance, ω_m is the weight parameter of the m-th feature, and φ_m(s_i) is the m-th feature of sequence s_i.
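The exact fusion formula is not reproduced in this text, so the sketch below uses one plausible form as an assumption: S = ω·cos(F_i, F_k) + (1 − ω)·1/(1 + d_w), where d_w is the Euclidean distance with per-feature weights ω_m. The mapping 1/(1 + d_w), which turns a distance into a similarity, is a choice made for this illustration and is not taken from the source.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def weighted_euclidean(u, v, weights):
    """sqrt(sum_m w_m * (u_m - v_m)^2) with per-feature weights w_m."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, u, v)))


def fused_similarity(u, v, weights, omega=0.5):
    """Assumed fusion: omega * cosine + (1 - omega) / (1 + weighted distance)."""
    cos = cosine_similarity(u, v)
    dist = weighted_euclidean(u, v, weights)
    return omega * cos + (1 - omega) / (1 + dist)
```

Under this assumed form, identical feature vectors score 1 and the score falls as the angle or the weighted distance between the vectors grows.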

The sequences are clustered on the basis of the similarity calculation. For dynamic clustering optimization, a fusion method based on a hybrid metaheuristic algorithm is adopted. This method has the advantages of the simulated annealing algorithm and additionally introduces a fitness function based on information entropy to measure the quality of a clustering scheme. A mutation operation based on neighborhood search is introduced to strengthen the search capability of the algorithm.

Specifically, initial clustering is first performed according to similarity: a similarity threshold is set, and if the similarity between two sequences is less than the similarity threshold, they are assigned to the same cluster. A candidate clustering scheme C′ is randomly selected from the neighborhood of the initial clustering scheme C, the fitness values (that is, the information entropies) of the candidate scheme and the current initial scheme are calculated, and the simulated annealing criterion decides whether to accept the candidate clustering scheme C′. If the fitness value of the candidate clustering scheme C′ is higher than that of the initial clustering scheme C, or the acceptance probability criterion is met, the candidate clustering scheme C′ is accepted as the new clustering scheme. The acceptance probability criterion is as follows: if the fitness value of the candidate clustering scheme C′ is smaller than that of the initial clustering scheme C, the candidate clustering scheme C′ is accepted with probability Paccept(C,C′).

The information entropy fitness function takes the sequence distribution within each cluster into account in finer detail: the occurrence frequency of each sequence in each cluster is introduced to obtain a more precise fitness value. The fitness function is expressed as:

Here, E(C) is the information entropy of clustering scheme C, which measures the quality of the clustering scheme; c is a cluster in C containing multiple sequences s_i; and p(s_i|c) is the probability that sequence s_i occurs in cluster c, which can be calculated by dividing the frequency of the sequence in the cluster by the total number of sequences in the cluster. The neighborhood search formula is:

C′ = argmin_{C′∈N(C,O)} E(C′)

Here, C′ denotes the new clustering scheme obtained by neighborhood search; N(C,O) denotes the neighborhood of clustering scheme C obtained through a series of neighborhood operations O, where O includes combinations of operations such as exchange, insertion and inversion; and E(C′) denotes the information entropy of the new clustering scheme C′. The acceptance probability formula of simulated annealing is:

Here, Paccept(C,C′) denotes the probability of accepting a transition from the current clustering scheme C to the new clustering scheme C′; H denotes the external field strength and M the magnetic susceptibility, two parameters that regulate the acceptance probability and thereby give finer control over the search process; and T denotes the current temperature.
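The clustering loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented formulas: the entropy is assumed to take the Shannon form E(C) = −Σ_{c∈C} Σ_{s_i∈c} p(s_i|c)·log p(s_i|c), lower entropy is treated as the better scheme (following the argmin in the neighborhood search formula), and the standard Metropolis rule exp(−ΔE/T) stands in for Paccept, so the parameters H and M are not modeled. The names `scheme_entropy`, `random_neighbor`, `accept` and `anneal` are hypothetical.

```python
import math
import random
from collections import Counter


def cluster_entropy(cluster):
    """Shannon entropy of the sequence frequencies inside one cluster."""
    n = len(cluster)
    if n == 0:
        return 0.0
    # p(s_i|c) = count of s_i in the cluster / total sequences in the cluster
    return -sum((k / n) * math.log(k / n) for k in Counter(cluster).values())


def scheme_entropy(scheme):
    """E(C): sum of per-cluster entropies (assumed form, see lead-in)."""
    return sum(cluster_entropy(c) for c in scheme)


def random_neighbor(scheme, rng):
    """One 'insertion' move: relocate a random sequence to another cluster.

    Requires at least two clusters and at least one sequence.
    """
    new = [list(c) for c in scheme]
    sources = [i for i, c in enumerate(new) if c]
    src = rng.choice(sources)
    dst = rng.choice([i for i in range(len(new)) if i != src])
    new[dst].append(new[src].pop(rng.randrange(len(new[src]))))
    return new


def accept(e_cur, e_new, temperature, rng):
    """Metropolis rule (assumption): always accept a scheme that is not worse,
    otherwise accept with probability exp(-(e_new - e_cur) / T)."""
    if e_new <= e_cur:
        return True
    return rng.random() < math.exp(-(e_new - e_cur) / temperature)


def anneal(scheme, t0=1.0, cooling=0.95, steps=500, seed=0):
    """Simulated annealing over clustering schemes; returns the best one seen."""
    rng = random.Random(seed)
    cur, e_cur, t = scheme, scheme_entropy(scheme), t0
    best, e_best = cur, e_cur
    for _ in range(steps):
        cand = random_neighbor(cur, rng)
        e_cand = scheme_entropy(cand)
        if accept(e_cur, e_cand, t, rng):
            cur, e_cur = cand, e_cand
            if e_cur < e_best:
                best, e_best = cur, e_cur
        t = max(t * cooling, 1e-9)  # geometric cooling, kept strictly positive
    return best, e_best


if __name__ == "__main__":
    start = [["ACGT", "TTGA"], ["ACGT", "TTGA"]]
    print(anneal(start))
```

Occasionally accepting a worse scheme while the temperature is high is what lets the search leave local optima; as T falls, the rule approaches a pure greedy descent.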

Microorganism identification and classification are performed on the basis of the clustering. In probabilistic graphical model classification, the parent node sets are obtained from an in-depth analysis of the relationships and dependencies between microorganisms; constructing a probabilistic graphical model allows these relationships and dependencies to be described more accurately. For example, a Bayesian network can be constructed in which nodes represent microbial categories and edges represent dependencies between microorganisms. By learning the structure and parameters of the network, the parent node set of each node is obtained. The formula for probabilistic graphical model classification is therefore:

Here, P(R|C′) denotes the probability of the microbial category set R given clustering scheme C′; r_i denotes the microbial category of sequence s_i; P(r_i|Pa(r_i)) denotes the conditional probability of node r_i given its parent nodes Pa(r_i); and Pa(r_i) denotes the parent node set of category r_i. This achieves accurate identification and classification of microorganisms and offers a new perspective on microbial diversity and richness.
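The definitions above match the usual Bayesian network factorization, P(R|C′) = Π_i P(r_i|Pa(r_i)); that product form is an assumption here, inferred from the symbol definitions. The sketch below evaluates it for a two-node network. The node names, category labels and probability values are made up for illustration, and `joint_probability` is a hypothetical helper.

```python
def joint_probability(assignment, parents, cpt):
    """Multiply P(r_i | Pa(r_i)) over all nodes of a Bayesian network.

    assignment: node -> category label
    parents:    node -> tuple of parent nodes
    cpt:        node -> {tuple of parent labels: {label: probability}}
    """
    p = 1.0
    for node, label in assignment.items():
        parent_labels = tuple(assignment[q] for q in parents[node])
        p *= cpt[node][parent_labels][label]
    return p


# Hypothetical two-node network: the category of r2 depends on that of r1.
parents = {"r1": (), "r2": ("r1",)}
cpt = {
    "r1": {(): {"A": 0.6, "B": 0.4}},
    "r2": {("A",): {"A": 0.7, "B": 0.3},
           ("B",): {"A": 0.2, "B": 0.8}},
}

if __name__ == "__main__":
    print(joint_probability({"r1": "A", "r2": "B"}, parents, cpt))
```

Identification then amounts to choosing the category assignment with the highest joint probability; for networks of realistic size this search would need a proper inference algorithm rather than enumeration.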

A microbial interaction network construction algorithm is designed to quantitatively analyze the interactions between microorganisms. The input microbial feature data are preprocessed with a deep generative model, which converts the microbial abundance data into a new representation that reveals latent nonlinear structure in the data. Specifically, a deep learning model learns a high-level representation of the data that captures its complex patterns and structures. The feature vector X_i(t) of each microorganism at each time point t is converted into a new representation Y_i(t), and the model parameters θ_g are learned by an optimization algorithm on a large amount of training data so as to minimize the reconstruction error. The formula is:

Y_i(t) = DeepGen(X_i(t); θ_g)

Here, DeepGen denotes the deep generative model.

Based on evolutionary dynamics, a dynamic weight calculation method is proposed. Evolutionary game theory and dynamical systems theory are used to analyze the evolutionary dynamics of the microbial abundance data and thereby calculate the dynamic interaction weights between microorganisms. The weight calculation is expressed as:

W_ij(t) = sigmoid(β·ED(Y_i(t), Y_j(t), S_i, S_j) + γ)

Here, sigmoid is the activation function; β is the weight adjustment parameter; S_i and S_j denote the evolutionary strategy parameters of microorganisms i and j, respectively; ED denotes the evolutionary dynamics model, which describes the evolutionary process of the microbial population; and γ is the bias term. This formula was obtained through in-depth study of evolutionary dynamics and repeated experiments, and it reflects the interaction strength between microorganisms more accurately.
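The weight formula can be evaluated as sketched below. The text does not specify the form of ED, so it is passed in as a function; `toy_ed` is a purely hypothetical placeholder (a strategy-scaled dot product of the two representations) used only to make the example runnable, and the default values of β and γ are likewise arbitrary.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function, avoids overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def toy_ed(y_i, y_j, s_i, s_j):
    """Hypothetical stand-in for the evolutionary dynamics model ED."""
    return s_i * s_j * sum(a * b for a, b in zip(y_i, y_j))


def interaction_weight(y_i, y_j, s_i, s_j, ed=toy_ed, beta=1.0, gamma=0.0):
    """W_ij(t) = sigmoid(beta * ED(Y_i(t), Y_j(t), S_i, S_j) + gamma)."""
    return sigmoid(beta * ed(y_i, y_j, s_i, s_j) + gamma)


if __name__ == "__main__":
    print(interaction_weight([1.0, 0.0], [1.0, 0.0], s_i=1.0, s_j=1.0))
```

Because the sigmoid maps any real value into the interval from 0 to 1, the resulting weights are directly comparable against a single threshold when the network edges are chosen.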

To construct the microbial interaction network, a threshold setting method based on an asymmetric information criterion is introduced. The method calculates the asymmetric information of the interaction network under different thresholds and selects the threshold that maximizes it as the optimal threshold. This process is expressed as:

ε^* = argmax_ε (Af(ε) + λ·ND(ε))

Here, ε^* denotes the optimal threshold used to construct the microbial interaction network; ε is a threshold; Af(ε) is the asymmetric information of the interaction network at threshold ε; ND(ε) is the density of the network at threshold ε; and λ is a weight parameter that balances the asymmetric information against the network density. The threshold determines the edges of the network, that is, the interactions between microorganisms. By maximizing the asymmetric information, a network that reveals the true interactions between microorganisms can be constructed.
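A grid search over candidate thresholds is one simple way to evaluate this argmax. The text does not define how Af is computed, so the sketch below assumes a stand-in: the fraction of directed edges that are not reciprocated. ND is taken as the usual directed-graph density. The function names, the weight matrix and the candidate thresholds are all hypothetical.

```python
def edges_at(weights, eps):
    """Directed edges (i, j) whose weight W_ij is at least eps."""
    n = len(weights)
    return {(i, j) for i in range(n) for j in range(n)
            if i != j and weights[i][j] >= eps}


def density(edges, n):
    """ND: number of edges divided by the n*(n-1) possible directed edges."""
    return len(edges) / (n * (n - 1)) if n > 1 else 0.0


def asymmetry(edges):
    """Assumed stand-in for Af: share of edges with no reverse edge."""
    if not edges:
        return 0.0
    return sum((j, i) not in edges for (i, j) in edges) / len(edges)


def best_threshold(weights, candidates, lam=0.5, af=asymmetry):
    """Return the candidate eps maximizing Af(eps) + lam * ND(eps)."""
    n = len(weights)

    def score(eps):
        e = edges_at(weights, eps)
        return af(e) + lam * density(e, n)

    return max(candidates, key=score)


# Hypothetical 3x3 interaction weight matrix W_ij.
W = [[0.0, 0.9, 0.2],
     [0.4, 0.0, 0.8],
     [0.1, 0.3, 0.0]]

if __name__ == "__main__":
    print(best_threshold(W, [0.25, 0.5, 0.85]))
```

The density term keeps the search from drifting toward very high thresholds, where almost every surviving edge is trivially unreciprocated but the network is nearly empty.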

Based on the constructed microbial interaction network, a network optimization method based on multidimensional analysis is proposed to reveal the multidimensional structure and functional characteristics of the microbial community. The multidimensional characteristics of the network are extracted under different dimensions, and multidimensional network analysis is performed. Specifically, the network N is projected onto each dimension d to obtain its projection N_d on that dimension. Multidimensional data analysis theory is then applied to the network projection on each dimension to extract the multidimensional characteristics of the network. This process is expressed as:

Here, MDA is the multidimensional analysis result, which reveals the multidimensional structure and functional characteristics of the microbial community; φ is the weight parameter of each dimension, which can be learned by an optimization algorithm; and DA is the analysis result of the network projection on dimension d. The microbial interaction network can thus be used for in-depth study of the structure and function of microbial communities, providing an important theoretical basis and experimental data for microbial ecology and environmental science.
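Read together, these definitions suggest a weighted sum over dimensions, MDA = Σ_d φ_d·DA(N_d); that form is an assumption inferred from the symbol definitions. In the sketch below a "dimension" is assumed to be an edge label, the projection N_d keeps only the edges carrying that label, and DA is assumed to be the mean degree of the projected network. All names, labels and weights are hypothetical.

```python
def project(edges, dimension):
    """N_d: keep only the edges labeled with the given dimension."""
    return [(u, v) for (u, v, d) in edges if d == dimension]


def mean_degree(projection):
    """Assumed DA: average node degree of an undirected projection."""
    nodes = {n for edge in projection for n in edge}
    return 2 * len(projection) / len(nodes) if nodes else 0.0


def mda(edges, phi, da=mean_degree):
    """Weighted sum of per-dimension analysis results (assumed form)."""
    return sum(weight * da(project(edges, d)) for d, weight in phi.items())


# Hypothetical labeled network with two dimensions.
edges = [("a", "b", "competition"),
         ("b", "c", "competition"),
         ("a", "c", "cooperation")]
phi = {"competition": 0.7, "cooperation": 0.3}

if __name__ == "__main__":
    print(mda(edges, phi))
```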

In summary, this completes the method for identifying microorganisms in soil and analyzing their interactions described in this application.

The technical solutions in the above embodiments of the present application have at least the following technical effects or advantages:

1. By establishing a sequence quality scoring function, the raw sequences undergo quality control and filtering that removes low-quality and non-target sequences, which ensures the accuracy and reliability of the sequences; by comparing against a reference database and eliminating sequences that match no known sequence, the accuracy of identification and classification is further ensured.

2. By deeply analyzing the biological properties of microbial DNA sequences and extracting a set of multidimensional features, the biological properties and functional characteristics of microorganisms can be understood more comprehensively and in greater depth; by clustering the sequences with a fusion method based on a hybrid metaheuristic algorithm, which combines a simulated annealing algorithm with a fitness function based on information entropy, microorganisms can be clustered and classified more efficiently and accurately.

Effect investigation:

The technical solution of this application effectively addresses the following shortcomings of the prior art. First, the prior art lacks the ability to analyze microbial DNA sequences in depth and cannot comprehensively extract and analyze the multidimensional features of microorganisms, so the understanding of their biological properties and functional characteristics is neither deep nor complete. Second, it lacks an effective method for quantitatively analyzing microbial interactions and does not make full use of multidimensional network analysis to reveal the multidimensional structure and functional characteristics of microbial communities, which limits in-depth understanding of soil microbial ecosystems. In addition, the above system or method has undergone a series of effect investigations and verification, resulting in an advanced method for identifying microorganisms in soil and analyzing their interactions. Through deep learning and multidimensional network analysis, it reveals the multidimensional characteristics, interactions and community structure of microorganisms more accurately and comprehensively, and provides an important theoretical basis and experimental data for research in microbial ecology and environmental science.

The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to embodiments of the present invention. It should be understood that each process and/or block in the flowcharts and/or block diagrams, and any combination of processes and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is executed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.

Although preferred embodiments of the present invention have been described, those skilled in the art may make further changes and modifications to these embodiments once they understand the basic inventive concept. The appended claims are therefore intended to be interpreted as covering the preferred embodiments and all changes and modifications that fall within the scope of the present invention.

Obviously, those skilled in the art can make various changes and variations to the present invention without departing from its spirit and scope. If such changes and variations fall within the scope of the claims of the present invention and their technical equivalents, the present invention is intended to include them as well.

Claims

1. A method for identifying microorganisms in soil and analyzing their interactions, comprising the following steps:

S100: Obtain microbial DNA sequences, establish sequence quality scoring functions, and perform sequence preprocessing;

S200: Analyze the biological properties of the microbial DNA sequences in depth, extract sequence features, and cluster the sequences using a fusion method based on a hybrid metaheuristic algorithm;

S300: Build probabilistic graphical models, identify microorganisms based on DNA sequences, design algorithms for building microbial interaction networks, and quantitatively analyze interactions between microorganisms.

2. The method for identifying microorganisms in soil and analyzing their interactions according to claim 1, wherein step S100 specifically comprises:

In the sequence preprocessing stage, a sequence quality scoring function is established that jointly considers the variability, purity and complexity of a sequence: the variability score measures the degree of variation of the bases in the sequence, the purity score measures whether impurity sequences are present in the sequence, and the complexity score measures the complexity of the sequence. After sequence preprocessing, the preprocessed sequences are compared with a reference database, and sequences that match no known sequence are eliminated.

3. The method for identifying and analyzing microorganisms in soil and their interactions according to claim 1, wherein step S200 specifically comprises:

Based on the in-depth analysis of sequence features and comprehensive consideration of the weights of different features, the similarity between sequences is further calculated; considering that cosine similarity can measure the angle between feature vectors, and weighted Euclidean distance can measure the distance between feature vectors, the two are combined to measure the similarity between sequences; by optimizing the weights of different features, the accuracy of similarity calculation is further improved.

4. The method for identifying microorganisms in soil and analyzing their interactions according to claim 3, wherein step S200 specifically comprises:

During the optimization process, a fusion method based on a hybrid metaheuristic algorithm is adopted. The fusion method based on the hybrid metaheuristic algorithm introduces a fitness function based on information entropy to measure the pros and cons of clustering schemes; and also introduces a mutation operation based on neighborhood search to enhance the search ability of the algorithm.

5. The method for identifying and analyzing microorganisms in soil and their interactions according to claim 4, wherein step S200 specifically comprises:

Perform initial clustering based on similarity and set a similarity threshold. If the similarity between two sequences is less than the similarity threshold, they are classified into the same category. Randomly select a candidate clustering scheme C′ from the neighborhood of the initial clustering scheme C, calculate the fitness values of the candidate clustering scheme and the current initial clustering scheme, that is, their information entropy, and decide whether to accept the candidate clustering scheme C′ based on the simulated annealing criterion.

6. The method for identifying microorganisms in soil and analyzing their interactions according to claim 5, wherein step S200 specifically comprises:

If the fitness value of the candidate clustering scheme C′ is higher than that of the initial clustering scheme C, or the acceptance probability criterion is met, the candidate clustering scheme C′ is accepted as the new clustering scheme; the acceptance probability criterion is as follows: if the fitness value of the candidate clustering scheme C′ is smaller than that of the initial clustering scheme C, the candidate clustering scheme C′ is accepted with probability Paccept(C, C′).

7. The method for identifying microorganisms in soil and analyzing their interactions according to claim 1, wherein step S300 specifically comprises:

In the probabilistic graphical model classification, the parent node set is obtained based on an in-depth analysis of the relationships and dependencies between microorganisms, and a probabilistic graphical model is constructed to describe the relationships and dependencies between microorganisms; by learning the structure and parameters of the network, the parent node set of each node can be obtained.

8. The method for identifying microorganisms in soil and analyzing their interactions according to claim 7, wherein step S300 specifically comprises:

A deep learning model is used to learn a high-level representation of the data that captures the complex patterns and structures in the data; the feature vector X_i(t) of each microorganism at each time point t is converted into a new representation Y_i(t), and the model parameters θ_g are learned by an optimization algorithm on a large amount of training data so as to minimize the reconstruction error.

9. The method for identifying and analyzing microorganisms in soil and their interactions according to claim 8, wherein step S300 specifically comprises:

Based on evolutionary dynamics, a dynamic weight calculation method is proposed, in which evolutionary game theory and dynamical systems theory are used to analyze the evolutionary dynamics of the microbial abundance data and thereby calculate the dynamic interaction weights between microorganisms. To construct the microbial interaction network, a threshold setting method based on an asymmetric information criterion is introduced, which calculates the asymmetric information of the interaction network under different thresholds and selects the threshold that maximizes it as the optimal threshold. The threshold determines the edges of the network, that is, the interactions between microorganisms. By maximizing the asymmetric information, a network that reveals the true interactions between microorganisms can be constructed.

Based on the constructed microbial interaction network, a network optimization method based on multidimensional analysis is proposed to reveal the multidimensional structure and functional characteristics of the microbial community. The multidimensional characteristics of the network are extracted under different dimensions, and multidimensional network analysis is performed. The network N is projected onto each dimension d to obtain the projection of the network on that dimension, and multidimensional data analysis theory is applied to the network projection on each dimension to extract the multidimensional characteristics of the network.