CN116895328A - Evolution event detection method and system for modularized gene structure - Google Patents

Info

Publication number
CN116895328A
Authority
CN
China
Prior art keywords
node
gene
state
structure module
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311150502.8A
Other languages
Chinese (zh)
Other versions
CN116895328B (en)
Inventor
王博千
李北平
任洪广
岳俊杰
靳远
胡明达
赵云祥
王辛
柴子力
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Academy of Military Medical Sciences AMMS of PLA
Original Assignee
Academy of Military Medical Sciences AMMS of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Academy of Military Medical Sciences AMMS of PLA
Priority to CN202311150502.8A
Publication of CN116895328A
Application granted
Publication of CN116895328B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00: ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/30: Unsupervised data analysis

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Chemical & Material Sciences (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention relates to the technical field of microbial evolution analysis and discloses an evolution event detection method and system for modularized gene structures. The method comprises the following steps: constructing a phylogenetic tree based on the target genome sequence; marking the gene structure module state of each node of the phylogenetic tree by using a maximum parsimony method; clustering all nodes of the phylogenetic tree according to the gene structure module state marks; and, based on the clustering result, performing vertical evolution analysis and horizontal transfer analysis on the gene structure module, thereby detecting gene evolution events. Based on the idea of maximum parsimony, the invention marks the gene structure module state of every node of the phylogenetic tree constructed from the target genome sequence, which reduces the computational complexity. By clustering the nodes of the phylogenetic tree and performing vertical evolution and horizontal transfer analysis on the gene structure modules, it improves the accuracy and precision of gene evolution event detection and meets the requirement of detecting horizontal gene transfer events in large-scale data sets.

Description

Evolution event detection method and system for modularized gene structure
Technical Field
The invention relates to the technical field of microbial evolutionary analysis in bioinformatics, in particular to an evolutionary event detection method and system for a modularized gene structure.
Background
Horizontal gene transfer is a common event in microbial evolution. It often causes the gain and loss of genetic material at the genetic and subgenomic levels of microorganisms, which provides an important driving force for the differentiation and evolution of prokaryotes and promotes genetic diversity within individual bacterial species.
Traditional microbial evolution analysis determines the horizontal gene transfer events that a microorganism has undergone, and hence its evolutionary history, by directly observing and comparing gene sequences. Such detection methods must compare sequences pairwise within every combination of three gene sequences and must build a separate phylogenetic tree to verify each candidate recombination event. They therefore consume large amounts of manpower, material resources, and time, are complex to operate and inefficient, and cannot meet the requirements of recombination detection on large-scale data sets. In addition, because older recombination signals are easily overwritten by more recent ones, the classical detection methods can only detect recently occurring horizontal gene transfer events and cannot detect loss events caused by recombination. Their detection precision is low, which limits their usefulness in the detection, analysis, and study of microbial gene evolution events.
Therefore, there is a need in the art for a detection method and system that overcomes the low efficiency and poor precision of the prior art and can detect horizontal gene transfer events in large-scale data sets.
Disclosure of Invention
The invention provides an evolution event detection method and system for a modularized gene structure, which address the defects of the prior art, namely low efficiency, poor precision, and the inability to detect horizontal gene transfer events in large-scale data sets, and which achieve efficient and accurate detection of gene evolution events.
The invention provides an evolution event detection method for a modularized gene structure, which comprises the following steps:
constructing a phylogenetic tree based on a target genome sequence, wherein the phylogenetic tree has a binary tree structure;
marking the gene structure module state of each node of the phylogenetic tree by using a maximum parsimony method;
clustering the nodes of the phylogenetic tree according to the gene structure module state marks;
and, according to the clustering result, performing vertical evolution analysis and horizontal transfer analysis on the gene structure module to detect gene evolution events.
According to the evolution event detection method for a modularized gene structure provided by the invention, marking the gene structure module state of each node of the phylogenetic tree by using the maximum parsimony method comprises the following steps:
determining the initial marking state of each node of the phylogenetic tree, wherein the initial marking state comprises two types: containing the gene structure module and not containing the gene structure module;
traversing from the leaf nodes to the root node and, when the initial marking state of a parent node is to be determined, obtaining the update marking state of the parent node from the initial marking states of its child nodes by using the maximum parsimony method;
and backtracking from the root node to the leaf nodes and, when the initial marking state of a child node is to be determined, obtaining the update marking state of the child node from the update marking state of its parent node.
According to the evolution event detection method for a modularized gene structure provided by the invention, traversing from the leaf nodes to the root node and, when the initial marking state of a parent node is to be determined, obtaining the update marking state of the parent node from the initial marking states of its child nodes by using the maximum parsimony method comprises the following steps:
when the initial marking states of the child nodes are the same, the update marking state of the parent node is the same as the initial marking state of the child nodes;
when the initial marking state of one child node is containing the gene structure module and the initial marking state of the other child node is not containing the gene structure module, the update marking state of the parent node is to be determined;
when the initial marking state of only one child node is to be determined, the update marking state of the parent node is the initial marking state of the other child node, whose initial marking state is determined;
when the initial marking state of the root node is to be determined, the update marking state of the root node is determined from the initial marking states of all nodes of the phylogenetic tree.
According to the evolution event detection method for a modularized gene structure provided by the invention, determining the update marking state of the root node from the initial marking states of all nodes of the phylogenetic tree when the initial marking state of the root node is to be determined comprises the following steps:
when the number of nodes whose initial marking state is containing the gene structure module is smaller than the number of nodes whose initial marking state is not containing the gene structure module, the update marking state of the root node is not containing the gene structure module;
when the number of nodes whose initial marking state is containing the gene structure module is larger than the number of nodes whose initial marking state is not containing the gene structure module, the update marking state of the root node is containing the gene structure module;
when the two numbers are equal, the initial marking state of the nearest node whose initial marking state is determined (containing or not containing the gene structure module) is taken as the update marking state of the root node.
According to the evolution event detection method for a modularized gene structure provided by the invention, clustering the nodes of the phylogenetic tree according to the gene structure module state marks specifically comprises:
clustering nodes that have the same update marking state and are directly connected in the phylogenetic tree into the same cluster, so that all nodes in the same cluster have the same update marking state and the update marking state of the vertex of any cluster differs from that of its parent node.
According to the evolution event detection method for a modularized gene structure provided by the invention, performing vertical evolution analysis and horizontal transfer analysis on the gene structure module according to the clustering result to detect gene evolution events comprises the following steps:
when a gene structure module contained in the nodes of a cluster is retained throughout the cluster, the gene structure module has been inherited and a vertical gene evolution event has occurred;
when the vertex of a cluster contains a gene structure module that its parent node does not contain, the vertex of the cluster has gained the gene structure module through horizontal transfer and a gene gain event has occurred;
when the vertex of a cluster does not contain a gene structure module that its parent node contains, the vertex of the cluster has lost the gene structure module through horizontal transfer and a gene loss event has occurred.
The evolution event detection method for a modularized gene structure provided by the invention further comprises the following step:
retrieving and counting complex evolution information involving multiple gene structure modules according to the detection result of the gene evolution events.
According to the evolution event detection method for a modularized gene structure provided by the invention, retrieving and counting complex evolution information involving multiple gene structure modules according to the detection result of the gene evolution events specifically comprises:
integrating the evolution processes of all gene structure modules according to the detection result of the gene evolution events, counting the number of gene structure modules gained and lost at each evolution node, and counting the combinations of gene structure modules that take part in the evolution process together, along with their frequencies.
According to the evolution event detection method for a modularized gene structure provided by the invention, counting the number of gene structure modules gained and lost at each evolution node specifically comprises:
for each node in the phylogenetic tree, counting the total number of evolution events (horizontal transfer and vertical evolution results) of all gene structure modules at that node.
According to the evolution event detection method for a modularized gene structure provided by the invention, counting the combinations and frequencies of gene structure modules that take part in the evolution process together comprises the following steps:
comparing the gene evolution events (horizontal transfer and vertical evolution results) across modules to find all two-element sets of gene structure modules that are always gained or lost at the same time;
merging two-element sets that share an element, using transitivity;
and iterating this process until no sets can be merged any further, so that the elements of each resulting set form a combination of gene structure modules that are always gained or lost at the same time.
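The merging of two-element sets described above can be sketched with a union-find structure. This is a minimal illustration only: the function name `merge_pairs` and the module labels `d1` to `d5` are hypothetical, and the patent does not prescribe a particular data structure for the merge.

```python
def merge_pairs(pairs):
    """Merge two-element sets that share an element into maximal groups.

    `pairs` is an iterable of (module_a, module_b) tuples, each meaning
    "these two modules are always gained/lost together".
    Union-find is an assumed implementation choice, not the patent's.
    """
    parent = {}

    def find(x):
        # Register unseen items as their own representative.
        parent.setdefault(x, x)
        # Walk up to the representative, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # transitive merge of the two sets

    groups = {}
    for item in list(parent):
        groups.setdefault(find(item), set()).add(item)
    # Sort for a deterministic result.
    return sorted(sorted(group) for group in groups.values())


# Hypothetical input: d1-d2 and d2-d3 share d2, so they merge into one set.
combined = merge_pairs([("d1", "d2"), ("d2", "d3"), ("d4", "d5")])
```

Union-find reaches the same fixed point as repeatedly merging overlapping sets until nothing changes, without rescanning all sets on every iteration.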
The invention also provides an evolution event detection system for a modularized gene structure, which comprises:
a construction module for constructing a phylogenetic tree based on the target genome sequence;
a marking module for marking the gene structure module state of each node of the phylogenetic tree by using a maximum parsimony method;
a clustering module for clustering the nodes of the phylogenetic tree according to the gene structure module state marks;
a detection module for performing vertical evolution analysis and horizontal transfer analysis on the gene structure module according to the clustering result to detect gene evolution events.
The invention also provides an electronic device comprising a processor and a memory storing a computer program, wherein the processor implements the above evolution event detection method for a modularized gene structure when executing the computer program.
The invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements any of the above evolution event detection methods for a modularized gene structure.
According to the evolution event detection method and system for a modularized gene structure provided by the invention, the gene structure module state of each node of the phylogenetic tree constructed from the target genome sequence is marked based on the idea of the maximum parsimony method. This ensures that the number of mark flips between the root node and the leaf nodes across the whole tree is minimal and follows the basic assumption of biological evolution, while reducing the computational complexity and the amount of computation and improving the efficiency and accuracy of gene evolution event detection. Cluster analysis is then performed on the nodes of the phylogenetic tree, and vertical evolution analysis and horizontal transfer analysis are performed on the gene structure module, which improves the detection precision of gene evolution events and meets the requirement of detecting horizontal gene transfer events in large-scale data sets.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following brief description will be given of the drawings used in the embodiments or the description of the prior art, it being obvious that the drawings in the following description are some embodiments of the invention and that other drawings can be obtained from them without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of the method for detecting evolutionary events of a modularized gene structure.
FIG. 2 is a second flow chart of the method for detecting evolutionary events in a modular genetic architecture according to the present invention.
FIG. 3 is a schematic structural diagram of an evolutionary event detection system for a modular genetic structure according to the present invention.
Fig. 4 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions thereof will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments, which should not be construed as limiting the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention. In the description of the present invention, it is to be understood that the terminology used is for the purpose of description only and is not to be interpreted as indicating or implying relative importance.
The method and system for detecting evolutionary events oriented to a modularized gene structure provided by the invention are described below with reference to fig. 1-4.
Fig. 1-2 are schematic flow diagrams of the method for detecting evolutionary events of modular gene structure according to the present invention. Referring to fig. 1, the method for detecting an evolutionary event oriented to a modular genetic structure provided by the present invention may include:
step S110, constructing a phylogenetic tree based on a target genome sequence, wherein the phylogenetic tree has a binary tree structure;
step S120, marking the gene structure module state of each node of the phylogenetic tree by using a maximum parsimony method;
step S130, clustering the nodes of the phylogenetic tree according to the gene structure module state marks;
and step S140, performing vertical evolution analysis and horizontal transfer analysis on the gene structure module according to the clustering result to detect gene evolution events.
It should be noted that, the execution body of the evolution event detection method for a modularized gene structure provided by the invention may be any network side device meeting technical requirements, such as a gene evolution event detection device.
It should be noted that the evolution event detection method for a modularized gene structure provided by the invention supports efficient gene evolution event detection for various modularized gene structures, such as fixed-length gene sequences, complete gene sequences, and protein domain gene sequences, and therefore has a wide range of application scenarios. In this embodiment, a protein domain is used as the example of a gene structure module, but in practice any modularized gene structure can be used to mark the gene structure module state of each node of the phylogenetic tree and thereby detect gene evolution events.
In step S110, the network-side device constructs a phylogenetic tree based on the target genomic sequence.
Specifically, the network side device may download the genome sequence of the target strain from the internet as the target genome sequence, use GTDB-Tk (a software toolkit that assigns objective taxonomic classifications to bacterial and archaeal genomes based on the Genome Taxonomy Database, GTDB) to find and align the 120 single-copy proteins ubiquitous in the bacterial domain in each strain, use IQ-TREE to construct a phylogenetic tree, and store the phylogenetic tree as an edge list (Edgelist), recorded as a network structure graph $G = (V, E)$, where $V$ is the node set of the graph, $E$ is the edge set of the graph, and the number of nodes in $G$ is $N$. Although GTDB-Tk and IQ-TREE are used as the tree-building software in this embodiment, any general tree-building standard and tree-building method may be used to build the phylogenetic tree in practice, and no limitation is imposed on the tree-building method.
It should be noted that the phylogenetic tree constructed by the invention is a binary tree in which each parent node has exactly two child nodes; the rest of this description assumes this case.
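As an illustration of the edge-list storage, the following sketch turns a list of edges into a children map and locates the root. It assumes each edge is written as a (parent, child) pair, which the text does not specify; `build_tree` and the node names are hypothetical.

```python
from collections import defaultdict


def build_tree(edge_list):
    """Build a children map from (parent, child) edges and find the root.

    Assumption: edges are directed from parent to child.
    """
    children = defaultdict(list)
    has_parent = set()
    for parent, child in edge_list:
        children[parent].append(child)
        has_parent.add(child)
    nodes = set(children) | has_parent
    roots = nodes - has_parent  # the root is the only node without a parent
    if len(roots) != 1:
        raise ValueError("expected exactly one root node")
    return dict(children), roots.pop()


# Hypothetical three-leaf binary tree: root -> (n1, C), n1 -> (A, B).
edges = [("root", "n1"), ("root", "C"), ("n1", "A"), ("n1", "B")]
children, root = build_tree(edges)
```

Leaves are simply the nodes that never appear as a key in `children`.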
After the phylogenetic tree is constructed, all protein domains contained in each target strain (i.e., each leaf node of the phylogenetic tree) are searched for and used to initialize the gene structure module state $S(v_a)$ of every leaf node $v_a$ in the phylogenetic tree. Protein domains can be matched against the Pfam database (a collection of protein families, each represented by a multiple sequence alignment and a hidden Markov model). For any given protein domain, if leaf node $v_a$ contains it, then $S(v_a) = 1$; otherwise $S(v_a) = 0$. Each leaf node of the phylogenetic tree thus has an initial marking state of one of two types: containing the gene structure module or not containing it. The initial marking states of the remaining nodes of the phylogenetic tree are to be determined and are marked "?", i.e., $S(v) = \text{?}$ for every non-leaf node $v$.
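The initialization can be sketched as follows for a single protein domain. The function `initial_states`, the `leaf_domains` mapping, and the labels `dom1` and `dom2` are hypothetical placeholders; in practice the per-leaf domain sets would come from a Pfam search, which is not shown here.

```python
def initial_states(children, root, leaf_domains, domain):
    """Assign 1/0 to leaves and "?" to internal nodes for one domain.

    children     : parent -> list of child nodes
    leaf_domains : leaf -> set of domain labels found in that strain
    """
    states = {}
    stack = [root]
    while stack:
        node = stack.pop()
        kids = children.get(node, [])
        if kids:
            states[node] = "?"  # internal node: to be determined
            stack.extend(kids)
        else:
            # Leaf: 1 if the strain contains the domain, otherwise 0.
            states[node] = 1 if domain in leaf_domains.get(node, set()) else 0
    return states


children = {"root": ["n1", "C"], "n1": ["A", "B"]}
leaf_domains = {"A": {"dom1"}, "B": {"dom1", "dom2"}, "C": set()}
states = initial_states(children, "root", leaf_domains, "dom1")
```

Running the same function once per domain yields one independent labelling problem per gene structure module.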
In step S120, the network side device marks the gene structure module state of each node of the phylogenetic tree by using the maximum parsimony method.
Specifically, the state marking of the gene structure module comprises two stages: a bottom-up stage and a top-down stage.
On the one hand, in the bottom-up stage, the tree is traversed from the leaf nodes to the root node. Starting from the initial marking states of the gene structure module assigned to the leaf nodes in step S110, the customized maximum parsimony method CMP shown in formula (1) is applied iteratively to the initial marking states $S(v_l)$ and $S(v_r)$ of the two child nodes $v_l$ and $v_r$ to obtain the update marking state $S'(v_p)$ of their parent node $v_p$:
$$S'(v_p) = \mathrm{CMP}\big(S(v_l), S(v_r)\big) = \begin{cases} S(v_l), & S(v_l) = S(v_r) \\ \text{?}, & S(v_l) \neq S(v_r),\ S(v_l), S(v_r) \in \{0, 1\} \\ S(v_r), & S(v_l) = \text{?},\ S(v_r) \in \{0, 1\} \\ S(v_l), & S(v_r) = \text{?},\ S(v_l) \in \{0, 1\} \end{cases} \qquad (1)$$
In formula (1), $S'(v_p)$ is the update marking state of parent node $v_p$, $S(v_l)$ is the initial marking state of child node $v_l$, $S(v_r)$ is the initial marking state of child node $v_r$, "?" indicates that the marking state is to be determined, and CMP denotes the customized maximum parsimony method (Customized Maximum Parsimony), where $S(\cdot), S'(\cdot) \in \{0, 1, \text{?}\}$.
That is, during the traversal from the leaf nodes to the root node, when the initial marking state of a parent node $v_p$ is to be determined, its update marking state $S'(v_p)$ is obtained from the initial marking states $S(v_l)$ and $S(v_r)$ of its two child nodes by the maximum parsimony method CMP. Specifically, when the initial marking states of the child nodes are the same (i.e., $S(v_l) = S(v_r)$), the update marking state of the parent node is the same as the initial marking state of the child nodes (i.e., $S'(v_p) = S(v_l) = S(v_r)$). When one child node's initial marking state is containing the gene structure module, the other's is not containing it, and neither is to be determined (i.e., $S(v_l) \neq S(v_r)$ with $S(v_l), S(v_r) \in \{0, 1\}$), the update marking state of the parent node is to be determined (i.e., $S'(v_p) = \text{?}$). When the initial marking state of exactly one child node is to be determined (e.g., $S(v_l) = \text{?}$), the update marking state of the parent node is the initial marking state of the other child node, whose state is determined (i.e., $S'(v_p) = S(v_r)$).
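The combination rule and the bottom-up pass can be sketched as below. This is an illustrative reading of the rule as described in the text, not a reference implementation; `cmp_rule` and `bottom_up` are hypothetical names, and the tree is assumed to be binary with states given as 0, 1, or "?".

```python
UNKNOWN = "?"


def cmp_rule(x, y):
    """Combine two child states following the rule described in the text."""
    if x == y:
        return x  # identical children: the parent copies them
    if x == UNKNOWN:
        return y  # only one child is determined: take its state
    if y == UNKNOWN:
        return x
    return UNKNOWN  # children are 0 and 1: the parent stays undetermined


def bottom_up(children, root, states):
    """Update every undetermined internal node from its two children."""
    result = dict(states)
    order, stack = [], [root]
    while stack:  # pre-order walk: a parent is listed before its descendants
        node = stack.pop()
        order.append(node)
        stack.extend(children.get(node, []))
    for node in reversed(order):  # so the reverse visits children first
        kids = children.get(node, [])
        if kids and result[node] == UNKNOWN:
            left, right = kids  # binary tree assumed
            result[node] = cmp_rule(result[left], result[right])
    return result


children = {"root": ["n1", "C"], "n1": ["A", "B"]}
states = {"root": "?", "n1": "?", "A": 1, "B": 1, "C": 0}
after_pass = bottom_up(children, "root", states)
```

In this example `n1` becomes 1 because both of its children are 1, while the root stays "?" because its children disagree; the root case is handled separately in the text.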
If the state of the root node $v_{root}$ obtained by the above procedure is "?", the update marking state of the root node can, following the maximum parsimony method, be determined from the initial marking states of all nodes of the phylogenetic tree by formula (2):
$$S'(v_{root}) = \begin{cases} 0, & N_1 < N_0 \\ 1, & N_1 > N_0 \\ S(v_{near}), & N_1 = N_0 \end{cases} \qquad (2)$$
In formula (2), $S'(v_{root})$ is the update marking state of the root node, $N_0$ is the number of nodes in the phylogenetic tree whose initial marking state is 0 (nodes not containing the gene structure module), $N_1$ is the number of nodes whose initial marking state is 1 (nodes containing the gene structure module), and $S(v_{near})$ is the initial marking state of the determined node $v_{near}$ nearest to the root node.
That is, when the number $N_1$ of nodes whose initial marking state is containing the gene structure module is smaller than the number $N_0$ of nodes whose initial marking state is not containing it (i.e., $N_1 < N_0$), the update marking state of the root node is not containing the gene structure module (i.e., $S'(v_{root}) = 0$); when $N_1$ is larger than $N_0$ (i.e., $N_1 > N_0$), the update marking state of the root node is containing the gene structure module (i.e., $S'(v_{root}) = 1$).
If the two numbers are equal (i.e., $N_1 = N_0$), the child nodes can be traversed from the root node along the phylogenetic tree structure by a depth-first search (DFS) or breadth-first search (BFS) algorithm to find the nearest node with a determined mark, i.e., the nearest node $v_{near}$ whose initial marking state is containing or not containing the gene structure module. Its initial marking state $S(v_{near})$ is taken as the update marking state of the root node, so that $S'(v_{root}) = S(v_{near})$.
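A sketch of the root rule follows. It makes two assumptions that the text leaves open: the counts are taken over whatever states are passed in (here, the states after the bottom-up pass), and ties are broken with a breadth-first search whose visiting order decides between equally near nodes. The name `resolve_root` is hypothetical.

```python
from collections import deque


def resolve_root(children, root, states):
    """Return the root's state, resolving "?" by majority, then by nearest node."""
    if states[root] != "?":
        return states[root]
    count_absent = sum(1 for s in states.values() if s == 0)
    count_present = sum(1 for s in states.values() if s == 1)
    if count_present < count_absent:
        return 0
    if count_present > count_absent:
        return 1
    # Tie: breadth-first search for the nearest determined node.
    queue = deque(children.get(root, []))
    while queue:
        node = queue.popleft()
        if states[node] != "?":
            return states[node]
        queue.extend(children.get(node, []))
    raise ValueError("tree has no determined node")


children = {"root": ["n1", "C"], "n1": ["A", "B"]}
states = {"root": "?", "n1": 1, "A": 1, "B": 1, "C": 0}
root_state = resolve_root(children, "root", states)
```

BFS is used here because it visits nodes in order of distance from the root, which matches the "nearest node" wording directly; a DFS would need to track depth explicitly to give the same guarantee.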
On the other hand, in the top-down stage, the tree is backtracked from the root node to the leaf nodes in turn. When the initial marking state of a child node $v_c$ is to be determined, the update marking state $S'(v_p)$ of its parent node $v_p$ is used to obtain the update marking state $S'(v_c)$ of the child node, so that $S'(v_c) = S'(v_p)$.
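The top-down stage can be sketched as a single walk from the root in which every undetermined child copies its parent's state. The sketch assumes the root state has already been resolved; `top_down` and the example tree are hypothetical.

```python
def top_down(children, root, states):
    """Fill each remaining "?" with the state of the node's parent."""
    if states[root] == "?":
        raise ValueError("the root state must be resolved first")
    updated = dict(states)
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            if updated[child] == "?":
                updated[child] = updated[node]  # inherit the parent's state
            stack.append(child)
    return updated


children = {"root": ["n1", "n2"], "n1": ["A", "B"], "n2": ["C", "D"]}
states = {"root": 1, "n1": "?", "A": 1, "B": 0, "n2": 0, "C": 0, "D": 0}
updated = top_down(children, "root", states)
```

Because a parent is always processed before its children, an inherited state can itself be passed further down in the same walk.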
Step S120 satisfies the basic principle of the maximum parsimony method: it ensures that the number of mark flips between the root node and the leaf nodes across the whole tree is minimal, follows the basic assumption of biological evolution, and greatly improves the efficiency of gene evolution event detection.
In order to avoid confusion, the genetic structure flag states of the nodes of the phylogenetic tree processed in step S120 are hereinafter collectively referred to as update flag states.
In step S130, the network side device clusters each node of the phylogenetic tree according to the status flag of the genetic structure module.
Specifically, based on the gene structure module state marks from step S120, the network side device may use the connection relations of the phylogenetic tree and invoke a graph traversal algorithm starting from the root node to cluster nodes that have the same update marking state (both "0" or both "1") and are directly connected in the phylogenetic tree into the same cluster $C$. As a result, all nodes in the same cluster have the same update marking state, i.e., $S'(v_i) = S'(v_j)$ for any nodes $v_i, v_j \in C$, and the update marking state of the vertex of any cluster differs from that of its parent node.
In this embodiment, node clustering is a top-down process. The root node $v_{root}$ is taken as the vertex of the first cluster $C_1$, and its child nodes are traversed using a BFS or DFS algorithm. For any node $v_i$ reached during the traversal, if the update marking state of $v_i$ is the same as that of the vertex $v_{top}$ of the current cluster $C_k$ (i.e., $S'(v_i) = S'(v_{top})$), then $v_i$ and its edge to its parent node are included in cluster $C_k$, and the traversal continues iteratively with the child nodes of $v_i$; if the update marking state of $v_i$ differs from that of the vertex (i.e., $S'(v_i) \neq S'(v_{top})$), then $v_i$ becomes the vertex of a new cluster and the above process is repeated. In this way a cluster set $\mathcal{C} = \{C_1, C_2, \dots\}$ containing all nodes of the phylogenetic tree is obtained. For any cluster $C_k \in \mathcal{C}$ and any two nodes $v_i, v_j \in C_k$, $S'(v_i) = S'(v_j)$ (all nodes in the same cluster have the same update marking state); and if the vertex $v_{top}$ of $C_k$ has a parent node $v_p$, then $S'(v_{top}) \neq S'(v_p)$ (the update marking state of the vertex of any cluster differs from that of its parent node).
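The top-down clustering can be sketched as follows, with each cluster identified by its vertex. Comparing a node with its parent is equivalent to comparing it with the cluster vertex, since all nodes already placed in a cluster share one state. The function `cluster_nodes` and the example data are hypothetical.

```python
def cluster_nodes(children, root, updated):
    """Group connected nodes with equal states; key each cluster by its vertex."""
    vertex_of = {root: root}  # the root is the vertex of the first cluster
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            if updated[child] == updated[node]:
                vertex_of[child] = vertex_of[node]  # stay in the parent's cluster
            else:
                vertex_of[child] = child  # state flips: start a new cluster
            stack.append(child)
    clusters = {}
    for node, vertex in vertex_of.items():
        clusters.setdefault(vertex, set()).add(node)
    return clusters


children = {"root": ["n1", "n2"], "n1": ["A", "B"], "n2": ["C", "D"]}
updated = {"root": 1, "n1": 1, "A": 1, "B": 0, "n2": 0, "C": 0, "D": 0}
clusters = cluster_nodes(children, "root", updated)
```

Every cluster key other than the root marks a point where the state differs from the parent's, which is exactly where the later analysis looks for gain and loss events.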
In step S140, the network side device performs vertical evolution analysis and horizontal transfer analysis on the genetic structure module according to the clustering result, so as to detect a genetic evolution event.
Specifically, the vertical evolution analysis and horizontal transfer analysis may proceed as follows. For any cluster $C_k$, if $S'(v_i) = 1$ for every node $v_i \in C_k$, the gene structure module evolves vertically (is inherited) within $C_k$; that is, when a gene structure module contained in the nodes of a cluster is retained throughout the cluster, the gene structure module has been inherited and a vertical gene evolution event has occurred. For such a cluster $C_k$, if its vertex $v_{top}$ has a parent node (which must be marked "0"), then $v_{top}$ gained the gene structure module through horizontal transfer; that is, when the vertex of a cluster contains a gene structure module that its parent node does not contain, the vertex has gained the gene structure module through horizontal transfer and a gene gain event has occurred. For any cluster $C_k$, if $S'(v_i) = 0$ for every node $v_i \in C_k$ and its vertex $v_{top}$ has a parent node (which must be marked "1"), then $v_{top}$ lost the gene structure module through horizontal transfer; that is, when the vertex of a cluster does not contain a gene structure module that its parent node contains, the vertex has lost the gene structure module through horizontal transfer and a gene loss event has occurred.
In this example, for each protein domain, a cluster is obtained according to steps S110-S140Vertical evolution occurs inside clusters and horizontal migration occurs at cluster vertices.
First, regarding vertical evolution analysis, according to the clustering result, the network side device can findSo that for any->And arbitrary node->All have->At the same time for any->And arbitrary node->All have->It can then be determined that the protein domain is +. >Is->Vertical evolution processes occur.
Second, events that gain protein domains for horizontal transfer are also directed to the aboveFor any one ofCluster vertices +.>Must->Indicating that the cluster vertex possesses the protein domain; if it has a parent node, according to +.in step S130>Then there isThe parent node representing the cluster vertex does not possess the protein domain, and therefore the cluster vertex obtains the protein domain by a horizontal transfer event.
Finally, for the horizontal transfer loss of protein domain event, then forFor any->Cluster vertices +.>Must->Indicating that the cluster vertex does not possess the protein domain; if it has a parent node, according to +.in step S130>Then there isThe parent node representing the cluster vertex owns the protein domain, and therefore the cluster vertex loses the protein domain by a horizontal transfer event.
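The three event types can be illustrated with a short sketch. It relies on the observation that a cluster vertex with a parent is exactly a node whose marking state differs from its parent's, so scanning parent-child edges finds the same vertices without materializing the clusters. The `Node` class, the function name `detect_events`, and the example tree are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    name: str
    state: int  # updated marking state: 1 = contains the domain, 0 = does not
    children: List["Node"] = field(default_factory=list)


def detect_events(root: Node) -> Tuple[List[str], List[str], List[str]]:
    """Return (gained_at, lost_at, inherited_at) as lists of node names."""
    gained, lost, inherited = [], [], []
    stack = [root]
    while stack:
        parent = stack.pop()
        for child in parent.children:
            if child.state == parent.state:
                if child.state == 1:
                    inherited.append(child.name)  # vertical evolution inside a cluster
            elif child.state == 1:
                gained.append(child.name)  # vertex marked 1, parent marked 0
            else:
                lost.append(child.name)  # vertex marked 0, parent marked 1
            stack.append(child)
    return gained, lost, inherited


# Hypothetical example tree.
tree = Node("R", 0, [
    Node("A", 1, [Node("A1", 1), Node("A2", 0)]),
    Node("B", 0, [Node("B1", 0), Node("B2", 1)]),
])
gained, lost, inherited = detect_events(tree)
```

The root is never reported as a gain or loss here because it has no parent, in line with the condition "if it has a parent node" in the text.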
Further, referring to fig. 2, the method for detecting a genetic evolutionary event according to the present invention may further include step S150: and searching and counting the complex evolutionary information of the multi-gene structural module according to the detection result of the gene evolutionary event.
It should be noted that, steps S110-S140 have already implemented searching for the evolution information of the genetic structure modules, and step S150 is to use data analysis to deep dig more complex evolution information, involving correlation and statistics on the number of the genetic structure modules.
Specifically, the network side device may count the number of obtained and lost genetic structure modules for each node in the phylogenetic tree; in addition, the network side equipment can also record cluster vertex points of all acquisition and loss events of each gene structure moduleThe transverse comparison statistics are always obtained simultaneously or (and) are combined by the lost gene structure modules, and the occurrence frequency of the gene structure modules is counted.
In this example, for each protein domainIn other words, the cluster vertices of the protein domain obtained by the horizontal transfer event found according to step S140 can be assembled +.>I.e. for any nodeAll have->(if a parent node exists). Thus for all protein domainsThe set +.>. For each node in the phylogenetic tree +.>Find it at +.>The total number of occurrences in each set is then the total number of protein domains obtained by horizontal transfer of the node. Accordingly, the network side device can respectively count the number of protein domains obtained by each node in the phylogenetic tree. The same applies to protein domain loss event analysis.
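The per-node count described above amounts to counting, for each node, how many per-domain acquisition sets contain it. A minimal sketch follows; the function name `count_gains_per_node`, the domain identifiers `d1` to `d3`, and the node names are hypothetical.

```python
from collections import Counter
from typing import Dict, Set


def count_gains_per_node(gain_sets: Dict[str, Set[str]]) -> Counter:
    """Count, for each node, the number of domains gained at that node."""
    counts: Counter = Counter()
    for nodes in gain_sets.values():
        counts.update(nodes)  # each node in a domain's set contributes one gain
    return counts


# Hypothetical input: domain -> set of cluster vertices that gained it.
gain_sets = {
    "d1": {"A", "B2"},
    "d2": {"A"},
    "d3": {"A", "B2"},
}
gain_counts = count_gains_per_node(gain_sets)
```

Passing the per-domain loss sets to the same function would give the loss counts, since the text notes that loss events are analyzed in the same way.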
Second, the search is always performed simultaneously with the acquisition or (and) of the missing gene structural module combination.
Also exemplified by protein domain acquisition events, by lateral collection of each protein domainNode set generating obtained event +.>For each node->Construction of protein domain combinations obtained therefromThe set +.>. From this it is possible to analyze the search binary set, e.g.>So that for any->If->Then there is a need toAnd vice versa. Based on transitivity principle, it is possible to try to combine binary sets searched for all protein domains to generate a larger set, e.g. analysis to obtain binary sets +.>And->The satisfaction is always obtained simultaneously in the node, set +.>The above conditions are also satisfied. And so on until no merger is possible, all complete sets can be found so that protein domain satisfaction in each set is always obtained simultaneously. The same applies to the analysis of protein domain combinations in the event of loss.
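One way to obtain the complete sets is sketched below. Instead of the pairwise search followed by transitive merging described in the text, this sketch groups domains directly by their acquisition node set. Under the reading that "always obtained simultaneously" means two domains have identical acquisition node sets, the relation is an equivalence and both routes give the same groups; that reading is an assumption. The function name `co_gained_sets` and the example identifiers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Set


def co_gained_sets(gain_sets: Dict[str, Set[str]]) -> Dict[FrozenSet[str], int]:
    """Map each maximal set of co-gained domains to the number of nodes where it is gained."""
    groups = defaultdict(set)
    for domain, nodes in gain_sets.items():
        if nodes:  # domains that are never gained are not part of any combination
            groups[frozenset(nodes)].add(domain)
    # Keep only combinations of two or more domains.
    return {
        frozenset(domains): len(nodes)
        for nodes, domains in groups.items()
        if len(domains) >= 2
    }


# Hypothetical input: domain -> set of nodes at which it is gained.
gain_sets = {
    "d1": {"A", "B2"},
    "d2": {"A"},
    "d3": {"A", "B2"},
    "d4": set(),
}
combos = co_gained_sets(gain_sets)
```

Grouping by a hashable key avoids comparing every pair of domains, which matters when the number of domains is large.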
The evolution event detection method facing the modularized gene structure provided by the invention has at least the following advantages:
(1) Based on the idea of maximum parsimony, the method rapidly marks the gene structure module state of every node of the phylogenetic tree using a customized maximum parsimony method, ensuring that the number of marking-state flips between the root node and the leaf nodes over the whole tree is minimized. This reduces computational complexity and the amount of computation, improves the efficiency and accuracy of gene evolution event detection, and enables batch screening of large-scale genome data for modular gene structures in a short time.
(2) By performing cluster analysis on the nodes of the phylogenetic tree and performing vertical evolution analysis and horizontal transfer analysis on the gene structure module, all gene evolution events in the whole evolutionary process (including loss events caused by recombination, long-term gene evolution events, and the like) can be comprehensively detected, which improves the detection precision for gene evolution events and meets the need to detect horizontal gene transfer events in large-scale data sets.
(3) By comprehensively analyzing the evolutionary process of each gene structure module, searching for key evolution/differentiation sites through cross-comparison, counting the number of gene structure modules obtained and lost at each evolutionary node, and counting the combinations and frequencies of gene structure modules that participate in the evolutionary process together, the ability to detect horizontal gene transfer events can be further enhanced, providing effective and accurate data for research on microbial gene evolution analysis and improving the effectiveness of such research.
(4) The invention supports detecting both acquisition events and loss events caused by horizontal transfer of a specific gene structure module during evolution.
(5) The invention can use a data set spanning a period of time to infer the gene evolution events that occurred after the common ancestor of the data set, thereby detecting gene evolution events that date back relatively far in the evolutionary process.
(6) The invention supports detection of various modularized gene structures, such as fixed-length gene sequences, complete gene sequences, protein domain gene sequences and the like, and has wide application scenes and range.
The evolutionary event detection system for a modular gene structure provided by the invention is described below. The system described below and the method described above may be referred to in correspondence with each other.
Referring to Fig. 3, the evolutionary event detection system for a modular gene structure according to the present invention may include:
a construction module 410 for: constructing a phylogenetic tree based on a target genome sequence, wherein the phylogenetic tree has a binary tree structure;
a marking module 420 for: marking the gene structure module state of each node of the phylogenetic tree by using a maximum parsimony method;
a clustering module 430 for: clustering the nodes of the phylogenetic tree according to the gene structure module state marks;
a detection module 440 for: performing vertical evolution analysis and horizontal transfer analysis on the gene structure module according to the clustering result, so as to detect gene evolution events.
According to the present invention, the marking module 420 may include:
a judging sub-module for: determining the initial marking state of each node of the phylogenetic tree, wherein the initial marking state includes containing the gene structure module and not containing the gene structure module;
a first marking sub-module for: traversing from the leaf nodes to the root node, and, when the initial marking state of a parent node is to be determined, obtaining the updated marking state of the parent node from the initial marking states of its child nodes by using the maximum parsimony method;
a second marking sub-module for: backtracking from the root node to the leaf nodes, and, when the initial marking state of a child node is to be determined, obtaining the updated marking state of the child node from the updated marking state of its parent node.
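The backtracking performed by the second marking sub-module can be sketched as below. The sketch assumes that the bottom-up traversal has already run, that the root has been resolved, and that any node still to be determined is represented by `None`; the `Node` class, the function name `backtrack`, and the example tree are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str
    state: Optional[int]  # 1 = contains, 0 = does not contain, None = to be determined
    children: List["Node"] = field(default_factory=list)


def backtrack(root: Node) -> None:
    """Top-down pass: an undetermined child inherits its parent's state."""
    if root.state is None:
        raise ValueError("the root state must be resolved before backtracking")
    stack = [root]
    while stack:
        parent = stack.pop()
        for child in parent.children:
            if child.state is None:
                child.state = parent.state  # fewest state flips along this edge
            stack.append(child)


# Hypothetical tree after the bottom-up pass: X is still undetermined.
x = Node("X", None, [Node("X1", 1), Node("X2", 0)])
y = Node("Y", 0, [Node("Y1", 0), Node("Y2", 0)])
root = Node("R", 1, [x, y])
backtrack(root)
```

Because a parent is always processed before its children are pushed, every node popped from the stack already has a determined state.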
According to the evolutionary event detection system for a modular gene structure provided by the invention, the first marking sub-module may include:
a first update sub-module for: when the initial marking states of the child nodes are the same, setting the updated marking state of the parent node to be the same as the initial marking state of the child nodes;
a second update sub-module for: when the initial marking state of one child node is containing the gene structure module and the initial marking state of the other child node is not containing the gene structure module, setting the updated marking state of the parent node to be determined;
a third update sub-module for: when the initial marking state of only one child node is to be determined, setting the updated marking state of the parent node to the initial marking state of the other child node, whose initial marking state is determined;
a root node update sub-module for: when the initial marking state of the root node is to be determined, determining the updated marking state of the root node according to the initial marking states of all nodes of the phylogenetic tree.
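The three child-to-parent rules can be written as one small function and applied in a post-order traversal. This is a sketch under assumptions: `None` stands for "to be determined", each internal node has exactly two children (the binary tree structure stated earlier), and two undetermined children leave the parent undetermined, a case the text does not spell out. The names `Node`, `combine`, and `up_pass` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str
    state: Optional[int] = None  # 1 = contains, 0 = does not contain, None = to be determined
    children: List["Node"] = field(default_factory=list)


def combine(a: Optional[int], b: Optional[int]) -> Optional[int]:
    """Derive a parent's state from the states of its two children."""
    if a == b:
        return a  # same state (assumption: both undetermined stays undetermined)
    if a is None:
        return b  # only one child undetermined: take the other child's state
    if b is None:
        return a
    return None  # one child contains the module, the other does not


def up_pass(node: Node) -> Optional[int]:
    """Post-order traversal from the leaves toward the root."""
    if node.children:
        left, right = node.children  # binary tree: exactly two children
        node.state = combine(up_pass(left), up_pass(right))
    return node.state


# Hypothetical example: leaf states are given, internal nodes start undetermined.
left = Node("P", children=[Node("p1", 1), Node("p2", 1)])
right = Node("Q", children=[Node("q1", 1), Node("q2", 0)])
root = Node("R", children=[left, right])
up_pass(root)
```

The recursion is kept for brevity; for very deep trees an explicit stack would avoid Python's recursion limit.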
According to the evolutionary event detection system for a modular gene structure provided by the invention, the root node update sub-module may include:
a fourth update sub-module for: when the number of nodes whose initial marking state is containing the gene structure module is smaller than the number of nodes whose initial marking state is not containing the gene structure module, setting the updated marking state of the root node to not containing the gene structure module;
a fifth update sub-module for: when the number of nodes whose initial marking state is containing the gene structure module is larger than the number of nodes whose initial marking state is not containing the gene structure module, setting the updated marking state of the root node to containing the gene structure module;
a sixth update sub-module for: when the number of nodes whose initial marking state is containing the gene structure module equals the number of nodes whose initial marking state is not containing the gene structure module, taking the initial marking state of the nearest node whose initial marking state is containing or not containing the gene structure module as the updated marking state of the root node.
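The majority rule for an undetermined root, with its tie-break, might look like the following sketch. Two points are assumptions rather than statements from the text: "nearest" is read as nearest to the root in number of edges, found by breadth-first search, and ties in distance are broken by traversal order. The `Node` class, its `initial` field, and the function name `resolve_root` are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str
    initial: Optional[int] = None  # 1 = contains, 0 = does not contain, None = to be determined
    children: List["Node"] = field(default_factory=list)


def resolve_root(root: Node) -> Optional[int]:
    """Choose the root state by majority of initial states; on a tie use the nearest marked node."""
    ones = zeros = 0
    nearest = None
    queue = deque([root])
    while queue:  # BFS visits nodes in order of distance from the root
        node = queue.popleft()
        if node.initial is not None:
            if nearest is None:
                nearest = node.initial  # first marked node reached (assumed "nearest")
            if node.initial == 1:
                ones += 1
            else:
                zeros += 1
        queue.extend(node.children)
    if ones > zeros:
        return 1
    if zeros > ones:
        return 0
    return nearest


# Hypothetical tie: two nodes marked 1 and two marked 0; leaf "c" is closest to the root.
tie_tree = Node("R", children=[
    Node("c", 0),
    Node("L", children=[
        Node("a", 1),
        Node("M", children=[Node("b", 1), Node("d", 0)]),
    ]),
])
tie_result = resolve_root(tie_tree)

# Hypothetical majority case: two nodes marked 1, one marked 0.
maj_tree = Node("R", children=[Node("c", 1), Node("L", children=[Node("a", 1), Node("b", 0)])])
maj_result = resolve_root(maj_tree)
```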
According to the evolutionary event detection system for a modular gene structure provided by the invention, the clustering module 430 is specifically configured to: cluster nodes of the phylogenetic tree that have the same updated marking state and are directly connected to one another into the same cluster, so that the updated marking states of all nodes in the same cluster are the same, and the updated marking state of the vertex of any cluster differs from that of its parent node.
According to the present invention, the detection module 440 may include:
a vertical evolution event detection sub-module for: when the gene structure module originally contained in the nodes of a cluster is retained within the cluster, determining that the gene structure module is inherited and that a gene vertical evolution event occurs;
a gene acquisition event detection sub-module for: when the vertex of a cluster contains a gene structure module that its parent node does not contain, determining that the vertex obtained the gene structure module through horizontal transfer and that a gene acquisition event occurs;
a gene loss event detection sub-module for: when the vertex of a cluster does not contain a gene structure module that its parent node contains, determining that the vertex lost the gene structure module through horizontal transfer and that a gene loss event occurs.
According to the invention, the evolutionary event detection system for a modular gene structure may further include:
a comprehensive analysis module for: searching for and compiling statistics on complex evolutionary information involving multiple gene structure modules, according to the detection results for gene evolution events.
Fig. 4 is a schematic diagram of the physical structure of an electronic device. As shown in Fig. 4, the electronic device may include a processor 810, a communications interface 820, a memory 830, and a communication bus 840, wherein the processor 810, the communications interface 820, and the memory 830 communicate with one another through the communication bus 840. The processor 810 may invoke logic instructions in the memory 830 to perform the evolutionary event detection method for a modular gene structure, the method comprising:
constructing a phylogenetic tree based on a target genome sequence, wherein the phylogenetic tree has a binary tree structure;
marking the gene structure module state of each node of the phylogenetic tree by using a maximum parsimony method;
clustering the nodes of the phylogenetic tree according to the gene structure module state marks;
performing vertical evolution analysis and horizontal transfer analysis on the gene structure module according to the clustering result, so as to detect gene evolution events.
Further, the logic instructions in the memory 830 may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and comprises several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, performs the evolutionary event detection method for a modular gene structure provided above, the method comprising:
constructing a phylogenetic tree based on a target genome sequence, wherein the phylogenetic tree has a binary tree structure;
marking the gene structure module state of each node of the phylogenetic tree by using a maximum parsimony method;
clustering the nodes of the phylogenetic tree according to the gene structure module state marks;
performing vertical evolution analysis and horizontal transfer analysis on the gene structure module according to the clustering result, so as to detect gene evolution events.
The apparatus embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. An evolutionary event detection method for a modular gene structure, comprising:
constructing a phylogenetic tree based on a target genome sequence, wherein the phylogenetic tree has a binary tree structure;
marking the gene structure module state of each node of the phylogenetic tree by using a maximum parsimony method;
clustering the nodes of the phylogenetic tree according to the gene structure module state marks;
performing vertical evolution analysis and horizontal transfer analysis on the gene structure module according to the clustering result, so as to detect gene evolution events.
2. The evolutionary event detection method for a modular gene structure according to claim 1, wherein marking the gene structure module state of each node of the phylogenetic tree by using the maximum parsimony method comprises:
determining the initial marking state of each node of the phylogenetic tree, wherein the initial marking state includes containing the gene structure module and not containing the gene structure module;
traversing from the leaf nodes to the root node, and, when the initial marking state of a parent node is to be determined, obtaining the updated marking state of the parent node from the initial marking states of its child nodes by using the maximum parsimony method;
backtracking from the root node to the leaf nodes, and, when the initial marking state of a child node is to be determined, obtaining the updated marking state of the child node from the updated marking state of its parent node.
3. The evolutionary event detection method for a modular gene structure according to claim 2, wherein traversing from the leaf nodes to the root node and, when the initial marking state of a parent node is to be determined, obtaining the updated marking state of the parent node from the initial marking states of its child nodes by using the maximum parsimony method comprises:
when the initial marking states of the child nodes are the same, setting the updated marking state of the parent node to be the same as the initial marking state of the child nodes;
when the initial marking state of one child node is containing the gene structure module and the initial marking state of the other child node is not containing the gene structure module, setting the updated marking state of the parent node to be determined;
when the initial marking state of only one child node is to be determined, setting the updated marking state of the parent node to the initial marking state of the other child node, whose initial marking state is determined;
when the initial marking state of the root node is to be determined, determining the updated marking state of the root node according to the initial marking states of all nodes of the phylogenetic tree.
4. The evolutionary event detection method for a modular gene structure according to claim 3, wherein, when the initial marking state of the root node is to be determined, determining the updated marking state of the root node according to the initial marking states of all nodes of the phylogenetic tree comprises:
when the number of nodes whose initial marking state is containing the gene structure module is smaller than the number of nodes whose initial marking state is not containing the gene structure module, setting the updated marking state of the root node to not containing the gene structure module;
when the number of nodes whose initial marking state is containing the gene structure module is larger than the number of nodes whose initial marking state is not containing the gene structure module, setting the updated marking state of the root node to containing the gene structure module;
when the number of nodes whose initial marking state is containing the gene structure module equals the number of nodes whose initial marking state is not containing the gene structure module, taking the initial marking state of the nearest node whose initial marking state is containing or not containing the gene structure module as the updated marking state of the root node.
5. The evolutionary event detection method for a modular gene structure according to any one of claims 2-4, wherein clustering the nodes of the phylogenetic tree according to the gene structure module state marks specifically comprises:
clustering nodes of the phylogenetic tree that have the same updated marking state and are directly connected to one another into the same cluster, so that the updated marking states of all nodes in the same cluster are the same, and the updated marking state of the vertex of any cluster differs from that of its parent node.
6. The evolutionary event detection method for a modular gene structure according to claim 5, wherein performing vertical evolution analysis and horizontal transfer analysis on the gene structure module according to the clustering result, so as to detect gene evolution events, comprises:
when the gene structure module originally contained in the nodes of a cluster is retained within the cluster, determining that the gene structure module is inherited and that a gene vertical evolution event occurs;
when the vertex of a cluster contains a gene structure module that its parent node does not contain, determining that the vertex obtained the gene structure module through horizontal transfer and that a gene acquisition event occurs;
when the vertex of a cluster does not contain a gene structure module that its parent node contains, determining that the vertex lost the gene structure module through horizontal transfer and that a gene loss event occurs.
7. The evolutionary event detection method for a modular gene structure according to claim 1, further comprising:
searching for and compiling statistics on complex evolutionary information involving multiple gene structure modules, according to the detection results for gene evolution events.
8. An evolutionary event detection system for a modular gene structure, comprising:
a construction module for: constructing a phylogenetic tree based on a target genome sequence;
a marking module for: marking the gene structure module state of each node of the phylogenetic tree by using a maximum parsimony method;
a clustering module for: clustering the nodes of the phylogenetic tree according to the gene structure module state marks;
a detection module for: performing vertical evolution analysis and horizontal transfer analysis on the gene structure module according to the clustering result, so as to detect gene evolution events.
9. An electronic device comprising a processor and a memory storing a computer program, characterized in that the processor, when executing the computer program, implements the evolutionary event detection method for a modular gene structure according to any one of claims 1 to 7.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the evolutionary event detection method for a modular gene structure according to any one of claims 1 to 7.
CN202311150502.8A 2023-09-07 2023-09-07 Evolution event detection method and system for modularized gene structure Active CN116895328B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311150502.8A CN116895328B (en) 2023-09-07 2023-09-07 Evolution event detection method and system for modularized gene structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311150502.8A CN116895328B (en) 2023-09-07 2023-09-07 Evolution event detection method and system for modularized gene structure

Publications (2)

Publication Number Publication Date
CN116895328A true CN116895328A (en) 2023-10-17
CN116895328B CN116895328B (en) 2023-12-08

Family

ID=88311033

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311150502.8A Active CN116895328B (en) 2023-09-07 2023-09-07 Evolution event detection method and system for modularized gene structure

Country Status (1)

Country Link
CN (1) CN116895328B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040204861A1 (en) * 2003-01-23 2004-10-14 Benner Steven Albert Evolution-based functional proteomics
US20110280907A1 (en) * 2008-11-25 2011-11-17 Max-Planck-Gesellschaft Zur Forderung Der Wissenschaften E.V. Method and system for building a phylogeny from genetic sequences and using the same for recommendation of vaccine strain candidates for the influenza virus
CN102542178A (en) * 2011-12-31 2012-07-04 重庆邮电大学 Gene intron evolution reconstruction device and method
US20180142307A1 (en) * 2014-02-11 2018-05-24 California Institute Of Technology Recording and mapping lineage information and molecular events in individual cells
US20180285519A1 (en) * 2016-12-30 2018-10-04 Brown University Phylogeny tree generation from mixed samples
CN109326328A (en) * 2018-11-02 2019-02-12 西北大学 A kind of extinct plants and animal pedigree evolution analysis method based on pedigree cluster
CN112908410A (en) * 2021-03-01 2021-06-04 上海欧易生物医学科技有限公司 Detection method and system for positive selection gene based on snakekeke process
WO2022072717A1 (en) * 2020-09-30 2022-04-07 University Of Virginia Patent Foundation Method and system for early efficient detection of co-evolutionary sites in evolving bio-networks
CN115691656A (en) * 2022-10-11 2023-02-03 中国科学院计算机网络信息中心 Method and device for accelerating evolution tree of large system
WO2023081413A2 (en) * 2021-11-05 2023-05-11 Lifemine Therapeutics, Inc. Methods and systems for discovery of embedded target genes in biosynthetic gene clusters

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
BIOINFO2011: "分子进化研究笔记", Retrieved from the Internet <URL:https://www.jianshu.com/p/cc5f9cac2251> *
LOTUSQ: "进化聚类(Evolutionary Clustering)", Retrieved from the Internet <URL:CSDN博客 https://blog.csdn.net/qq_30057549/article/details/88219384> *
程廷才, 夏庆友, 刘春, 赵萍, 查幸福, 徐汉福, 向仲怀: "家蚕chi、gluE和fruA基因与微生物相应基因的同源性及基因水平转移初探", 遗传学报, no. 10 *
袁伟;吴宏宇;黄青山;: "25株SIV/HIV的进化分析及意义", 复旦学报(自然科学版), no. 03 *
郑姣妹 等: "蛇蛔虫 ITS 及 5.8SrDNA 的克隆及进化分析", 《中国兽医学报 》, vol. 32, no. 5 *
郑巍;罗阿蓉;史卫峰;郑为民;朱朝东;: "系统发育分析中的最大简约法及其优化", 昆虫学报, no. 10 *

Also Published As

Publication number Publication date
CN116895328B (en) 2023-12-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant