CN113113083B - Tumor driving pathway prediction system for collective cell mutation data and protein network - Google Patents

Publication number
CN113113083B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110383651.3A
Other languages
Chinese (zh)
Other versions
CN113113083A (en
Inventor
吴昊
陈中立
董记华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/50 Mutagenesis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The present disclosure provides a system for predicting tumor driver pathways from combined somatic mutation data and a protein network, comprising: a weighted undirected network graph construction module configured to construct a weighted undirected network graph using a vertex weighting method and an edge weighting method; a weighted directed network graph construction module configured to perform a random walk with restart, based on the heat diffusion principle, to construct a weighted directed graph; an initial candidate module set construction module configured to create an initial candidate module set containing a certain number of genes using the strong connectivity of the directed network graph; and an optimal driver module set construction module configured to split the large modules in the initial candidate module set by means of induced subgraphs and to expand the small modules using a greedy strategy, to obtain an optimal driver module set. The biological network constructed in this way integrates the biological correlation among genes and also reflects their topological correlation, applying topological features of complex networks to a biological network.

Description

Tumor driving pathway prediction system of collective cell mutation data and protein network
Technical Field
The disclosure belongs to the technical field of medical and biological data processing, and in particular relates to a tumor driver pathway prediction system that integrates somatic mutation data and a protein network.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
To study the origin and pathogenesis of cancer, research at the molecular level has advanced rapidly. Large-scale sequencing resources such as The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) have emerged, and cancer genomics data continue to improve. By analyzing the differences between the genetic data of normal cells and cancer cells, researchers have found that although a cancer cell may carry thousands of gene mutations, not every mutation affects cellular function. Detecting the functional, frequently mutated gene sets that cause cancer from large volumes of somatic mutation data is one of the important challenges in the medical field.
In recent years, drug targets and therapies directed at cancer driver genes have opened up new areas of cancer research and treatment. A suitable drug can selectively act on a driver gene carrying a functional mutation and regulate its expression and transcription levels, thereby controlling the development and progression of the cancer. However, counting individual mutated genes does not effectively identify all cancer driver mutations, so screening for cancer driver modules and dysregulated modules whose genes interact is more biologically meaningful than screening for single driver genes. Screening for such modules deepens the understanding of cancer pathogenesis, and inferring the interactions between the upstream and downstream genes of a module can provide more drug targets for clinical treatment, together with a reliable theoretical basis and data support for precision or personalized medicine.
In recent decades, numerous proteomic experiments have generated protein-protein interaction (PPI) network data sets for a variety of tissues. Proteins are the products encoded by genes. Genes are relatively static, whereas proteins are dynamic: they are the main embodiment and executors of biological function, and they regulate and mediate many cellular activities. Protein interactions play an important role in cell structure, transcription, splice sites, cell cycle control and other processes. Identifying pathogenic driver modules on a PPI network therefore offers practical guidance for studying disease mechanisms and for disease prevention.
Cancer genomics data provide the data needed to mine cancer driver modules comprehensively; mature graph theory plays an important role in the construction of biological networks; and various optimization algorithms offer valuable experience for mining cancer driver modules efficiently and accurately. Techniques for biological data processing, biological network construction and module detection, based on the human PPI network and somatic mutation data, are therefore of great significance for mining cancer driver modules.
In previous research, most oncogenic driver mining models have lacked generality and robustness and show limitations when applied to different data sets. They also fail to handle the heterogeneity of mutated genes effectively, so the accuracy of the detected oncogenic driver modules is low.
Disclosure of Invention
To overcome the shortcomings of the prior art, the present disclosure provides a tumor driver pathway prediction system that integrates somatic mutation data and a protein network. It combines the mutual exclusivity and coverage among genes, which are widespread in genome atlases, with the similarity between nodes of a complex network, effectively avoiding the problem of mutation heterogeneity and improving the accuracy and completeness of oncogenic module mining.
In order to achieve the above object, one or more embodiments of the present disclosure provide the following technical solutions:
In a first aspect, a tumor driver pathway prediction system integrating somatic mutation data and a protein network is disclosed, comprising:
a weighted undirected network graph construction module configured to: construct a weighted undirected network graph on the basis of a protein interaction network using a vertex weighting method and an edge weighting method;
a weighted directed network graph construction module configured to: perform a random walk with restart on the constructed interaction network, using the heat diffusion principle, to construct a weighted directed graph;
an initial candidate module set construction module configured to: create an initial candidate module set containing a certain number of genes using the strong connectivity of the directed network graph;
an optimal driver module set construction module configured to: split the large modules in the initial candidate module set by means of induced subgraphs, and expand the small modules using a greedy strategy, to obtain an optimal driver module set.
In a further technical solution, the system also comprises a data processing module configured to: for the somatic mutation data, first filter out hypermutated samples and genes with low expression across all tumor types, and then apply a gene expression filter to delete the genes that do not meet the requirements.
In a further technical solution, the data processing module is further configured to process the combined human PPI network data, which comprise the high-quality interaction database HINT and the human interaction database HI2012:
the interaction data in HINT and HI2012 are merged, and closed loops and duplicate edges in the network are then deleted, finally yielding a protein interaction network composed of proteins and their interactions.
In a further technical solution, when the weighted undirected network graph construction module constructs the weighted undirected network graph, the vertex weight of each vertex is calculated using the node degree, and the edge weight of each edge is calculated using the random walk with restart; the mutual exclusivity, coverage and similarity between gene pairs are calculated separately to construct the gene network.
In a further technical solution, in the weighted directed network graph construction module, the random walk with restart works as follows: after a source-node gene moves to an adjacent node with a certain probability, it returns to the source node with the restart probability, and this process is iterated until a steady state is reached;
the random walk with restart creates a weighted directed edge for each pair of gene nodes, finally yielding the edge-weighted directed graph.
In a further technical solution, when the initial candidate module set construction module constructs the initial candidate module set, the strongly connected components of the weighted directed graph are generated and added as modules to the initial module set; modules smaller than a set value are deleted from the initial module set; and the minimum-weight edge in each module is deleted iteratively until the size of the initial module set does not exceed the total number, finally yielding the initial candidate module set.
In the splitting process, each induced subgraph of the directed graph has a node with the largest degree; if the number of nodes in the local network of that node is not less than the set value, it is assigned as a seed module; otherwise, it is assigned as a leaf module.
In a further technical solution, when expanding the small modules using a greedy strategy, the optimal driver module set construction module adds the leaf modules that meet the condition to the seed modules: a leaf node connected to any node in a seed module is selected, and an expansion function determines whether to expand; if the expansion condition is met, the node is added to the seed module, and otherwise it is not.
In a further technical solution, the system further includes a driver module set verification module configured to: obtain the enrichment of the driver modules in known pathways using a gene set enrichment analysis tool, and calculate the enrichment of a driver module in a single pathway.
In a second aspect, a computing device is disclosed that is implemented on a server, the server comprising:
a weighted undirected network graph construction module configured to: construct a weighted undirected network graph on the basis of a protein interaction network using a vertex weighting method and an edge weighting method;
a weighted directed network graph construction module configured to: perform a random walk with restart on the constructed interaction network, using the heat diffusion principle, to construct a weighted directed graph;
an initial candidate module set construction module configured to: create an initial candidate module set containing a certain number of genes using the strong connectivity of the directed network graph;
an optimal driver module set construction module configured to: split the large modules in the initial candidate module set by means of induced subgraphs, and expand the small modules using a greedy strategy, to obtain an optimal driver module set.
The above one or more technical solutions have the following beneficial effects:
Oncogenic modules mined from a single cancer data set lack generality and robustness and show limitations when applied to a new data set. The technical solution of the present disclosure integrates somatic mutation data with a protein interaction network and combines the mutual exclusivity and coverage among genes, which are widespread in genome atlases, with the similarity between nodes of a complex network, thereby effectively avoiding the problem of mutation heterogeneity and improving the accuracy and completeness of oncogenic module mining.
Most existing oncogenic module algorithms construct a weighted biological network using two properties of a gene set, high coverage and high mutual exclusivity, and some consider only the mutual exclusivity among genes. In addition to these two properties, the technical solution of the present disclosure takes the topological correlation of a complex network into account by adding the topological similarity between gene nodes. The biological network constructed in this way integrates the biological correlation among genes and also reflects their topological correlation, applying topological features of complex networks to a biological network.
The solution can accurately mine a set of driver modules with high biological relevance and statistical significance, which is of considerable value for research on cancer pathogenesis and drug targets.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and are not to limit the disclosure.
FIG. 1 is a comparison of F-measure values for embodiments of the present disclosure;
FIG. 2 is a comparison of module enrichment in accordance with an embodiment of the disclosure;
FIG. 3 is a flow chart of a driver module mining algorithm.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit example embodiments according to the present disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should further be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.
The embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict.
Most current oncogenic driver mining models lack generality and robustness and show limitations when applied to different data sets; they also fail to handle the heterogeneity of mutated genes effectively, so the accuracy of the detected oncogenic driver modules is low. To address this, the present technical solution combines the mutual exclusivity and coverage among genes, which are widespread in genome atlases, with the similarity between nodes of a complex network, effectively avoiding the problem of mutation heterogeneity and improving the accuracy and completeness of oncogenic module mining.
Example one
This example discloses a tumor driver pathway prediction system integrating somatic mutation data and a protein network, comprising:
a weighted undirected network graph construction module configured to: construct a weighted undirected network graph on the basis of a protein interaction network using a vertex weighting method and an edge weighting method;
a weighted directed network graph construction module configured to: perform a random walk with restart on the constructed interaction network, using the heat diffusion principle, to construct a weighted directed graph;
an initial candidate module set construction module configured to: create an initial candidate module set containing a certain number of genes using the strong connectivity of the directed network graph;
an optimal driver module set construction module configured to: split the large modules in the initial candidate module set by means of induced subgraphs, and expand the small modules using a greedy strategy, to obtain an optimal driver module set.
In a specific implementation, which is based on the ECSWalk algorithm, the modules further include:
a data processing module: the data comprise somatic mutation data from TCGA and the combined human PPI network data HINT+HI2012.
The somatic mutation data come from the TCGA pan-cancer data set, which covers 12 cancer types and consists of SNVs in 20472 genes from 3281 samples and CNAs in 720 genes from 4334 samples;
the data are preprocessed as follows: hypermutated samples and genes with low expression across all tumor types are filtered out first; gene expression filters are then applied to delete 218 genes without significant mutations in the SNV data, delete genes with contradictory CNAs, delete 5 samples without SNVs and 1973 samples with CNAs only, and delete 7894 genes with fewer than 3 RNA-seq reads in more than 30% of the tumors of each cancer type. The resulting data set contains 3110 samples and a total of 11565 somatically mutated genes.
The PPI network data HINT+HI2012 comprise the high-quality interaction database (HINT) and the human interaction database (HI2012).
During data processing, the interaction data in HINT and HI2012 are merged, and closed loops and duplicate edges in the network are then deleted, finally yielding a protein interaction network of 9858 proteins and 40704 interactions.
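The merging step described above can be illustrated with a minimal Python sketch. It assumes that each database is available as a list of protein pairs and that "closed loops" means self-interactions (an edge from a protein to itself); the function name `merge_ppi` and the toy edge lists are hypothetical and are not taken from HINT or HI2012.

```python
def merge_ppi(*edge_lists):
    """Merge undirected edge lists, dropping self-loops and duplicate edges."""
    merged = set()
    for edges in edge_lists:
        for u, v in edges:
            if u == v:
                # Assumption: a "closed loop" is a self-interaction.
                continue
            # (A, B) and (B, A) denote the same undirected interaction.
            merged.add(tuple(sorted((u, v))))
    return sorted(merged)


# Toy inputs standing in for the two databases (illustrative only).
hint = [("TP53", "MDM2"), ("MDM2", "TP53"), ("TP53", "TP53")]
hi2012 = [("TP53", "MDM2"), ("BRCA1", "BARD1")]

network = merge_ppi(hint, hi2012)
proteins = sorted({p for edge in network for p in edge})
print(len(proteins), "proteins,", len(network), "interactions")
```

Storing each edge as a sorted tuple in a set removes duplicates regardless of the order in which the two proteins were listed.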
In a specific implementation, the PPI network is represented by $G = (V, E)$, where the node set $V = \{u_1, u_2, u_3, \ldots, u_n\}$ corresponds to the set of mutated genes and the edge set $E = \{e = (u_i, u_j)\}$ represents the set of protein interactions. Each vertex $u_i \in V$ is the protein in the network that corresponds to gene $g_i$, and an undirected edge $(u_i, u_j) \in E$ represents the interaction between the proteins corresponding to the gene pair $(g_i, g_j)$. Therefore $g_i$ denotes both the gene and its corresponding protein in $G$.
The construction of the edge weighting network specifically includes: a weighted undirected network graph construction module and a weighted directed network graph construction module.
In general, the construction of the weighted network graph involves three attributes: mutual exclusivity, coverage, and similarity. They are calculated as follows:
1. Mutual exclusivity. $S_i$ denotes the set of patient samples in which gene $g_i$ is mutated, and $M$ denotes a set of genes among the network nodes (its defining expression is an equation image that is not reproduced in this text). For any pair of genes $g_i, g_j \in M$ with $g_i \neq g_j$, if $S_i \cap S_j = \emptyset$, the genes in $M$ are mutually exclusive. The mutual exclusivity of $M$ is expressed as:
[Equation image not reproduced in this text.]
2. Coverage. The coverage of $M$ is expressed as:
[Equation image not reproduced in this text.]
3. Similarity. In this embodiment, node similarity from complex network theory is applied to the biological network. Constructing the similarity involves two main steps: building a probability set for each node, and defining the node similarity by combining it with the Jensen-Shannon (JS) divergence.
In constructing the probability set, $d_i$ is the degree of the $i$-th gene node in the network and $d_{max}$ is the maximum node degree; the probability set of each node has $N$ elements, where $N = d_{max} + 1$. For gene node $g_i$, the sum of the degrees of the nodes in its local network is expressed as:
[Equation image not reproduced in this text.]
where $n$ is the number of genes in the local network and $d(j)$ is the degree of the $j$-th gene node in the local network of gene node $g_i$. The discrete probability of gene node $g_i$ in the local network is defined as:
[Equation image not reproduced in this text.]
The discrete probabilities of gene node $g_i$ are normalized and sorted in descending order, giving the discrete probability set $P(i) = (p_i(1), p_i(2), \ldots, p_i(n), \ldots, p_i(N))$. The Kullback-Leibler (KL) divergence between nodes $g_i$ and $g_j$ is expressed as:
[Equation image not reproduced in this text.]
To resolve the asymmetry of the KL divergence, so that swapping $P(i)$ and $P(j)$ yields the same result, a JS divergence is derived from the KL divergence and expressed as:
[Equation image not reproduced in this text.]
Thus, the similarity between a gene pair is:
[Equation image not reproduced in this text.]
A larger value of $SIM(g_i, g_j)$ indicates that the two gene nodes are more similar in the network.
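The similarity construction can be sketched as follows. Because the equations themselves are not reproduced in this text, the sketch rests on stated assumptions rather than on the patent's exact formulas: the local network of a node is taken to be the node plus its direct neighbors, each probability is a node degree divided by the degree sum of the local network, the JS divergence is the standard mixture form with base-2 logarithms, and the similarity is taken as one minus the JS divergence. All function names are hypothetical.

```python
import math


def degree_distribution(adj, node, length):
    """Sorted, zero-padded degree probabilities of a node's local network.

    Assumption: the local network is the node plus its direct neighbors,
    and every node has at least one neighbor.
    """
    local = [node] + sorted(adj[node])
    degrees = sorted((len(adj[v]) for v in local), reverse=True)
    total = sum(degrees)
    probs = [d / total for d in degrees]
    return probs + [0.0] * (length - len(probs))


def kl(p, q):
    """KL divergence in bits; terms with p(k) = 0 contribute nothing."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def js(p, q):
    """Standard Jensen-Shannon divergence (symmetric, in [0, 1] for base 2)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def similarity(adj, u, v):
    """Assumed form SIM = 1 - JS; larger values mean more similar nodes."""
    n = max(len(nb) for nb in adj.values()) + 1  # N = d_max + 1
    return 1.0 - js(degree_distribution(adj, u, n),
                    degree_distribution(adj, v, n))


# Toy path graph a - b - c - d (illustrative only).
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
print(similarity(adj, "a", "d"), similarity(adj, "a", "b"))
```

In the toy graph, the two end nodes have identical local degree profiles and therefore the maximum similarity, while an end node and an inner node score lower.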
The weighted undirected network graph construction module is configured to: construct, on the basis of the protein interaction network, a weighted undirected network graph $G_\omega$ using a vertex weighting method and an edge weighting method. For each vertex $g_i \in V$, the vertex weight is $\omega(g_i) = CD(\{g_i\})$; for each edge $(g_i, g_j) \in E$, the edge weight is:
[Equation image not reproduced in this text.]
where the mutual exclusivity between a gene pair is:
[Equation image not reproduced in this text.]
$Ne(g_i)$ denotes the local network of node $g_i$:
[Equation image not reproduced in this text.]
The coverage between a gene pair is $CD(g_i, g_j) = CD(\{g_i\}) \times CD(\{g_j\})$, and the similarity between a gene pair is:
[Equation image not reproduced in this text.]
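The vertex weight and pairwise coverage stated above can be illustrated with a short sketch. The set-level formulas are not reproduced in this text, so the sketch assumes definitions commonly used in the literature: coverage is the fraction of all samples with a mutation in at least one gene of the set, and mutual exclusivity is the size of the union of the sample sets divided by the sum of their sizes. The function names and toy data are hypothetical, and the way the three attributes are combined into the edge weight is deliberately left out.

```python
def coverage(samples_by_gene, genes, n_samples):
    """Assumed CD(M): fraction of samples mutated in at least one gene of M."""
    covered = set().union(*(samples_by_gene[g] for g in genes))
    return len(covered) / n_samples


def mutual_exclusivity(samples_by_gene, genes):
    """Assumed form: |union of S_i| / sum of |S_i| (1.0 when no sample is shared)."""
    sets = [samples_by_gene[g] for g in genes]
    total = sum(len(s) for s in sets)
    if total == 0:
        return 0.0
    return len(set().union(*sets)) / total


def vertex_weight(samples_by_gene, gene, n_samples):
    """omega(g_i) = CD({g_i}), as stated in the text."""
    return coverage(samples_by_gene, [gene], n_samples)


def pair_coverage(samples_by_gene, gi, gj, n_samples):
    """CD(g_i, g_j) = CD({g_i}) x CD({g_j}), as stated in the text."""
    return (vertex_weight(samples_by_gene, gi, n_samples)
            * vertex_weight(samples_by_gene, gj, n_samples))


# Toy data: gene -> set of sample ids in which it is mutated (4 samples total).
samples_by_gene = {"A": {1, 2}, "B": {3}, "C": {2, 3}}
n_samples = 4

print(vertex_weight(samples_by_gene, "A", n_samples))       # 0.5
print(pair_coverage(samples_by_gene, "A", "B", n_samples))  # 0.125
print(mutual_exclusivity(samples_by_gene, ["A", "B"]))      # 1.0
print(mutual_exclusivity(samples_by_gene, ["A", "C"]))      # 0.75
```

Genes A and B share no samples, so they are fully mutually exclusive under the assumed definition, whereas A and C share sample 2 and score lower.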
The weighted directed network graph construction module is configured to: perform a random walk with restart on the constructed interaction network, using the heat diffusion principle, to construct a weighted directed graph. In a random walk with restart, after a source-node gene moves to an adjacent node with a certain probability, it returns to the source node with the restart probability; this process is iterated until a steady state is reached. The iteration formula is $F_{t+1} = (1-\beta) P F_t + \beta F_0$, where $F_0$ is the initial state of the source node, $F_0 = CD(g_i)$, and $F_t$ is the probability distribution at time $t$. The parameter $\beta$ is the restart probability of returning from a neighbor node to the source node; it controls how much heat a source-node gene diffuses to its direct neighbors versus the rest of the network. A suitable $\beta$ is chosen so that every source node retains most of its heat within its direct neighbors; here $\beta = 0.4$. $P$ is the transition probability matrix of the random walk with restart and is positively correlated with the edge weights:
[Equation image not reproduced in this text.]
At the steady state of the random walk with restart, the edge weights are obtained from $F = \beta\,(I - (1-\beta)P)^{-1} F_0$, where $P$ has entries $P(g_i, g_j)$. The random walk with restart creates a directed edge of weight $F$ for each pair of gene nodes $g_i$ and $g_j$ ($i \neq j$), finally yielding the edge-weighted directed graph $G_d$.
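The iteration above can be sketched in plain Python. The transition matrix formula is not reproduced in this text, so the sketch assumes that $P$ is obtained by normalizing each column of the edge weight matrix to sum to one, and that $F_0$ is a diagonal matrix holding the vertex weights; both are assumptions, and the function name `random_walk_with_restart` is hypothetical. Instead of inverting the matrix, the sketch iterates the update rule until it converges to the same steady state.

```python
def random_walk_with_restart(weights, heat, beta=0.4, tol=1e-12, max_iter=10_000):
    """Iterate F <- (1 - beta) * P * F + beta * F0 until convergence.

    weights: square matrix of undirected edge weights.
    heat:    initial heat (vertex weight) of each node.
    Column j of the result is the heat spread from source node j.
    """
    n = len(weights)
    col_sums = [sum(weights[i][j] for i in range(n)) for j in range(n)]
    # Assumption: P is the column-normalized edge weight matrix.
    P = [[weights[i][j] / col_sums[j] if col_sums[j] else 0.0
          for j in range(n)] for i in range(n)]
    F0 = [[heat[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
    F = [row[:] for row in F0]
    for _ in range(max_iter):
        nxt = [[(1 - beta) * sum(P[i][k] * F[k][j] for k in range(n))
                + beta * F0[i][j] for j in range(n)] for i in range(n)]
        delta = max(abs(nxt[i][j] - F[i][j]) for i in range(n) for j in range(n))
        F = nxt
        if delta < tol:
            break
    return F


# Toy 3-node network (illustrative weights and heats only).
weights = [[0.0, 1.0, 0.0],
           [1.0, 0.0, 2.0],
           [0.0, 2.0, 0.0]]
heat = [0.5, 0.3, 0.2]
F = random_walk_with_restart(weights, heat)

# F[i][j] with i != j is read as the weight of the directed edge j -> i.
for row in F:
    print([round(x, 4) for x in row])
```

Because the assumed $P$ is column-stochastic, the heat placed on each source node is conserved: every column of the steady-state matrix sums to that node's initial heat. The result is in general not symmetric, which is what makes the resulting graph directed.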
In the system, the driver module set detection module comprises the initial candidate module set construction module and the optimal driver module set construction module.
Specifically, the initial candidate module set construction module is configured to create an initial candidate module set containing a certain number of genes using the strong connectivity of the directed network graph. For the weighted directed graph $G_d$, its strongly connected components are generated and added as modules to an initial module set $P$; modules smaller than min_module_size are deleted from $P$; and the minimum-weight edge in each module is deleted iteratively until the size of $P$ is no greater than total_genes. The initial candidate module set $P = (M_1, M_2, \ldots, M_r)$ is finally obtained.
The optimal driver module set construction module comprises a step of splitting the large modules in the module set and a step of expanding with the leaf modules in the module set, and is configured to: split the large modules in the candidate module set by means of induced subgraphs, and expand the small modules using a greedy strategy, to obtain an optimal driver module set.
To split the large modules in the module set: given the weighted directed graph $G_d$ and a module $M_q$, let $L = \{G_d(M_q)\}$ be the set of subgraphs of $G_d$ induced by the modules, whose nodes correspond to genes in the directed graph. A module with more than split_size nodes is treated as a large module and split. During splitting, for each induced subgraph $G_c \in L$, let $v'$ be the node with the largest out-degree in $G_c$, and let $IN(v')$ denote the local network of $v'$ in $G_c$. If $IN(v')$ contains at least min_module_size nodes, it is assigned as a seed module; otherwise, it is assigned as a leaf module. The remaining strongly connected subgraphs of $G_c$ are split in the same way.
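The construction of the initial candidate module set can be sketched as follows. The sketch assumes that the "size of P" means the total number of genes across all kept modules and that min_module_size is at least 2; the strongly connected components are found with Kosaraju's algorithm, which is an implementation choice of this sketch and is not named in the text. The function names and toy graph are hypothetical.

```python
def strongly_connected_components(nodes, edges):
    """Kosaraju's algorithm; `edges` maps (u, v) -> weight."""
    fwd = {n: [] for n in nodes}
    rev = {n: [] for n in nodes}
    for u, v in edges:
        fwd[u].append(v)
        rev[v].append(u)

    def dfs(graph, start, seen, out):
        # Iterative depth-first search that appends nodes in post-order.
        stack = [(start, iter(graph[start]))]
        seen.add(start)
        while stack:
            node, it = stack[-1]
            for nxt in it:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                stack.pop()
                out.append(node)

    order, seen = [], set()
    for n in nodes:
        if n not in seen:
            dfs(fwd, n, seen, order)
    comps, seen = [], set()
    for n in reversed(order):
        if n not in seen:
            comp = []
            dfs(rev, n, seen, comp)
            comps.append(set(comp))
    return comps


def initial_modules(nodes, edges, min_module_size=2, total_genes=5):
    """Prune minimum-weight edges until the kept modules hold <= total_genes genes."""
    edges = dict(edges)
    while True:
        comps = strongly_connected_components(nodes, edges)
        modules = [c for c in comps if len(c) >= min_module_size]
        if sum(len(m) for m in modules) <= total_genes:
            return modules
        removed = False
        for m in modules:
            inside = [(w, e) for e, w in edges.items() if e[0] in m and e[1] in m]
            if inside:
                del edges[min(inside)[1]]  # drop the lightest edge in this module
                removed = True
        if not removed:
            return modules


# Toy weighted directed graph (illustrative only).
nodes = ["a", "b", "c", "d", "e", "f", "g"]
edges = {("a", "b"): 0.9, ("b", "a"): 0.8, ("b", "c"): 0.1, ("c", "b"): 0.2,
         ("c", "d"): 0.7, ("d", "c"): 0.6, ("e", "a"): 0.5,
         ("f", "g"): 0.4, ("g", "f"): 0.3}
modules = initial_modules(nodes, edges, min_module_size=2, total_genes=5)
print(sorted(sorted(m) for m in modules))
```

In the toy graph the two initial components hold six genes, which exceeds the limit of five, so the lightest edge in each component is removed; the large component then falls apart into two tightly connected pairs and the weak pair is discarded.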
To expand with the leaf modules in the module set, leaf modules that meet the condition are added to the seed modules using a greedy strategy. A leaf node connected to any node in a seed module is selected, and the expansion function is expressed as:
[Equation image not reproduced in this text.]
where the first term (an equation image not reproduced in this text) denotes the average weight of the edges connecting node $g_m$ in the leaf module to nodes in the seed module, and the second term (likewise not reproduced) denotes the average weight of the edges connecting node $g_m$ to the other nodes in the leaf module. If the average weight of the edges connecting node $g_m$ to the seed module is not less than the average weight of the edges connecting it to the outside, the node is added to the seed module; otherwise, it is not added. The overall procedure is shown in FIG. 3.
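The greedy expansion rule can be sketched as follows. The expansion function itself is not reproduced in this text, so the sketch implements only the verbal rule: a leaf node adjacent to the seed module is moved into it when the average weight of its edges to the seed module is not less than the average weight of its edges to the other leaf nodes. Counting edges in both directions and the function names are assumptions of this sketch.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def link_weights(edges, node, group):
    """Weights of directed edges between `node` and members of `group` (either direction)."""
    return [w for (u, v), w in edges.items()
            if (u == node and v in group) or (v == node and u in group)]


def expand_seed(seed, leaf, edges):
    """Greedily move qualifying leaf nodes into the seed module."""
    seed, leaf = set(seed), set(leaf)
    changed = True
    while changed:
        changed = False
        for g in sorted(leaf):
            to_seed = link_weights(edges, g, seed)
            if not to_seed:
                continue  # only leaf nodes connected to the seed are candidates
            to_leaf = link_weights(edges, g, leaf - {g})
            if mean(to_seed) >= mean(to_leaf):
                seed.add(g)
                leaf.discard(g)
                changed = True
    return seed, leaf


# Toy modules and edge weights (illustrative only).
edges = {("b", "c"): 0.8, ("c", "d"): 0.3, ("d", "e"): 0.9,
         ("e", "d"): 0.9, ("d", "a"): 0.1}
seed, leaf = expand_seed({"a", "b"}, {"c", "d", "e"}, edges)
print(sorted(seed), sorted(leaf))
```

Node c is pulled into the seed module because its link to the seed (0.8) outweighs its link to the leaf module (0.3), while node d stays behind because it is bound more strongly to e (0.9) than to the seed (0.2 on average).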
In order to evaluate the accuracy of the mining drive module, an F-measure evaluation method is used for measuring the accuracy of the drive module based on a known channel, the higher the F-measure value is, the more the drive module can be enriched on the known biological channel, and the higher the accuracy of the mining drive module is, the F-measure calculation formula is as follows:
Figure BDA0003013980920000111
Figure BDA0003013980920000112
Figure BDA0003013980920000113
the gene set enrichment analysis tool DAVID was used to obtain the enrichment of the driver module based on the known pathway, while the enrichment of the driver module on one pathway was calculated using the following formula.
Figure BDA0003013980920000114
Wherein N represents the number of all genes, K represents the number of genes in a known biological pathway, n represents the number of genes in an oncogenic driver module, and k represents the number of genes shared by the known pathway and the oncogenic driver module. A driver module with p-value < 0.01 is set as the positive class; otherwise it is the negative class. All p-values are corrected using the Benjamini-Hochberg method. TP represents the number of positive-class modules predicted as positive, FP represents the number of negative-class modules predicted as positive, TN represents the number of negative-class modules predicted as negative, and FN represents the number of positive-class modules predicted as negative.
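The evaluation quantities can be illustrated with a short Python sketch (Python 3.8 or later for `math.comb`). The formula images are not reproduced in this text, so the sketch assumes the usual forms: an upper-tail hypergeometric test for the enrichment p-value, the standard Benjamini-Hochberg adjustment, and the F-measure as the harmonic mean of precision and recall. The function names and the example numbers are hypothetical.

```python
from math import comb


def enrichment_p_value(N, K, n, k):
    """Probability of at least k pathway genes in a module of n genes.

    Assumes an upper-tail hypergeometric test with N genes in total,
    of which K belong to the pathway.
    """
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical numbers: 10 genes, 4 in the pathway, module of 3, overlap 2.
p = enrichment_p_value(N=10, K=4, n=3, k=2)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
positives = [q < 0.01 for q in adjusted]  # positive class per the threshold
score = f_measure(tp=3, fp=1, fn=0)
```

With these counts, a module set with precision 0.75 and recall 1 gets an F-measure of about 0.857, and an F-measure of 1 requires both FP and FN to be zero.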
The oncogenic driver modules mined from multiple cancer datasets by this method have high F-measure values and low p-values, which shows that mining oncogenic driver modules by combining complex network characteristics with biological characteristics works well. Applying the topological characteristics of complex networks to biological networks makes it possible to study the pathogenesis of cancer from the inherent properties of the data, and points researchers and medical staff toward related studies from the perspective of complex network topology. The scheme also accurately identifies target gene sets for common cancers, deepens the understanding of cancer pathogenesis, and provides a methodology for exploring it further. By inferring the interactions between genes upstream and downstream of the driver modules, it provides more drug targets for the clinical treatment of cancer, and a reliable theoretical basis and data support for precision or personalized medicine. The method therefore has theoretical guiding significance and practical value for cancer diagnosis, treatment and drug target discovery.
Compared with the Dendrix and Multi-Dendrix algorithms, the ECSWalk algorithm does not need the number of genes in a driver pathway or the maximum number of genes in each driver pathway to be specified in advance, and does not need to be solved within a preset time limit. Starting from the constructed interaction network, the ECSWalk algorithm builds a weighted directed network graph using a random walk with restart, and creates an initial candidate module set with a certain number of genes using the strong connectivity principle of directed graphs. It then splits the large modules in the candidate module set using induced subgraphs and expands the small modules using a greedy strategy, thereby obtaining an optimal driver module set. Unlike the Dendrix and Multi-Dendrix algorithms, the ECSWalk algorithm uses data from multiple cancers, and the driver modules it identifies are more complete and more accurate.
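The first two stages of this pipeline can be sketched in Python under simplifying assumptions. The sketch uses uniform transition probabilities as a placeholder for the weighted transitions of the actual method, assumes every node has at least one neighbour, takes the steady-state probability of reaching v from source s as the weight of the directed edge s -> v, and uses Kosaraju's algorithm for the strongly connected components. All function names, the restart value and the threshold are hypothetical.

```python
def random_walk_with_restart(neighbors, source, restart=0.4,
                             tol=1e-10, max_iter=1000):
    """Steady-state visiting probabilities of a walk restarting at `source`."""
    nodes = sorted(neighbors)
    p = {v: 1.0 if v == source else 0.0 for v in nodes}
    for _ in range(max_iter):
        nxt = {v: (restart if v == source else 0.0) for v in nodes}
        for u in nodes:
            # Uniform transition: split the moving mass over the neighbours.
            share = (1.0 - restart) * p[u] / len(neighbors[u])
            for v in neighbors[u]:
                nxt[v] += share
        delta = sum(abs(nxt[v] - p[v]) for v in nodes)
        p = nxt
        if delta < tol:
            break
    return p


def weighted_directed_edges(neighbors, restart=0.4, threshold=0.0):
    """One weighted directed edge s -> v per pair above the threshold."""
    edges = {}
    for s in neighbors:
        p = random_walk_with_restart(neighbors, s, restart)
        for v, w in p.items():
            if v != s and w > threshold:
                edges[(s, v)] = w
    return edges


def strongly_connected_components(nodes, edges):
    """Kosaraju's algorithm with an iterative depth-first search."""
    fwd = {v: [] for v in nodes}
    rev = {v: [] for v in nodes}
    for u, v in edges:
        fwd[u].append(v)
        rev[v].append(u)

    def dfs(start, graph, seen, out):
        stack = [(start, iter(graph[start]))]
        seen.add(start)
        while stack:
            node, it = stack[-1]
            for nxt in it:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                stack.pop()
                out.append(node)  # record node in post-order

    order, seen = [], set()
    for v in nodes:
        if v not in seen:
            dfs(v, fwd, seen, order)
    components, seen = [], set()
    for v in reversed(order):
        if v not in seen:
            comp = []
            dfs(v, rev, seen, comp)
            components.append(set(comp))
    return components


# Hypothetical three-gene path network A - B - C.
toy_net = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
toy_directed = weighted_directed_edges(toy_net, restart=0.4, threshold=0.15)
candidates = strongly_connected_components(sorted(toy_net), toy_directed)
```

Each strongly connected component of the thresholded directed graph would then serve as one initial candidate module, to be filtered by size before the splitting and expansion steps.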
In contrast to ECSWalk, the HotNet2 algorithm considers only the degree of each vertex, i.e. the coverage of the genes, in the transition probability of its random walk and ignores mutual exclusivity between genes. It therefore cannot effectively handle gene mutation heterogeneity in pan-cancer data experiments, where the differences between samples are large.
The MEXCOwalk algorithm, in turn, does not adequately consider mutual exclusivity among gene sets when expanding small modules, so its module identification accuracy is limited. The ECSWalk algorithm uses a greedy strategy when expanding small modules, which effectively improves module identification accuracy.
The pathway enrichment performance of the ECSWalk, MEXCOwalk and HotNet2 algorithms, obtained by DAVID enrichment analysis, is shown in fig. 1. With p-value < 0.01, in the comparison across 7 cancers the ECSWalk algorithm achieved an F-measure of 1 for 4 cancers, which indicates that its precision and recall are both 1 in these 4 cancers; that is, every module is in the positive class both as predicted and as actually observed (p-value below 0.01 in both cases). This indicates that the ECSWalk algorithm is highly accurate and effective at driver module detection. In the other 3 cancers, UCEC, NSCLC and PAAD, the ECSWalk algorithm had higher F-measure values than the MEXCOwalk and HotNet2 algorithms, and its recall was 1 in both UCEC and PAAD. The driver module sets mined by ECSWalk are therefore more accurate and more enriched in known biological pathways than those mined by MEXCOwalk and HotNet2, and ECSWalk showed better driver module detection in the data of all 7 cancers.
As can be seen from fig. 2, compared with the HotNet2 and MEXCOwalk algorithms, the ECSWalk algorithm finds more modules with stronger biological relevance and higher statistical significance.
Those skilled in the art will appreciate that the modules or steps of the present disclosure described above can be implemented using general purpose computer means, or alternatively, they can be implemented using program code executable by computing means, whereby the modules or steps may be stored in memory means for execution by the computing means, or separately fabricated into individual integrated circuit modules, or multiple modules or steps thereof may be fabricated into a single integrated circuit module. The present disclosure is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.
Although the present disclosure has been described with reference to specific embodiments, it should be understood that the scope of the present disclosure is not limited thereto, and those skilled in the art will appreciate that various modifications and changes can be made without departing from the spirit and scope of the present disclosure.

Claims (9)

1. A system for predicting tumor-driven pathways by using collective cell mutation data and protein networks, which is characterized by comprising:
a weighted undirected network graph construction module configured to: on the basis of a protein interaction network, a weighted undirected network graph is constructed by using a vertex weighting method and an edge weighting method;
a weighted directed network graph building module configured to: on the basis of the constructed interaction network, perform a random walk with restart based on the heat diffusion principle to construct a weighted directed graph;
an initial candidate module set building module configured to: creating an initial candidate module set with a certain number of genes by utilizing the strong connectivity principle of a directed network graph;
an optimal drive module set building module configured to: splitting a large module in the initial candidate module set by using a derived subgraph mode, and expanding a small module by using a greedy strategy to obtain an optimal drive module set;
when the optimal drive module set building module expands the small-sized module by using a greedy strategy, adding a leaf module meeting conditions into the seed module by using the greedy strategy, selecting a leaf node connected with any node in the seed module, and judging whether the expansion is carried out by using an expansion function, wherein the expansion function is expressed as:
Figure FDA0003711412760000011
wherein,
Figure FDA0003711412760000012
represents the average weight of the edges connecting node g_m in the leaf module to the nodes in the seed module,
Figure FDA0003711412760000013
represents the average weight of the edges connecting node g_m in the leaf module to the other nodes in the leaf module; if the average weight of the edges connecting node g_m to the nodes in the seed module is not less than the average weight of the edges connecting it to nodes outside the seed module, the node is added to the seed module; otherwise it is not added.
2. The system for predicting tumor drive pathways using collective cellular mutation data and protein networks of claim 1, further comprising: a data processing module configured to: for the somatic mutation data, first remove hypermutated samples and genes with low expression in all tumor types, and then apply a gene expression filter to delete the genes that do not meet the requirement.
3. The system of claim 2, wherein the data processing module is further configured to: for the merged human PPI network data, which comprise the high-quality interaction database HINT and the human interaction database HI2012;
perform a merging operation on the interaction data in HINT and the interaction data in HI2012, and then delete closed loops and repeated edges in the network to finally obtain a protein interaction network composed of proteins and their interactions;
the operations of merging and of deleting closed loops and repeated edges are performed on the HINT and HI2012 protein interaction network data.
4. The system of claim 1, wherein the weighted undirected network graph construction module, when constructing the weighted undirected network graph, calculates vertex weights for each vertex using node degrees and edge weights for each edge using restart random walks; and respectively calculating mutual exclusion, coverage and similarity between gene pairs to construct a gene network.
5. The system of claim 1, wherein, in the weighted directed network graph building module, the random walk with restart moves from a source gene node to a neighboring node with a certain probability and returns to the source node with the restart probability, and the process is repeated until a stable state is reached for the source gene node;
the random walk with restart creates a weighted directed edge for each pair of gene nodes, finally yielding the edge-weighted directed graph.
6. The system of claim 1, wherein, when constructing the initial candidate module set, the initial candidate module set building module applies the strong connectivity principle of directed network graphs to the weighted directed graph, generates the strongly connected components of the directed network graph as modules and adds them to the initial module set, deletes the modules in the initial module set that are smaller than a set value, and iteratively deletes the minimum-weight edge in each module until the size of the initial module set does not exceed the total size, thereby obtaining the initial candidate module set.
7. The system of claim 1, wherein, when the optimal drive module set building module splits the large modules in the candidate module set using induced subgraphs, for the set of subgraphs induced by module nodes in the weighted directed graph, a module whose number of nodes is larger than a set value is split as a large module; in the splitting process, each induced subgraph of the directed graph has a node with the largest out-degree, and if the number of nodes in the local network of that node is not less than a set value, the local network is designated a seed module; otherwise it is designated a leaf module.
8. The system of claim 1, further comprising a drive module set validation module configured to: obtain the enrichment of the driver modules in known pathways using a gene set enrichment analysis tool, and calculate the enrichment of each driver module in a single pathway.
9. A computing device implemented on a server, the server comprising:
a weighted undirected network graph construction module configured to: on the basis of a protein interaction network, a weighted undirected network graph is constructed by using a vertex weighting method and an edge weighting method;
a weighted directed network graph building module configured to: on the basis of the constructed interaction network, perform a random walk with restart based on the heat diffusion principle to construct a weighted directed graph;
an initial candidate module set building module configured to: creating an initial candidate module set with a certain number of genes by utilizing the strong connectivity principle of a directed network graph;
an optimal drive module set building module configured to: splitting a large module in the initial candidate module set by using a derived subgraph mode, and expanding a small module by using a greedy strategy to obtain an optimal drive module set; when the optimal drive module set building module expands the small-sized module by using a greedy strategy, adding a leaf module meeting conditions into the seed module by using the greedy strategy, selecting a leaf node connected with any node in the seed module, and judging whether the expansion is carried out by using an expansion function, wherein the expansion function is expressed as:
Figure FDA0003711412760000041
wherein,
Figure FDA0003711412760000042
represents the average weight of the edges connecting node g_m in the leaf module to the nodes in the seed module,
Figure FDA0003711412760000043
represents the average weight of the edges connecting node g_m in the leaf module to the other nodes in the leaf module; if the average weight of the edges connecting node g_m to the nodes in the seed module is not less than the average weight of the edges connecting it to nodes outside the seed module, the node is added to the seed module; otherwise it is not added.
CN202110383651.3A 2021-04-09 2021-04-09 Tumor driving pathway prediction system for collective cell mutation data and protein network Active CN113113083B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110383651.3A CN113113083B (en) 2021-04-09 2021-04-09 Tumor driving pathway prediction system for collective cell mutation data and protein network

Publications (2)

Publication Number Publication Date
CN113113083A CN113113083A (en) 2021-07-13
CN113113083B CN113113083B (en) 2022-08-09

Family

ID=76715082

Country Status (1)

Country Link
CN (1) CN113113083B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115171779B (en) * 2022-07-13 2023-09-22 浙江大学 Cancer driving gene prediction device based on graph attention network and multiple groups of chemical fusion

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108292326A (en) * 2015-08-27 2018-07-17 皇家飞利浦有限公司 Carry out the integration method and system that the patient-specific body cell of identification function distorts for using multigroup cancer to compose
CN111581946A (en) * 2020-04-21 2020-08-25 上海爱数信息技术股份有限公司 Language sequence model decoding method
CN111899882A (en) * 2020-08-07 2020-11-06 北京科技大学 Method and system for predicting cancer

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US12068059B2 (en) * 2017-01-25 2024-08-20 Whitehead Institute For Biomedical Research Methods for building genomic networks and uses thereof
EP3921834A1 (en) * 2019-02-07 2021-12-15 Bioclue BV Biological sequencing
CN111599406B (en) * 2020-05-25 2023-08-04 江南大学 Global multi-network comparison method combined with network clustering method
CN111816246B (en) * 2020-05-27 2023-01-10 上海大学 Method for identifying driving gene from difference network
CN112259163B (en) * 2020-10-28 2022-04-22 广西师范大学 Cancer driving module identification method based on biological network and subcellular localization data
CN112435714B (en) * 2020-11-03 2021-07-02 北京科技大学 Tumor immune subtype classification method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant