CN110060730A

CN110060730A - A gene module analysis method

Info

Publication number: CN110060730A
Application number: CN201910267199.7A
Authority: CN
Inventors: 苏延森; 祝火乐; 张磊
Original assignee: Anhui University
Current assignee: Anhui University
Priority date: 2019-04-03
Filing date: 2019-04-03
Publication date: 2019-07-26
Anticipated expiration: 2039-04-03
Also published as: CN110060730B

Abstract

The invention discloses a gene module analysis method, comprising: inputting a gene-phenotype double-layer network, a gene functional similarity network, and a set s₀ of known disease genes related to a disease phenotype; adding connections between genes and phenotypes in the gene-phenotype double-layer network; taking as candidate genes those genes in the double-layer network that are connected by an edge to a disease gene in s₀ and are not themselves in s₀; and computing and selecting the candidate gene with the largest sum of semantic similarity, topological similarity and phenotype relevance and adding it to s₀. When the set of expanded candidate genes is no longer simultaneously significantly enriched for the GO (Gene Ontology) functional annotations related to the disease phenotype, the biological pathway genes, and the genes differentially expressed between disease-phenotype samples and normal samples, the current generation is recorded as m, and the method outputs the candidate genes added to s₀ over the first m-1 generations together with the known disease genes connected by an edge to those added candidate genes.

Description

A gene module analysis method

Technical field

The present invention relates to the technical field of data analysis, and in particular to a gene module analysis method.

Background technique

At present, methods for detecting disease modules in the human interactome fall broadly into three classes:

The first class detects disease modules from gene expression data. Gene expression profiles are a common and effective data resource for mining and studying disease modules. However, because existing data still contain many errors and missing values and are high-dimensional, mining disease module structure with conventional clustering algorithms alone does not give good results.

The second class detects disease modules from a single network. The one-sidedness and errors of single-network data strongly affect gene module mining, so the gene modules finally obtained can deviate considerably from the actual gene modules.

The third class detects disease modules from multiple networks. Among the algorithms that mine module structure from multiple networks, the focus is on finding module structures that occur frequently across all the networks. Because the networks used are all built from gene expression profiles that are highly associated with a given disease phenotype, the more frequently a gene module occurs, the more strongly it is related to that disease phenotype.

Summary of the invention

In view of the technical problems described in the background, the present invention proposes a gene module analysis method.

The gene module analysis method proposed by the present invention comprises:

S1, input a gene-phenotype double-layer network, a gene functional similarity network, and a set s₀ of known disease genes related to a disease phenotype x;

S2, add connections between genes and phenotypes in the gene-phenotype double-layer network;

S3, take as candidate genes those genes in the gene-phenotype double-layer network that are connected by an edge to any disease gene in s₀ and are not in s₀; iteratively compute the topological similarity, semantic similarity and phenotype relevance of each candidate gene; add and delete connections between genes according to topological similarity and semantic similarity; then recompute and select the candidate gene with the largest sum of semantic similarity, topological similarity and phenotype relevance and add it to s₀. Iteration continues until the set of expanded candidate genes is no longer simultaneously significantly enriched for the GO functional annotations related to disease phenotype x, the biological pathway genes, and the genes differentially expressed between disease phenotype x samples and normal samples; the generation of candidate expansion at that point is recorded as m. The candidate genes added to the disease gene set s₀ over the first m-1 generations, together with the known disease genes connected by an edge to those added candidate genes, are output as the target disease gene module of disease phenotype x.

Preferably, in step S1, the gene-phenotype double-layer network is specifically as follows:

The gene-phenotype double-layer network is constructed from a gene network A, a phenotype similarity network B and a gene-phenotype relationship network C; the adjacency matrix of the double-layer network can be expressed as $\begin{pmatrix} A & C \\ C^{T} & B \end{pmatrix}$, where $C^{T}$ is the transpose of C.

Preferably, step S2 specifically comprises:

Defining the phenotype relevance $w_{ij}$ of gene i and gene j in the gene-phenotype double-layer network by [equation not reproduced], where p(i) is the set of phenotypes that include disease gene i, p(j) is the set of phenotypes that include disease gene j, p(i)∩p(j) is the set of phenotypes that include both disease gene i and disease gene j, |p(i)| is the number of phenotypes in p(i), |p(j)| is the number of phenotypes in p(j), and |p(u)| is the number of known disease genes of phenotype u;

Computing the normalized value of the phenotype relevance, $w_{ij} / \max_{j'}(w_{ij'})$, where $\max_{j'}(w_{ij'})$ is the largest phenotype relevance between gene i and any gene j' in the gene-phenotype double-layer network;

Computing the likelihood that gene $g_i$ in the gene-phenotype double-layer network is a disease gene of disease phenotype $p_t$: [equation not reproduced], where the formula uses the phenotype relevance of genes $g_i$ and $g_j$, and $|p_t|$ is the size of the known disease gene set of disease phenotype $p_t$;

Determining whether this likelihood is greater than γ; if so, setting the element in row i, column t of the gene-phenotype relationship network C to 1, thereby adding a connection between $g_i$ and $p_t$, where γ is a preset adjustable parameter.
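The S2 procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the formula of the invention: the expression for w_ij is not reproduced in the text, so the sketch assumes a common collaborative-filtering form (co-occurrence over shared phenotypes, with a penalty for phenotypes that have many known genes, divided by sqrt(|p(i)|·|p(j)|)). The likelihood is likewise assumed to be the mean normalized relevance between a gene and the known disease genes of a phenotype. All function names and the toy data are hypothetical.

```python
import math

def phenotype_relevance(gene_phenos, pheno_genes, i, j):
    """Assumed form of w_ij: penalised co-occurrence over shared phenotypes."""
    shared = gene_phenos[i] & gene_phenos[j]
    if not shared:
        return 0.0
    num = sum(1.0 / math.log(1 + len(pheno_genes[u])) for u in shared)
    return num / math.sqrt(len(gene_phenos[i]) * len(gene_phenos[j]))

def normalised_relevance(gene_phenos, pheno_genes, i):
    """w_ij divided by max over j' of w_ij', for every other gene j."""
    raw = {j: phenotype_relevance(gene_phenos, pheno_genes, i, j)
           for j in gene_phenos if j != i}
    top = max(raw.values(), default=0.0)
    return {j: (w / top if top > 0 else 0.0) for j, w in raw.items()}

def add_gene_phenotype_links(gene_phenos, gamma):
    """Return new (gene, phenotype) links whose assumed likelihood exceeds gamma."""
    # Invert gene -> phenotypes into phenotype -> known disease genes.
    pheno_genes = {}
    for g, phenos in gene_phenos.items():
        for p in phenos:
            pheno_genes.setdefault(p, set()).add(g)
    new_links = []
    for g in gene_phenos:
        w = normalised_relevance(gene_phenos, pheno_genes, g)
        for p, known in pheno_genes.items():
            if g in known:
                continue  # link already present in C
            # Assumption: likelihood = mean normalised relevance to known genes of p.
            likelihood = sum(w[j] for j in known) / len(known)
            if likelihood > gamma:
                new_links.append((g, p))
    return sorted(new_links)

# Toy data: gene -> set of phenotypes it is known to be associated with.
gene_phenos = {"g1": {"p1", "p2"}, "g2": {"p1", "p2"}, "g3": {"p2"}, "g4": {"p3"}}
links = add_gene_phenotype_links(gene_phenos, gamma=0.5)
```

In this toy example only g3 gains a link (to p1), because it shares phenotype p2 with both known genes of p1; g4 shares nothing with the other genes and stays unchanged.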

Preferably, step S3 specifically comprises:

S301, define the current iteration number as n and initialize n = 1;

S302, obtain the genes in the gene-phenotype double-layer network that are connected by an edge to any gene in the disease gene set s₀ related to disease phenotype x, take them as candidate genes, denoted $c_0 = \{g_1, g_2, \ldots, g_i, \ldots, g_w\}$, and define the topological similarity of candidate gene i in the n-th iteration: [equation not reproduced]

where k is the number of edges between gene i and the other genes in the gene-phenotype double-layer network, $k_s$ is the number of edges between gene i and the genes in s₀, and N is the number of genes in the network;
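The topology formula itself is not reproduced in the text. One quantity that can be computed from exactly the counts named here (k, k_s and N, plus the size of s₀) is a hypergeometric connectivity p-value: the probability that a gene with k edges would have at least k_s of them landing in s₀ by chance. The sketch below computes that tail probability as an assumption, not as the definition used by the invention; `connectivity_pvalue` and its parameter names are hypothetical.

```python
from math import comb

def connectivity_pvalue(k, k_s, n_genes, n_seed):
    """P(X >= k_s) for X ~ Hypergeometric(population=n_genes, successes=n_seed, draws=k)."""
    upper = min(k, n_seed)          # cannot hit more seeds than edges or seeds exist
    total = comb(n_genes, k)        # all ways to place k edges among n_genes genes
    tail = sum(comb(n_seed, x) * comb(n_genes - n_seed, k - x)
               for x in range(k_s, upper + 1))
    return tail / total

# Toy call: 10 genes, 4 of them in s0; the candidate has 3 edges, 2 of them into s0.
p_value = connectivity_pvalue(k=3, k_s=2, n_genes=10, n_seed=4)
```

Under this reading a smaller value means a more significant connection to s₀, which would be consistent with the topology term being subtracted in the score of step S314.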

S303, compute the average topological similarity of all candidate genes in the n-th iteration, $\mathit{ave\_Topology} = \frac{1}{W}\sum_{i=1}^{W} \mathit{Topology}_i$, where W is the number of candidate genes in the current generation;

S304, compare the topological similarity of candidate gene i with the average ave_Topology over all candidate genes; if the topological similarity of candidate gene i is greater than the average, add candidate gene i to gene set TB; otherwise add candidate gene i to gene set TS;

S305, for each gene $g_i$ in TB, compute the average functional similarity between $g_i$ and the genes in s₀ that are connected to it by an edge, $\mathit{ave1\_similar}_i = \frac{1}{l_1}\sum_{g_j \in s_0,\ A_{ij}=1} \mathit{similar}_{ij}$, where $\mathit{similar}_{ij}$ is the functional similarity between gene $g_i$ and gene $g_j$, $A_{ij}=1$ indicates that $g_i$ and $g_j$ are connected by an edge, and $l_1$ is the number of genes in s₀ that are connected to $g_i$ by an edge;

S306, compare $\mathit{similar}_{ij}$ with $\mathit{ave1\_similar}_i$; if $\mathit{similar}_{ij} > \mathit{ave1\_similar}_i$ and $A_{ij}=0$, set the element $A_{ij}$ in row i, column j of the gene network A to 1, adding an edge between $g_i$ and $g_j$, where $g_i \in TB$ and $g_j \in s_0$;

S307, for each gene $g_i$ in TS, compute the average functional similarity between $g_i$ and the genes in s₀ that are not connected to it by an edge, $\mathit{ave2\_similar}_i = \frac{1}{l_2}\sum_{g_j \in s_0,\ A_{ij}=0} \mathit{similar}_{ij}$, where $\mathit{similar}_{ij}$ is the functional similarity between gene $g_i$ and gene $g_j$, $A_{ij}=0$ indicates that $g_i$ and $g_j$ are not connected by an edge, and $l_2$ is the number of genes in s₀ that are not connected to $g_i$;

S308, compare $\mathit{similar}_{ij}$ with $\mathit{ave2\_similar}_i$; if $\mathit{similar}_{ij} < \mathit{ave2\_similar}_i$ and $A_{ij}=1$, set the element $A_{ij}$ in row i, column j of the gene network A to 0, deleting the edge between $g_i$ and $g_j$, where $g_i \in TS$ and $g_j \in s_0$;
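Steps S303 to S308 can be illustrated with a short sketch. It takes the topological similarity values as given, treats both averages as plain arithmetic means as the step descriptions suggest, and collects all changes before applying them to the network. The function name `rewire`, the dictionary-based data layout and the toy values are assumptions made for illustration only.

```python
def rewire(adj, similar, topology, seeds):
    """Split candidates into TB/TS by mean topology, then add/delete edges to s0.

    adj:      dict gene -> set of neighbours (undirected); modified in place.
    similar:  dict frozenset({a, b}) -> functional similarity.
    topology: dict candidate gene -> topological similarity.
    seeds:    set of genes in s0.
    """
    mean_topology = sum(topology.values()) / len(topology)
    tb = [g for g, t in topology.items() if t > mean_topology]
    ts = [g for g, t in topology.items() if t <= mean_topology]

    def sim(a, b):
        return similar.get(frozenset((a, b)), 0.0)

    added, removed = [], []
    for g in tb:  # S305/S306: add edges to unusually similar, unlinked seed genes
        linked = [s for s in seeds if s in adj[g]]
        if not linked:
            continue
        ave1 = sum(sim(g, s) for s in linked) / len(linked)
        for s in sorted(seeds - adj[g]):
            if sim(g, s) > ave1:
                added.append((g, s))
    for g in ts:  # S307/S308: delete edges to unusually dissimilar, linked seed genes
        unlinked = [s for s in seeds if s not in adj[g]]
        if not unlinked:
            continue
        ave2 = sum(sim(g, s) for s in unlinked) / len(unlinked)
        for s in sorted(seeds & adj[g]):
            if sim(g, s) < ave2:
                removed.append((g, s))
    for g, s in added:
        adj[g].add(s)
        adj[s].add(g)
    for g, s in removed:
        adj[g].discard(s)
        adj[s].discard(g)
    return added, removed

seeds = {"s1", "s2", "s3"}
adj = {"a": {"s1"}, "b": {"s1", "s2"}, "s1": {"a", "b"}, "s2": {"b"}, "s3": set()}
topology = {"a": 0.8, "b": 0.2}
similar = {frozenset(p): v for p, v in [
    (("a", "s1"), 0.5), (("a", "s2"), 0.9), (("a", "s3"), 0.1),
    (("b", "s1"), 0.7), (("b", "s2"), 0.1), (("b", "s3"), 0.4)]}
added, removed = rewire(adj, similar, topology, seeds)
```

Here gene a falls in TB and gains an edge to s2, while gene b falls in TS and loses its edge to s2.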

S309, obtain again the genes in the gene-phenotype double-layer network that are connected by an edge to any gene in the gene set s₀, take them as candidate genes, and denote the candidate gene set $c'_0$;

S310, compute the topological similarity of each candidate gene i in $c'_0$ and obtain its normalized value $\mathit{NTopology}_i = \frac{\mathit{Topology}_i - \mathit{Topology}_{min}}{\mathit{Topology}_{max} - \mathit{Topology}_{min}}$, where $\mathit{Topology}_i$ is the topological similarity of candidate gene i, $\mathit{Topology}_{min}$ is the smallest topological similarity among all candidate genes, and $\mathit{Topology}_{max}$ is the largest topological similarity among all candidate genes;

S311, compute the semantic similarity of each candidate gene i in $c'_0$, $\mathit{Similar}_i = \frac{1}{z}\sum_{g_j \in s_0} \mathit{similar}_{ij}$, where gene j is a gene in the disease gene set s₀ and z is the number of genes in s₀;

S312, obtain the normalized value of the semantic similarity of each candidate gene i in $c'_0$, $\mathit{Nsimilar}_i = \frac{\mathit{Similar}_i - \mathit{Similar}_{min}}{\mathit{Similar}_{max} - \mathit{Similar}_{min}}$, where $\mathit{Similar}_i$ is the semantic similarity of candidate gene i, $\mathit{Similar}_{min}$ is the smallest semantic similarity among all candidate genes, and $\mathit{Similar}_{max}$ is the largest semantic similarity among all candidate genes;

S313, candidate gene i appears in the disease gene sets of e disease phenotypes similar to disease phenotype x; the phenotype similarities between these e disease phenotypes and disease phenotype x, together with the disease gene sets and the numbers of disease genes, are expressed as a set of triples $O = \{(s_1,c_1,t_1),\ldots,(s_k,c_k,t_k),\ldots,(s_e,c_e,t_e)\}$; compute the degree of association $r_i$ between candidate gene i and disease phenotype x: [equation not reproduced]

S314, compute the score of each candidate gene i in the set $c'_0$: $\mathit{score}_i = \mathit{Nsimilar}_i - \mathit{NTopology}_i + r_i$;
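Steps S310 to S314 can be illustrated with a short sketch. It assumes min-max normalization for both similarity values, as the minimum and maximum terms named in S310 and S312 suggest, and takes the association degree r_i as already computed, since its formula is not reproduced in the text. The function names and the toy values are hypothetical.

```python
def min_max(values):
    """Min-max normalise a dict of gene -> value into [0, 1]."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        # All values equal: return zeros to avoid division by zero
        # (a choice made for this sketch, not stated in the text).
        return {g: 0.0 for g in values}
    return {g: (v - lo) / (hi - lo) for g, v in values.items()}

def best_candidate(topology, semantic, relevance):
    """score_i = Nsimilar_i - NTopology_i + r_i; return the top gene and all scores."""
    n_top = min_max(topology)
    n_sim = min_max(semantic)
    scores = {g: n_sim[g] - n_top[g] + relevance[g] for g in topology}
    best = max(scores, key=scores.get)
    return best, scores

# Toy values for three candidate genes.
topology = {"a": 0.01, "b": 0.05, "c": 0.03}
semantic = {"a": 0.6, "b": 0.9, "c": 0.3}
relevance = {"a": 0.2, "b": 0.0, "c": 0.5}
best, scores = best_candidate(topology, semantic, relevance)
```

With these values gene a scores about 0.7 and is the gene that step S315 would add to s₀.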

S315, add the candidate gene i with the highest $\mathit{score}_i$ to the disease gene set s₀, set n = n + 1, and determine whether any candidate genes remain in the gene-phenotype double-layer network; if so, go to step S302; otherwise, go to step S316;

S316, when the set of expanded candidate genes is no longer simultaneously significantly enriched for the GO functional annotations related to disease phenotype x, the biological pathway genes, and the genes differentially expressed between disease phenotype x samples and normal samples, record the generation of candidate expansion as m, and output the candidate genes added to the disease gene set over the first m-1 generations, together with the known disease genes connected to those added candidate genes, as the target disease module of disease phenotype x.
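The generation bookkeeping of steps S315 and S316 might look like the sketch below. The enrichment test and the scoring are abstracted into two hypothetical callbacks, `is_enriched` and `pick_best`, because neither is specified in runnable detail here. The sketch checks enrichment after each tentative addition, which is one reading of the stopping rule, and returns the genes from generations 1 to m-1 together with the known disease genes linked to them.

```python
def expand_module(seeds, adj, pick_best, is_enriched):
    """Grow s0 one gene per generation; stop at the first generation m whose
    expanded set fails the enrichment check and keep generations 1..m-1."""
    current = set(seeds)
    added = []  # candidate genes in the order they were added
    while True:
        # Candidates: neighbours of the current set that are not already in it.
        candidates = {n for s in current for n in adj.get(s, ())} - current
        if not candidates:
            break
        gene = pick_best(candidates, current)
        if not is_enriched(added + [gene]):
            break  # generation m fails, so it is not kept
        added.append(gene)
        current.add(gene)
    # Known disease genes that share an edge with at least one added gene.
    linked_seeds = {s for s in seeds if any(g in adj.get(s, ()) for g in added)}
    return added, linked_seeds

adj = {"s1": {"a"}, "s2": {"b"}, "a": {"s1", "c"}, "b": {"s2"}, "c": {"a"}}
seeds = {"s1", "s2"}
# Stand-in callbacks: alphabetical choice, and "enriched" while at most 2 genes added.
added, linked = expand_module(
    seeds, adj,
    pick_best=lambda cands, cur: min(cands),
    is_enriched=lambda genes: len(genes) <= 2,
)
```

In this toy run the third generation fails the stand-in check, so the module consists of the two genes added before it plus both known disease genes.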

The present invention fuses a gene network and a phenotype network into a gene-phenotype double-layer network; computes the topological similarity and semantic similarity between candidate genes and disease genes, as well as the relevance between candidate genes and diseases similar to the disease under study; and combines topological similarity, semantic similarity and relevance to detect disease modules. By effectively integrating multiple kinds of biological data and multiple attributes, it improves the accuracy of disease module detection. Further, existing protein interactions are added and deleted according to topological similarity and semantic similarity, and gene-phenotype relationships are added by collaborative filtering; adjusting the gene-phenotype double-layer network in this way effectively reduces the influence of false-positive and false-negative data on disease module detection.

Brief description of the drawings

Fig. 1 is a flow diagram of the gene module analysis method proposed by the present invention.

Detailed description of the embodiments

Referring to Fig. 1, the gene module analysis method proposed by the present invention comprises:

In step S1, the gene-phenotype double-layer network is constructed from a gene network A, a phenotype similarity network B and a gene-phenotype relationship network C; the adjacency matrix of the double-layer network can be expressed as $\begin{pmatrix} A & C \\ C^{T} & B \end{pmatrix}$, where $C^{T}$ is the transpose of C.

In a specific scheme, $l_1$ pairs of protein interactions involving N proteins are collected from $D_1$ databases. The protein network can be abstracted as a graph $G_1 = (V_1, E_1)$ composed of a protein node set $V_1$ and an interaction edge set $E_1$, with number of nodes $N = |V_1|$ and number of edges $l_1 = |E_1|$; each edge in $E_1$ corresponds to a pair of nodes in $V_1$. The adjacency matrix $A = (A_{ij})_{N \times N}$ represents the protein network: $A_{ij} = 1$ indicates that node i and node j are connected by an edge, and $A_{ij} = 0$ indicates that they are not. A is a symmetric matrix of 0s and 1s. Because genes and proteins are essentially in one-to-one correspondence, the protein network is essentially equivalent to the gene network.

$l_2$ pairs of phenotype similarity relationships involving M phenotypes are collected from $D_2$ databases. The phenotype similarity network can be abstracted as a graph $G_2 = (V_2, E_2)$ composed of a phenotype node set $V_2$ and a phenotype similarity edge set $E_2$, with number of nodes $M = |V_2|$ and number of edges $l_2 = |E_2|$; each edge in $E_2$ corresponds to a pair of nodes in $V_2$. The adjacency matrix $B = (B_{ij})_{M \times M}$ represents the phenotype similarity network: $B_{ij} > 0$ indicates that node i and node j are connected by an edge, and $B_{ij} = 0$ indicates that they are not. B is a symmetric matrix whose elements lie in the interval [0,1].

$l_3$ pairs of gene-phenotype relationships involving N' disease genes and M phenotypes are collected from $D_3$ databases; the set of the N' disease genes is denoted $V'_3$. The gene-phenotype relationship network can be abstracted as a graph $G_3 = (V_3, E_3)$ composed of a gene and phenotype node set $V_3 = V_2 \cup V'_3$ and a gene-phenotype relationship edge set $E_3$, with number of nodes $N + M = |V_3|$ and number of edges $l_3 = |E_3|$; each edge in $E_3$ corresponds to a pair of nodes in $V_3$. The adjacency matrix $C = (C_{ij})_{N \times M}$ represents the gene-phenotype relationship network: $C_{ij} = 1$ indicates that gene i and phenotype j are connected by an edge, and $C_{ij} = 0$ indicates that they are not. C is a matrix of 0s and 1s.

The gene-phenotype double-layer network is then constructed from the gene network A, the phenotype similarity network B and the gene-phenotype relationship network C; the adjacency matrix of the double-layer network can be expressed as $\begin{pmatrix} A & C \\ C^{T} & B \end{pmatrix}$, where $C^{T}$ is the transpose of C.
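The construction in step S1 can be illustrated with a small sketch. It assumes the block layout [[A, C], [Cᵀ, B]], which is the arrangement that matches the stated dimensions of A (N×N), B (M×M) and C (N×M); the function names and the toy networks are hypothetical.

```python
def adjacency_from_pairs(n, pairs):
    """Symmetric 0/1 adjacency matrix from undirected index pairs."""
    m = [[0] * n for _ in range(n)]
    for i, j in pairs:
        m[i][j] = m[j][i] = 1
    return m

def two_layer_adjacency(a, b, c):
    """Assemble [[A, C], [C^T, B]] from A (N x N), B (M x M) and C (N x M)."""
    n, m = len(a), len(b)
    if len(c) != n or any(len(row) != m for row in c):
        raise ValueError("C must have shape N x M")
    c_t = [[c[i][j] for i in range(n)] for j in range(m)]  # transpose of C
    top = [a[i] + c[i] for i in range(n)]        # gene rows: [A | C]
    bottom = [c_t[j] + b[j] for j in range(m)]   # phenotype rows: [C^T | B]
    return top + bottom

# Toy data: 3 genes with 2 interactions, 2 phenotypes with similarity 0.6,
# gene 0 linked to phenotype 0 and gene 2 linked to phenotype 1.
A = adjacency_from_pairs(3, [(0, 1), (1, 2)])
B = [[0, 0.6], [0.6, 0]]
C = [[1, 0], [0, 0], [0, 1]]
H = two_layer_adjacency(A, B, C)
```

The result H is an (N+M)×(N+M) symmetric matrix whose first N rows describe genes and whose last M rows describe phenotypes.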

Step S2, add connections between genes and phenotypes in the gene-phenotype double-layer network.

This step specifically comprises:

In a specific scheme, existing protein interactions are added and deleted according to topological similarity and semantic similarity, and gene-phenotype relationships are added by collaborative filtering. This adjusts the structure of the gene-phenotype double-layer network and effectively reduces the influence of false-positive and false-negative data on disease module detection.

Step S3, take as candidate genes those genes in the gene-phenotype double-layer network that are connected by an edge to any disease gene in s₀ and are not in s₀; iteratively compute the topological similarity, semantic similarity and phenotype relevance of each candidate gene; add and delete connections between genes according to topological similarity and semantic similarity; then recompute and select the candidate gene with the largest sum of semantic similarity, topological similarity and phenotype relevance and add it to s₀. Iteration continues until the set of expanded candidate genes is no longer simultaneously significantly enriched for the GO functional annotations related to disease phenotype x, the biological pathway genes, and the genes differentially expressed between disease phenotype x samples and normal samples; the generation of candidate expansion at that point is recorded as m. The candidate genes added to the disease gene set s₀ over the first m-1 generations, together with the known disease genes connected by an edge to those added candidate genes, are output as the target disease gene module of disease phenotype x.

This step specifically comprises:

S301, define the current iteration number as n and initialize n = 1;

In a specific scheme, the gene network and the phenotype network are fused into a gene-phenotype double-layer network. Because existing biological data contain many errors and missing values, existing gene interactions are added and deleted according to topological similarity and semantic similarity, and gene-phenotype relationships are added by collaborative filtering, so as to adjust the gene-phenotype double-layer network. Together with the incorporation of disease phenotype data, this effectively reduces the influence of false-positive and false-negative data and improves the accuracy of disease module detection.

The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or change made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solution and inventive concept, shall fall within the scope of protection of the present invention.

Claims

1. A gene module analysis method, characterized by comprising:

S1, inputting a gene-phenotype double-layer network, a gene functional similarity network, and a set s₀ of known disease genes related to a disease phenotype x;

S3, taking as candidate genes those genes in the gene-phenotype double-layer network that are connected by an edge to any disease gene in s₀ and are not in s₀; iteratively computing the topological similarity, semantic similarity and phenotype relevance of each candidate gene; adding and deleting connections between genes according to topological similarity and semantic similarity; recomputing and selecting the candidate gene with the largest sum of semantic similarity, topological similarity and phenotype relevance and adding it to s₀; iterating until the set of expanded candidate genes is no longer simultaneously significantly enriched for the GO functional annotations related to disease phenotype x, the biological pathway genes, and the genes differentially expressed between disease phenotype x samples and normal samples; recording the generation of candidate expansion at that point as m; and outputting the candidate genes added to the disease gene set s₀ over the first m-1 generations, together with the known disease genes connected by an edge to those added candidate genes, as the target disease gene module of disease phenotype x.

2. The gene module analysis method according to claim 1, characterized in that, in step S1, the gene-phenotype double-layer network is specifically as follows:

The gene-phenotype double-layer network is constructed from a gene network A, a phenotype similarity network B and a gene-phenotype relationship network C; the adjacency matrix of the double-layer network can be expressed as $\begin{pmatrix} A & C \\ C^{T} & B \end{pmatrix}$, where $C^{T}$ is the transpose of C.

3. The gene module analysis method according to claim 2, characterized in that step S2 specifically comprises:

Defining the phenotype relevance $w_{ij}$ of gene i and gene j in the gene-phenotype double-layer network by [equation not reproduced], where p(i) is the set of phenotypes that include disease gene i, p(j) is the set of phenotypes that include disease gene j, p(i)∩p(j) is the set of phenotypes that include both disease gene i and disease gene j, |p(i)| is the number of phenotypes in p(i), |p(j)| is the number of phenotypes in p(j), and |p(u)| is the number of known disease genes of phenotype u;

Computing the normalized value of the phenotype relevance, $w_{ij} / \max_{j'}(w_{ij'})$, where $\max_{j'}(w_{ij'})$ is the largest phenotype relevance between gene i and any gene j' in the gene-phenotype double-layer network;

Computing the likelihood that gene $g_i$ in the gene-phenotype double-layer network is a disease gene of disease phenotype $p_t$: [equation not reproduced], where the formula uses the phenotype relevance of genes $g_i$ and $g_j$, and $|p_t|$ is the size of the known disease gene set of disease phenotype $p_t$;

Determining whether this likelihood is greater than γ; if so, setting the element in row i, column t of the gene-phenotype relationship network C to 1, thereby adding a connection between $g_i$ and $p_t$, where γ is a preset adjustable parameter.

4. The gene module analysis method according to claim 3, characterized in that step S3 specifically comprises:

S301, defining the current iteration number as n and initializing n = 1;

S302, obtaining the genes in the gene-phenotype double-layer network that are connected by an edge to any gene in the disease gene set s₀ related to disease phenotype x, taking them as candidate genes, denoted $c_0 = \{g_1, g_2, \ldots, g_i, \ldots, g_w\}$, and defining the topological similarity of candidate gene i in the n-th iteration: [equation not reproduced]

where k is the number of edges between gene i and the other genes in the gene-phenotype double-layer network, $k_s$ is the number of edges between gene i and the genes in s₀, and N is the number of genes in the network;

S304, comparing the topological similarity of candidate gene i with the average ave_Topology of the topological similarities of all candidate genes; if the topological similarity of candidate gene i is greater than the average, adding candidate gene i to gene set TB; otherwise adding candidate gene i to gene set TS;

S305, for each gene $g_i$ in TB, computing the average functional similarity between $g_i$ and the genes in s₀ that are connected to it by an edge, $\mathit{ave1\_similar}_i = \frac{1}{l_1}\sum_{g_j \in s_0,\ A_{ij}=1} \mathit{similar}_{ij}$, where $\mathit{similar}_{ij}$ is the functional similarity between gene $g_i$ and gene $g_j$, $A_{ij}=1$ indicates that $g_i$ and $g_j$ are connected by an edge, and $l_1$ is the number of genes in s₀ that are connected to $g_i$ by an edge;

S306, comparing $\mathit{similar}_{ij}$ with $\mathit{ave1\_similar}_i$; if $\mathit{similar}_{ij} > \mathit{ave1\_similar}_i$ and $A_{ij}=0$, setting the element $A_{ij}$ in row i, column j of the gene network A to 1, adding an edge between $g_i$ and $g_j$, where $g_i \in TB$ and $g_j \in s_0$;

S307, for each gene $g_i$ in TS, computing the average functional similarity between $g_i$ and the genes in s₀ that are not connected to it by an edge, $\mathit{ave2\_similar}_i = \frac{1}{l_2}\sum_{g_j \in s_0,\ A_{ij}=0} \mathit{similar}_{ij}$, where $\mathit{similar}_{ij}$ is the functional similarity between gene $g_i$ and gene $g_j$, $A_{ij}=0$ indicates that $g_i$ and $g_j$ are not connected by an edge, and $l_2$ is the number of genes in s₀ that are not connected to $g_i$;

S308, comparing $\mathit{similar}_{ij}$ with $\mathit{ave2\_similar}_i$; if $\mathit{similar}_{ij} < \mathit{ave2\_similar}_i$ and $A_{ij}=1$, setting the element $A_{ij}$ in row i, column j of the gene network A to 0, deleting the edge between $g_i$ and $g_j$, where $g_i \in TS$ and $g_j \in s_0$;

S309, obtaining again the genes in the gene-phenotype double-layer network that are connected by an edge to any gene in the gene set s₀, taking them as candidate genes, and denoting the candidate gene set $c'_0$;

S311, computing the semantic similarity of each candidate gene i in $c'_0$, $\mathit{Similar}_i = \frac{1}{z}\sum_{g_j \in s_0} \mathit{similar}_{ij}$, where gene j is a gene in the disease gene set s₀ and z is the number of genes in s₀;

S313, candidate gene i appearing in the disease gene sets of e disease phenotypes similar to disease phenotype x, expressing the phenotype similarities between these e disease phenotypes and disease phenotype x, together with the disease gene sets and the numbers of disease genes, as a set of triples $O = \{(s_1,c_1,t_1),\ldots,(s_k,c_k,t_k),\ldots,(s_e,c_e,t_e)\}$, and computing the degree of association between candidate gene i and disease phenotype x: [equation not reproduced]

S315, adding the candidate gene i with the highest $\mathit{score}_i$ to the disease gene set s₀, setting n = n + 1, and determining whether any candidate genes remain in the gene-phenotype double-layer network; if so, going to step S302; otherwise, going to step S316;

S316, when the set of expanded candidate genes is no longer simultaneously significantly enriched for the GO functional annotations related to disease phenotype x, the biological pathway genes, and the genes differentially expressed between disease phenotype x samples and normal samples, recording the generation of candidate expansion as m, and outputting the candidate genes added to the disease gene set over the first m-1 generations, together with the known disease genes connected to those added candidate genes, as the target disease module of disease phenotype x.