KR101853916B1 - Method for determining pathway-specificity of protein domains, and its application for identifying disease genes - Google Patents

Publication number
KR101853916B1
Authority
KR
South Korea
Prior art keywords
domain
protein
pathway
equation
biological pathway
Prior art date
Application number
KR1020160041518A
Other languages
Korean (ko)
Other versions
KR20170114504A (en)
Inventor
이인석
심정은
Original Assignee
연세대학교 산학협력단
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 연세대학교 산학협력단
Priority to KR1020160041518A
Publication of KR20170114504A
Application granted
Publication of KR101853916B1

Classifications

    • G06F19/18
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10: Complex mathematical operations
    • G06F17/18: Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G06F19/12

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Operations Research (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Algebra (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The present invention relates to a network-based method for determining the biological pathway specificity of a protein domain, comprising the steps of: (a) analyzing the similarity of domain profiles between proteins based on a weighted mutual information (WMI) measure; (b) constructing a co-pathway network of proteins based on the similarity of the domain profiles; and (c) assessing the association between the protein domain and the biological pathway.

Description

FIELD OF THE INVENTION [0001] The present invention relates to a method for determining the biological pathway specificity of a protein domain and a disease gene discovery method using the same.

The present invention relates to a network-based method for determining the biological pathway specificity of a protein domain and a method for discovering disease genes using the same.

Protein domains are structural, evolutionary, and functional units of proteins and provide information about biological pathway associations.

However, since many protein domains are involved in biological pathways in various cellular processes, annotation to the biological pathways of the protein domains remains an incomplete task.

For example, the winged helix-turn-helix DNA-binding domain is found in DNA-binding proteins associated with multiple biological pathways.

Thus, despite the functional and structural characteristics of a protein domain, its association with a particular biological pathway cannot be guaranteed.

However, some protein domains may be highly specific to particular biological pathways. If a domain of a protein is highly specific to a particular biological pathway, biological pathway annotation becomes more accurate and easier. Consequently, given that most human diseases are associated with specific biological pathways, identifying protein domains associated with specific pathways can facilitate the study of disease.

The genome-wide association study (GWAS) is a technique for identifying causal genetic variants related to the occurrence of disease through large-scale single nucleotide polymorphism (SNP) analysis.

However, GWAS has insufficient statistical power owing to uncontrolled conditions such as limited sample sizes and population structure, and for most of the observed genetic variants a genetic link to disease is not recognized.

The present inventors have assumed that if the protein domain is important for the operation of the biological pathway, the genes involved in the biological pathway are likely to share the same domain.

In addition, the present inventors tested whether domain-sharing patterns can identify co-pathway associations, developed the domain information content score (DOMICS), a network-based scoring technique, and selected pathway-specific domains (PSDs).

Finally, the inventors developed a method for screening disease-associated genes by enhancing the statistical power of genome-wide association study results using biological pathway-specific domains.

It is an object of the present invention to provide a method for inferring a function of a protein domain through a functional network and analyzing specificity between a protein domain and a biological pathway.

According to an aspect of the present invention, there is provided a method for determining the biological pathway specificity of a protein domain, comprising the steps of: (a) analyzing the similarity of domain profiles between proteins based on a weighted mutual information (WMI) measure; (b) constructing a co-pathway network of proteins based on the similarity of the domain profiles; and (c) assessing the association between the protein domain and the biological pathway.

In one embodiment, in step (a), the weighted mutual information measure may assign to each protein domain a weight calculated according to Equation 1, which is defined according to the rarity of the domain.

[Equation 1]

Figure 112016032793544-pat00001

In one embodiment, step (c) comprises: evaluating the association between the classified proteins and a biological pathway based on Bayesian statistics; assessing the association between the domain and the biological pathway using the protein-pathway association and the domain profile; and measuring the biological pathway information content of the domain.

In one embodiment, the association between the classified proteins and the biological pathway in step (c) may be calculated as a Protein-Pathway Association score (PPA score), converted into a probability score, according to Equations 5 to 7 below.

[Equation 5]

Figure 112016032793544-pat00002

[Equation 6]

Figure 112016032793544-pat00003

[Equation 7]

Figure 112016032793544-pat00004

In one embodiment, the association between the domain and the biological pathway in step (c) may be calculated as a Domain-Pathway Association score (DPA score) according to Equation (9).

[Equation 9]

Figure 112016032793544-pat00005

In one embodiment, in step (c), the biological pathway information content of the domain can be calculated as a Domain Information Content Score (DOMICS) according to Equation (10).

[Equation 10]

Figure 112016032793544-pat00006

In one embodiment, a pathway-specific domain (PSD) may be selected according to the calculated domain information content score.

In one embodiment, the disease-associated genes can be screened by applying the results of the biological pathway-specific domain (PSD) screening to Genome Wide Association Study (GWAS) results.

The method for determining the biological pathway specificity of the protein domain according to one aspect of the present invention can analyze and quantify the relationship between the protein domain and the biological pathway, thereby effectively providing meaningful data for discovering a disease-associated gene.

It should be understood that the effects of the present invention are not limited to the above effects and include all effects that can be deduced from the detailed description of the present invention or the configuration of the invention described in the claims.

FIG. 1 is a diagram illustrating a method of calculating the domain information content score (DOMICS) using a domain-based co-pathway network. (A) A method of analyzing the similarity of domain profiles, arranged as Boolean values, using weighted mutual information (WMI) and constructing a co-pathway network. (B) A method of calculating the Protein-Pathway Association score (PPA score). (C) A method of calculating the Domain-Pathway Association score (DPA score). (D) A method of finally calculating the domain information content score (DOMICS).
FIG. 2 illustrates the association between biological pathway-specific domains and inherited diseases through mutations that inhibit protein interactions in a biological pathway. (A) DOMICS analysis of a total of 49,636 domain-pathway associations between 5,253 InterPro domains and 407 GOBP biological pathways. (B) Normalized variation rate (NVR) of neutral or disease-associated variants in PSDs and NSDs. (C) Normalized variation rate (NVR) in PSDs and NSDs for variants that partially or totally abolish protein interactions. (D) Comparison of the ratio of PSDs to NSDs among similarly sized groups of interfacing domains (IFDs) of the human structural interaction network (hSIN) with different numbers of domain interaction links. (E) A model of the relationship between mutational consequences and the number of domain interactions.
FIG. 3 shows the prioritization of GWAS-level candidate genes using disease-associated PSDs. (A) Summary of the method for prioritizing coronary artery disease (CAD) and schizophrenia (SCZ) associated genes based on genome-wide association study (GWAS) results and domain information. (B) Accuracy of CAD gene prediction for the GWAS ∩ PSD set, the PSD-only set, and the GWAS-only set, based on SZdatabase annotation. (C) Accuracy of SCZ gene prediction for the GWAS ∩ PSD set, the PSD-only set, and the GWAS-only set, based on SZdatabase annotation.
FIG. 4 shows the prediction of coronary artery disease (CAD) genes through analysis of loss-of-function phenotypes in zebrafish. (A) Morphological abnormalities of the heart observed after morpholino injection targeting CAD candidate genes in Tg(flk1:EGFP) zebrafish embryos. (B) Asymmetric heart phenotypes in Tg(flk1:EGFP) zebrafish embryos injected with morpholinos. (C) Defects observed in morpholino-injected Tg(flk1:EGFP) zebrafish embryos.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS Hereinafter, the present invention will be described with reference to the accompanying drawings. The present invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. In order to clearly illustrate the present invention, parts not related to the description are omitted, and similar parts are denoted by like reference characters throughout the specification.

When an element is referred to as "comprising" another element, it means that it may further include other elements, rather than excluding them, unless specifically stated otherwise.

Unless otherwise defined, the invention can be practiced using conventional techniques of molecular biology, microbiology, protein purification, protein engineering, DNA sequencing, and recombinant DNA that are within the skill of those in the art. These techniques are known to those skilled in the art and are described in many standard textbooks and references.

Unless otherwise defined herein, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art.

Various scientific dictionaries that include the terms used herein are well known and available in the art. Although any methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present application, some methods and materials are described. The invention is not intended to be limited to the particular methodology, protocols, and reagents, as these may be used in various ways depending on the context in which those skilled in the art use them.

As used herein, the singular forms include plural objects unless the context clearly dictates otherwise. Also, unless otherwise indicated, nucleic acids are written from left to right, 5 'to 3', amino acid sequences from left to right, amino to carboxyl.

Hereinafter, the present invention will be described in more detail.

According to an aspect of the present invention, there is provided a method for determining the biological pathway specificity of a protein domain, comprising the steps of: (a) analyzing the similarity of domain profiles between proteins based on a weighted mutual information (WMI) measure; (b) constructing a co-pathway network of proteins based on the similarity of the domain profiles; and (c) assessing the association between the protein domain and the biological pathway.

In step (a), the similarity of domain profiles between proteins can be analyzed based on the weighted mutual information (WMI) measure, and the protein domain profile information can be collected from a commercially available database.

The database stores at least one piece of sequence information for at least one domain, and may be InterPro, PROSITE, Pfam, SMART, or PRINTS. These databases may define domains by different methods; it is sufficient that they provide information on which domain a specific sequence belongs to.

Specifically, InterPro collects domain-related information from various databases and includes information on a plurality of domains. PROSITE defines functional sites using regular expressions. Pfam defines domains using hidden Markov models (HMMs) and predicts domains based on the defined models, expressing the result as a probability value. SMART (Simple Modular Architecture Research Tool) defines domains using probability profiles built from multiple sequence alignments (MSAs). PRINTS defines domains using only multiple sequence alignments.

FIG. 1 is a diagram illustrating a method of calculating the domain information content score (DOMICS) using a domain-based co-pathway network.

Referring to FIG. 1, the domain profile may be defined as an array of Boolean values indicating the presence or absence of each domain, based on information on the domains registered in InterPro, a commercially available database.

Each Boolean value indicates the presence or absence of a specific domain in a protein as 1 or 0, respectively, and the collected domain profiles can be used as data for judging similarity.
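As an illustration of the Boolean domain profile described above, the following minimal sketch builds a proteins-by-domains matrix of 0/1 values. The function name `build_domain_profile`, the input layout (a dictionary from protein to a set of domain identifiers), and the example identifiers are assumptions made for illustration and are not taken from the specification.

```python
import numpy as np

def build_domain_profile(protein_domains, domains):
    """Return (proteins, profile), where profile[i, j] is 1 if protein i
    contains domain j and 0 otherwise."""
    proteins = sorted(protein_domains)
    column = {d: j for j, d in enumerate(domains)}
    profile = np.zeros((len(proteins), len(domains)), dtype=np.int8)
    for i, protein in enumerate(proteins):
        for domain in protein_domains[protein]:
            if domain in column:          # ignore domains outside the chosen set
                profile[i, column[domain]] = 1
    return proteins, profile

# Hypothetical annotations: protein -> set of InterPro-style identifiers
annotations = {"P1": {"IPR000001", "IPR000002"}, "P2": {"IPR000002"}, "P3": set()}
domains = ["IPR000001", "IPR000002", "IPR000003"]
proteins, profile = build_domain_profile(annotations, domains)
```

Each row of `profile` is the domain profile of one protein and can be compared row by row to judge similarity.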

In step (a), the similarity of domain profiles between proteins can be analyzed based on the weighted mutual information (WMI) measure, and in step (b), a co-pathway network can be constructed.

The domain profiles provide information about proteins that share similar domain profiles, and domain profile similarity may be calculated as a mutual information (MI) score. Mutual information does not require an a priori model and can provide high accuracy in various applications.

However, since the amount of information carried by individual domains varies, a weight can be assigned to each protein domain according to the rarity of the domain to obtain more informative domain occurrence patterns (FIG. 1A).

[Equation 1]

Figure 112016032793544-pat00007

Here, c_kj has a value of 1 when protein k contains domain j, and 0 otherwise.
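Equation 1 is available only as an image in this text, so its exact form is not reproduced here. The sketch below is a stand-in under a stated assumption: it uses the weight log(N / n_j), where N is the number of proteins and n_j is the number of proteins containing domain j, so that rarer domains receive larger weights. Both the formula and the name `rarity_weights` are hypothetical.

```python
import numpy as np

def rarity_weights(profile):
    """Hypothetical rarity weight per domain: log(N / n_j), where n_j is the
    column sum of c_kj. Domains found in no protein get weight 0."""
    profile = np.asarray(profile)
    n_proteins = profile.shape[0]
    counts = profile.sum(axis=0)                 # n_j for every domain j
    weights = np.zeros(profile.shape[1])
    present = counts > 0
    weights[present] = np.log(n_proteins / counts[present])
    return weights

# Domain 0 is rare (one protein), domain 1 is ubiquitous, domain 2 is absent
profile = np.array([[1, 1, 0], [0, 1, 0], [0, 1, 0], [0, 1, 0]])
weights = rarity_weights(profile)
```

Under this assumed weighting, a domain present in every protein carries no weight, which matches the stated intuition that prevalent domains say little about a specific pathway.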

The present inventors developed the weighted mutual information (WMI) measure, which assigns a weight according to the rarity of a domain, on the assumption that a generally prevalent domain is involved in various biological functions, whereas a rare domain is involved in a specific biological response.

The weighted mutual information (WMI) can be calculated by the following equation (2).

[Equation 2]

Figure 112016032793544-pat00008

H_ω(X) and H_ω(Y) are the weighted entropies of protein X and protein Y, and H_ω(X, Y) is the weighted joint entropy of proteins X and Y.

Here, H_ω(X), H_ω(Y), and H_ω(X, Y) can be defined by the following equations (3) and (4).

[Equation 3]

Figure 112016032793544-pat00009

[Equation 4]

Figure 112016032793544-pat00010

The similarity of domain profiles between proteins can be analyzed using the weighted mutual information, and a co-pathway network can be constructed based on the similarity of the domain profiles.
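Equations 2 to 4 are likewise available only as images. The following sketch therefore rests on two assumptions: that WMI combines the three entropies through the standard identity H(X) + H(Y) - H(X, Y), and that the weighted entropies are computed from state probabilities estimated as weighted fractions of domains. The function names are hypothetical.

```python
import numpy as np

def _entropy(p):
    """Shannon entropy in bits of a probability vector (zero entries skipped)."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def weighted_mutual_information(x, y, w):
    """Assumed WMI between two Boolean domain profiles x and y.
    w holds one non-negative weight per domain (for example a rarity weight)."""
    x, y, w = np.asarray(x), np.asarray(y), np.asarray(w, dtype=float)
    total = w.sum()
    # joint[a, b]: weighted fraction of domains where x == a and y == b
    joint = np.array([[w[(x == a) & (y == b)].sum() for b in (0, 1)]
                      for a in (0, 1)]) / total
    h_x = _entropy(joint.sum(axis=1))
    h_y = _entropy(joint.sum(axis=0))
    h_xy = _entropy(joint.ravel())
    return h_x + h_y - h_xy

same = weighted_mutual_information([1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 1, 1])
unrelated = weighted_mutual_information([1, 1, 0, 0], [1, 0, 1, 0], [1, 1, 1, 1])
```

With equal weights this reduces to ordinary mutual information; identical profiles score high and statistically unrelated profiles score near zero, and protein pairs with high scores would be the ones linked in the co-pathway network.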

A group of proteins classified as similar may form a sub-network related to a specific biological pathway; proteins in the same sub-network are connected to each other and may also be connected to proteins belonging to other sub-networks to form a network (FIG. 1A).

On the other hand, in step (c), the relationship between the protein domain and the biological pathway can be evaluated.

Specifically, step (c) includes the steps of: evaluating the association between the classified proteins and the biological pathway based on Bayesian statistics; assessing the association between the domain and the biological pathway using the protein-pathway association and the domain profile; and measuring the biological pathway information content of the domain.

A sub-network constructed from domain profile similarity can be associated with a specific biological pathway, and the functional relationship between proteins can be calculated as a log likelihood score (LLS).

The association between a protein and a biological pathway can be calculated as a Protein-Pathway Association score (PPA score), converted into a probability score, according to Equations 5 to 7.

[Equation 5]

Figure 112016032793544-pat00011

P(L|E) and P(¬L|E) are the frequencies of positive (L) and negative (¬L) gold standard pathway gene links observed in the given experimental or computed data (E). P(L) and P(¬L) are the prior expectations, that is, the total frequencies of all positive and negative gold standard pathway gene pairs.
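Equation 5 is not reproduced in this text. A form of the log likelihood score that is consistent with the four quantities defined above is the log of the posterior odds divided by the prior odds, and the sketch below assumes that form. The function name and the use of raw link counts as frequency estimates are illustrative assumptions.

```python
import math

def log_likelihood_score(pos_links, neg_links, total_pos, total_neg):
    """Assumed LLS = ln[(P(L|E) / P(notL|E)) / (P(L) / P(notL))].
    pos_links, neg_links: positive and negative gold standard links in data E.
    total_pos, total_neg: all positive and negative gold standard pairs."""
    posterior_odds = pos_links / neg_links
    prior_odds = total_pos / total_neg
    return math.log(posterior_odds / prior_odds)

# Data set E contains 50 positive and 50 negative links, whereas the gold
# standard as a whole contains 100 positive and 900 negative pairs.
lls = log_likelihood_score(50, 50, 100, 900)
```

A score above zero means that the data are enriched for same-pathway links relative to chance, and a score of zero means no enrichment.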

The gold standard pathway may be generated by pairing annotated proteins in the same term of Gene Ontology-Biological Process (GOBP).
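The pairing of proteins annotated to the same GOBP term can be sketched as follows. The input layout (a dictionary from term to a set of proteins) and the function name are assumptions for illustration.

```python
from itertools import combinations

def gold_standard_positive_pairs(term_to_proteins):
    """Return all unordered protein pairs that share at least one term."""
    pairs = set()
    for proteins in term_to_proteins.values():
        # sorting makes (A, B) and (B, A) collapse to the same tuple
        pairs.update(combinations(sorted(proteins), 2))
    return pairs

# Hypothetical annotation: GOBP term -> proteins annotated to it
annotation = {"GO:0000001": {"A", "B", "C"}, "GO:0000002": {"B", "C"}}
positive_pairs = gold_standard_positive_pairs(annotation)
```

A pair that shares several terms appears only once in the resulting set.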

An ontology can be defined as a data model representing a specific domain, that is, a set of formal vocabularies describing the concepts belonging to that domain and the relationships among those concepts. For example, a formal vocabulary describing the relationships among organisms classified by taxonomic rank can be defined as an ontology.

The Gene Ontology refers to a system, provided by the Gene Ontology Consortium, that classifies biological terms or vocabularies.

The gold standard pathway links serve as a reference for establishing the co-pathway network: when a new link created in the constructed co-pathway network coincides with a gold standard pathway link, the network can be presumed to have been established successfully.

The log-likelihood score can be converted to a probability score that follows the power-function distribution. The p value of the power function distribution can be calculated by the following equation (6).

[Equation 6]

Figure 112016032793544-pat00012

Here, α is a shape parameter that determines the slope of the distribution. As α increases, the p value of the power function distribution may increase exponentially.

For a specific protein i, the protein-pathway association score (PPA score) can be calculated according to Equation (7) using the value converted by the power function distribution (FIG. 1B).

 [Equation 7]

Figure 112016032793544-pat00013

S_min(f) is the minimum S_i(f) value for a particular biological pathway f, and α(f) can be calculated by:

[Equation 8]

Figure 112016032793544-pat00014

Here, S_i(f) can be calculated by summing log likelihood scores (LLS): it is the sum of the log likelihood scores of all protein-protein links connected to gene i in the network.
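The summation of link scores can be sketched as below. The text does not state explicitly which links contribute to S_i(f); this sketch assumes that only links from gene i to members of pathway f are summed. The network layout (gene to a dictionary of neighbor scores) and the function name are also assumptions.

```python
def summed_lls(gene, pathway_members, network):
    """Sum the LLS of links from `gene` to neighbors annotated to the pathway.
    network: dict mapping gene -> {neighbor: LLS of the link}."""
    neighbors = network.get(gene, {})
    return sum(lls for nb, lls in neighbors.items() if nb in pathway_members)

# Hypothetical network: gene A is linked to B, C, and D
network = {"A": {"B": 2.0, "C": 1.5, "D": 0.5}}
s_a = summed_lls("A", {"B", "C"}, network)   # D is outside the pathway
```

A gene with no links in the network receives a sum of zero.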

Meanwhile, the relationship between the domain and the biological pathway can be evaluated using the protein-biological pathway association score (PPA score) and the domain profile.

The association between the domain and the biological pathway can be calculated by a domain-pathway association score (DPA score) (FIG. 1C).

For a particular domain j, the domain-pathway association score (DPA score) may be calculated according to:

[Equation 9]

Figure 112016032793544-pat00015

In this case, K denotes the set of proteins that include domain j, and c_ij denotes the cell of the domain profile matrix for the i-th protein and the j-th domain.
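Equation 9 is not reproduced in this text. Using only the symbols defined above (the protein set K, the profile cell c_ij, and the PPA score), one plausible reading is an average of PPA scores over the proteins that contain the domain. The sketch below assumes that reading; whether the original equation normalizes by the size of K is not known, and the function name is hypothetical.

```python
import numpy as np

def domain_pathway_association(profile, ppa):
    """Assumed DPA_j(f): mean of PPA_i(f) over proteins i with c_ij = 1.
    profile: proteins x domains 0/1 matrix; ppa: proteins x pathways scores."""
    profile = np.asarray(profile, dtype=float)
    ppa = np.asarray(ppa, dtype=float)
    summed = profile.T @ ppa                  # sum over i of c_ij * PPA_i(f)
    sizes = profile.sum(axis=0)[:, None]      # |K| for every domain j
    # domains contained in no protein keep a score of 0
    return np.divide(summed, sizes, out=np.zeros_like(summed), where=sizes > 0)

profile = [[1, 0], [1, 1]]                    # 2 proteins x 2 domains
ppa = [[0.2, 0.0], [0.6, 0.4]]                # 2 proteins x 2 pathways
dpa = domain_pathway_association(profile, ppa)
```

The result has one row per domain and one column per pathway.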

In addition, the domain information content score (DOMICS) can be calculated using the domain-biological pathway association score (DPA score).

The domain information content score (DOMICS) can be calculated according to the following equation (10).

[Equation 10]

Figure 112016032793544-pat00016

DPA_j(f) denotes the association score of domain d_j for a specific biological pathway f, and GC_j denotes the Gini coefficient of the domain over all pathways.

The Gini coefficient can be calculated according to Equation (11).

[Equation 11]

Figure 112016032793544-pat00017

The Gini coefficient is zero when the DPA score of a particular domain is the same across all biological pathways, and is maximized when a DPA score is given only for a single pathway.
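Equation 11 is not reproduced in this text. The sketch below uses the standard sorted-rank formula for the Gini coefficient, which is an assumption about the form used; it yields 0 for a perfectly even distribution of scores and (n - 1) / n when a single pathway holds the entire score.

```python
import numpy as np

def gini_coefficient(scores):
    """Standard Gini coefficient of non-negative scores."""
    x = np.sort(np.asarray(scores, dtype=float))
    n = x.size
    total = x.sum()
    if n == 0 or total == 0:
        return 0.0
    ranks = np.arange(1, n + 1)
    return float(2 * (ranks * x).sum() / (n * total) - (n + 1) / n)

even = gini_coefficient([1.0, 1.0, 1.0, 1.0])          # same DPA for every pathway
concentrated = gini_coefficient([0.0, 0.0, 0.0, 1.0])  # DPA for one pathway only
```

A domain whose association is concentrated in few pathways therefore has a high coefficient, which is the property a specificity score needs.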

A pathway-specific domain (PSD) can be selected according to the calculated domain information content score.

Domains are classified into biological pathway-specific domains (PSDs) and biological pathway-non-specific domains (NSDs) according to the biological pathway specificity calculated by the domain information content score (DOMICS).

The threshold of the domain information content score may be set according to a predetermined criterion, and a domain exceeding the threshold may be designated a biological pathway-specific domain.
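The thresholding step can be sketched as follows. The sketch assumes that a domain counts as pathway-specific when at least one of its domain-pathway scores exceeds the threshold; the function name, the input layout, and the example scores are hypothetical.

```python
def classify_domains(domics, threshold):
    """Split domains into PSD and NSD sets.
    domics: dict mapping (domain, pathway) -> DOMICS value."""
    all_domains = {domain for domain, _ in domics}
    psd = {domain for (domain, _), score in domics.items() if score > threshold}
    return psd, all_domains - psd

# Hypothetical scores for three domains across two pathways
scores = {("d1", "f1"): 0.30, ("d1", "f2"): 0.01,
          ("d2", "f1"): 0.02, ("d3", "f2"): 0.06}
psd, nsd = classify_domains(scores, threshold=0.056)
```

Every domain lands in exactly one of the two sets.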

Since a biological pathway-specific domain has a relatively high domain information content score, its disease relevance can be presumed to be higher than that of a biological pathway-non-specific domain.

According to another aspect of the present invention, there is provided a method for screening disease-associated genes using the biological pathway-specific domain (PSD) screening results.

Since the vast majority of disease genes are associated with biological pathways, the quantitative level of biological pathway-specific domains (PSDs) with high pathway specificity can be used as information for predicting the association between a specific gene and a disease. That is, when a specific gene contains a plurality of PSDs, the gene can be inferred to be a disease-associated gene.

In addition, the screening method can utilize genome-wide association study (GWAS) results.

GWAS can detect causal genetic variants that influence disease through large-scale single nucleotide polymorphism (SNP) analysis, but its statistical power is insufficient because conditions such as limited sample sizes and population structure are not finely controlled.

Thus, genes discovered by GWAS may have little or no genetic link to disease, but the accuracy of prediction can be improved by combining information on biological pathway-specific domains (PSDs).

That is, since the method for determining biological pathway specificity according to an embodiment of the present invention identifies disease-associated genes by a different approach from GWAS, the GWAS results and the information on biological pathway-specific domains (PSDs) can provide complementary predictive information on disease-associated genes.

The present invention will be further described with reference to the following examples, but it should be apparent that the present invention is not limited by the following examples.

Example 1: Generation of Domain Profiles

The inventors used the BioMart search tool to download domain occurrence information for human proteins from the InterPro database (v38). Based on the domain occurrence information, a domain profile was generated as an array of Boolean values indicating the presence or absence of each domain as 1 or 0.

A total of 8,362 InterPro domains were used to generate domain profiles for genes encoding 17,013 human proteins.

Example 2: DOMICS Calculation

To construct a co-pathway network and calculate DOMICS, the similarity of the generated domain profiles was analyzed based on weighted mutual information.

DOMICS was calculated for the human InterPro domains associated with GOBP pathways, and the domain-pathway associations were sorted in descending order of DOMICS.

The present inventors used the gold standard pairs between InterPro domains and GOBP biological pathways collected from the InterPro2GO annotation as a reference for domain-pathway associations.

A total of 49,636 domain-pathway associations between 5,253 InterPro domains and 407 GOBP biological pathways were analyzed.

Based on the biological pathway specificity of domains calculated by DOMICS, human InterPro domains were classified as pathway-specific domains (PSDs) or pathway-non-specific domains (NSDs).

The InterPro2GO annotation, which maps associations between InterPro domains and GOBP terms by manual curation, was used to collect a gold standard dataset of domain-pathway pairs.

The positive gold standard set included 2,535 domain-pathway pairs comprising 2,286 human InterPro domains and 374 GOBP terms.

The negative gold standard set included the 852,461 remaining pairs between the 2,286 InterPro domains and 374 GOBP terms.
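The construction of the negative set as "all remaining pairs" can be sketched as a set difference between the full domain-by-term product and the positive set. The function name and the toy identifiers are assumptions for illustration.

```python
from itertools import product

def negative_gold_standard(domains, terms, positive_pairs):
    """Return every (domain, term) pair that is not in the positive set."""
    return set(product(domains, terms)) - set(positive_pairs)

# Toy example with 2 domains and 3 terms, of which 2 pairs are positive
domains = ["d1", "d2"]
terms = ["f1", "f2", "f3"]
positives = {("d1", "f1"), ("d2", "f3")}
negatives = negative_gold_standard(domains, terms, positives)
```

The positive and negative sets are disjoint and together cover the full product of domains and terms.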

For each group of 1,000 domain-pathway pairs, taken in descending order of DOMICS, the log likelihood score and the overlap with the positive gold standard set were measured.
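The binned evaluation can be sketched as below: pairs are ranked by DOMICS, cut into fixed-size bins, and each bin is scored by its overlap with the positive set and by a log likelihood score. The LLS form (log of bin odds over prior odds), the function name, and the handling of bins with no positive or no negative pairs are assumptions.

```python
import math

def binned_lls(scored_pairs, positives, negatives, bin_size=1000):
    """Return one (overlap, lls) tuple per bin of ranked pairs.
    scored_pairs: dict mapping (domain, pathway) -> DOMICS value.
    lls is None when a bin lacks positive or negative gold standard pairs."""
    ranked = sorted(scored_pairs, key=scored_pairs.get, reverse=True)
    prior_odds = len(positives) / len(negatives)
    results = []
    for start in range(0, len(ranked), bin_size):
        chunk = ranked[start:start + bin_size]
        n_pos = sum(pair in positives for pair in chunk)
        n_neg = sum(pair in negatives for pair in chunk)
        lls = math.log((n_pos / n_neg) / prior_odds) if n_pos and n_neg else None
        results.append((n_pos, lls))
    return results

scored = {("d1", "f1"): 0.9, ("d1", "f2"): 0.8, ("d2", "f1"): 0.2, ("d2", "f2"): 0.1}
positives = {("d1", "f1"), ("d1", "f2"), ("d2", "f1")}
negatives = {("d2", "f2")}
bins = binned_lls(scored, positives, negatives, bin_size=2)
```

Plotting the per-bin score against rank shows where along the ranking the enrichment for gold standard pairs falls off, which is one way to choose a DOMICS threshold.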

Based on a DOMICS threshold of 0.056, domain-pathway pairs were categorized as PSD or NSD, yielding 16,000 domain-pathway pairs containing 4,506 PSDs (FIG. 2A).

The thousands of domain-pathway pairs above the threshold showed positive log likelihood scores and overlapped significantly with the gold standard data (p ≤ 0.01). The remaining 3,856 InterPro domains were classified as NSDs.

DOMICS values showed a strong positive correlation with the likelihood of participating in the same GOBP pathway.

Example 3: Analysis of the frequency of disease-associated variants in PSDs and NSDs

The frequencies of disease-associated variants in PSDs and NSDs were compared using pathogenic germline variants collected from independent sources.

These variants were collected from i) SNPs from GWASdb, ii) OMIM disease gene mutations, and iii) variants from the ClinVar database.

SNPs associated with approximately 1,610 GWAS traits were collected from GWASdb (p < 1e-7) and mapped to dbSNP build 142 on genome assembly GRCh37/hg19. In this way, 26,342 disease-associated SNPs were identified.

Of these SNPs, only 966 (approximately 3.7%) were located in protein-coding regions, and 569 SNPs were located in InterPro domain regions.

To analyze cancer germline variants, 51 germline variants associated with GWASdb cancer studies were collected, and 20,945 somatic variants from TCGA breast cancer patients were collected.

In addition, 1,779 and 10,778 variants of OMIM disease genes were collected from SwissVar (http://swissvar.expasy.org) and dbSNP (http://www.ncbi.nlm.nih.gov/snp), respectively, generating an OMIMVar set of 11,024 OMIM disease-associated variants.

Of these, 9,050 variants were located in protein domain regions. ClinVar is a major public archive containing data on correlations between human variants and phenotypes. We collected 13,465 pathogenic SNPs from ClinVar and found 10,680 variants located in protein domain regions.

A null model was constructed using variants expected to have a neutral effect, derived from the HumVar neutral training set of PolyPhen-2.

The HumVar neutral training set consists of common human nsSNPs (minor allele frequency [MAF] > 1%) without disease-related annotation, which are regarded as non-deleterious variants.

To compare the frequency of neutral or disease-associated variants between PSDs and NSDs, the variation rate (VR), based on the total number of variants in each domain, was calculated over the entire genome region according to the following equation.

Figure 112016032793544-pat00018

Background variation rate (BVR) was calculated according to the following formula.

Figure 112016032793544-pat00019

The variation rate (VR) for each variant set tested was normalized by the background variation rate (BVR) to give the normalized variation rate (NVR = VR / BVR).
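The VR and BVR equations are available only as images, so the sketch below assumes a simple form: a variation rate is a variant count divided by the length of the region considered, and the normalized rate is the ratio of the two rates as stated above. The function names and example numbers are hypothetical.

```python
def variation_rate(n_variants, region_length):
    """Assumed variation rate: variants per unit length of the region."""
    return n_variants / region_length

def normalized_variation_rate(vr, bvr):
    """NVR = VR / BVR, as stated in the text."""
    return vr / bvr

# Hypothetical counts: 30 variants in 1,000 residues of one domain class,
# against 100 variants in 10,000 residues of background
vr = variation_rate(30, 1000)
bvr = variation_rate(100, 10000)
nvr = normalized_variation_rate(vr, bvr)
```

An NVR above 1 indicates that the domain class carries more variants than the background would predict.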

Example 4: Analysis of the ratio of PSDs and NSDs in groups of domains with different numbers of domain interactions

A total of 590 interfacing domains (IFDs) and 135,166 domain-domain interactions have been reported in the human structural interaction network (hSIN) (Wang et al., Nat Biotechnol 30, 159-164, 2012).

The 590 interfacing domains were classified into 345 PSDs and 245 NSDs, so that PSDs were 1.4 times as numerous as NSDs (FIG. 2B).

The ratio of PSDs to NSDs among interfacing domains with different numbers of domain interactions was determined according to the following formula:

Figure 112016032793544-pat00020

Among the interfacing domains (IFDs), NSDs were more frequent than PSDs among domains with a single interaction or with 121 or more interactions, whereas PSDs were more frequent than NSDs among domains with a moderate number of interactions (2 to 120) (FIG. 2D).

To explain the higher frequency of PSDs among interfacing domains with a moderate number of interactions, the present inventors proposed a model of the relationship between the mutational effect on an IFD and its interaction connectivity (FIG. 2E).

Mutations in a sparsely connected domain interfere with the interactions of only a few proteins, so pathogenicity is not manifested and such mutations may not be found in patients.

However, if mutations occur in a hub interfacing domain, many protein interactions involved in various biological pathways are inhibited, leading to systemic failure. In this case, the mutation can result in a lethal phenotype and is therefore not found in the population.

On the other hand, mutations in an interfacing domain with a moderate number of interactions can inhibit protein interactions within a biological pathway, on a scale corresponding to the size of the pathway.

Since the vast majority of disease genes are associated with biological pathways, such mutations can be found in patients because they disrupt disease-related biological pathways.

Thus, the frequent observation of PSDs among domains with a moderate number of domain interactions indicates that PSDs are closely related to genetic diseases caused by mutations that inhibit protein interactions in biological pathways.

Experimental Example 4: Prioritization of disease gene candidates based on PSDs

Considering that PSDs may be closely related to genetic variation, the present inventors anticipated that PSDs could be useful for identifying disease genes.

For example, a genome-wide association study (GWAS) can generally test more than one million SNPs for association with a disease phenotype, but it identifies only a few candidates because of the very stringent significance threshold (p ≤ 10e-7).

However, a GWAS typically detects a much larger number of candidates at a moderate significance level that falls short of this threshold (e.g., p-values between 10e-7 and 10e-3).

More of the candidate genes at the moderate GWAS level can be recovered by meta-analysis with a larger sample size, but this is costly.

The present inventors hypothesized that disease-related features would make it possible to distinguish actual disease genes from non-disease genes among the genes at the moderate GWAS level.

Accordingly, we tested whether disease-associated PSDs could identify disease genes in two publicly available GWAS data sets, CARDIoGRAM and PGC (FIG. 3A). CARDIoGRAM contains study data on coronary artery disease (CAD), and the Psychiatric Genomics Consortium (PGC) data set contains study data on schizophrenia (SCZ).

We tested whether candidate genes at the moderate GWAS level (p-values between 1e-7 and 1e-3) could be successfully prioritized by disease-associated PSDs.
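The p-value window described above can be expressed as a simple filter. The following is a minimal sketch, not the inventors' implementation: the function name, the toy SNP identifiers, and the boundary handling (lower bound exclusive, upper bound inclusive) are assumptions, since the text does not state how the boundaries are treated.

```python
from typing import Dict

GENOME_WIDE = 1e-7   # lower bound of the window quoted in the text
MODERATE_MAX = 1e-3  # upper bound of the window quoted in the text


def moderate_snps(pvalues: Dict[str, float],
                  low: float = GENOME_WIDE,
                  high: float = MODERATE_MAX) -> Dict[str, float]:
    """Keep SNPs whose p-value lies in (low, high].

    Whether each boundary is inclusive is an assumption of this sketch.
    """
    return {snp: p for snp, p in pvalues.items() if low < p <= high}


# Toy input with hypothetical SNP identifiers.
example = {"rs_a": 3e-9, "rs_b": 5e-5, "rs_c": 0.2, "rs_d": 1e-3}
selected = moderate_snps(example)
```

Here `rs_a` passes the genome-wide threshold and `rs_c` is not significant at all, so neither falls in the window; only SNPs in between are kept.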

The CARDIoGRAM consortium performed a meta-analysis of 22 European GWAS samples imputed to HapMap 2, comprising 22,233 cases and 64,762 controls. The PGC data set is a multi-stage schizophrenia GWAS comprising 36,989 cases and 113,075 controls.

At the moderate GWAS level, 3,188 SNPs (out of 2,420,360) and 54,688 SNPs (out of 9,444,230) were associated with CAD and SCZ, respectively.

Each SNP was assigned to a gene if it was located within 10 kb upstream or downstream of that gene. For CAD, the 3,188 SNPs were assigned to 204 genes; for SCZ, the 54,688 SNPs were assigned to 1,044 genes.
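The 10 kb assignment rule can be illustrated with a short sketch. The data structures, coordinates, and names below are hypothetical, and the text does not say how a SNP near several genes is handled; this sketch simply assigns it to every gene whose flanked interval contains it.

```python
from typing import Dict, List, NamedTuple, Set


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int  # first base of the gene, inclusive
    end: int    # last base of the gene, inclusive


class Snp(NamedTuple):
    snp_id: str
    chrom: str
    pos: int


def assign_snps_to_genes(snps: List[Snp], genes: List[Gene],
                         flank: int = 10_000) -> Dict[str, Set[str]]:
    """Map each gene name to the SNP ids within `flank` bp of the gene."""
    hits: Dict[str, Set[str]] = {}
    for gene in genes:
        lo, hi = gene.start - flank, gene.end + flank
        for snp in snps:
            if snp.chrom == gene.chrom and lo <= snp.pos <= hi:
                hits.setdefault(gene.name, set()).add(snp.snp_id)
    return hits


# Toy coordinates, not real annotations.
genes = [Gene("geneA", "1", 50_000, 60_000), Gene("geneB", "1", 100_000, 120_000)]
snps = [Snp("s1", "1", 41_000), Snp("s2", "1", 75_000),
        Snp("s3", "1", 125_000), Snp("s4", "2", 55_000)]
gene_to_snps = assign_snps_to_genes(snps, genes)
```

The nested loop is quadratic and only suitable for illustration; for millions of SNPs an interval index or a sorted sweep would be the usual choice.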

Meanwhile, to identify the PSDs associated with CAD and SCZ, the GOBP biological pathways associated with CAD or SCZ were first identified with Fisher's exact test (p < 0.01), which assessed the degree of overlap between the genes of each GOBP term and the genes of each disease. CAD- and SCZ-associated genes were collected from the OMIM and DO databases: 212 CAD-associated genes and 233 SCZ-associated genes.
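The overlap test between a pathway gene set and a disease gene set can be sketched with the hypergeometric tail, which is the one-sided form of Fisher's exact test. The one-sided choice, the function name, and the toy counts are assumptions; the text only states that Fisher's test was used with p < 0.01.

```python
from math import comb


def fisher_enrichment_p(overlap: int, pathway_size: int,
                        disease_size: int, universe: int) -> float:
    """One-sided Fisher's exact test for over-representation.

    Returns P(X >= overlap), where X is the number of disease genes among
    `pathway_size` genes drawn without replacement from `universe` genes,
    of which `disease_size` are disease genes.
    """
    upper = min(pathway_size, disease_size)
    total = comb(universe, pathway_size)
    tail = sum(
        comb(disease_size, k) * comb(universe - disease_size, pathway_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 20 genes in total, 5 disease genes, a pathway of 4 genes,
# 3 of which are disease genes.
p = fisher_enrichment_p(overlap=3, pathway_size=4, disease_size=5, universe=20)
is_associated = p < 0.01  # threshold quoted in the text
```

Exact integer arithmetic via `math.comb` avoids floating-point underflow for large gene universes; a statistics library would normally be used in practice.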

The genes were then prioritized by the number of disease-associated PSDs they contain. To identify the PSDs associated with CAD or SCZ, PSD-pathway associations were converted into PSD-disease associations based on the overlap between disease-associated genes and pathway-associated genes.

Only GOBP pathways containing at least 5 genes were considered. By combining the associations between biological pathways and CAD or SCZ with the PSD-pathway associations, 2,664 PSDs were linked to CAD through 97 CAD-related GOBP pathways, and 2,517 PSDs were linked to SCZ through the SCZ-related GOBP pathways (Tables S2 and S3).
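The conversion from PSD-pathway associations to PSD-disease associations amounts to joining two mappings on the pathway. A minimal sketch follows, assuming each PSD is already annotated with its associated pathways; all identifiers and the toy data are hypothetical.

```python
from typing import Dict, Set

MIN_PATHWAY_GENES = 5  # minimum pathway size stated in the text


def eligible_pathways(pathway_genes: Dict[str, Set[str]],
                      min_genes: int = MIN_PATHWAY_GENES) -> Set[str]:
    """Pathways with at least `min_genes` annotated genes."""
    return {p for p, genes in pathway_genes.items() if len(genes) >= min_genes}


def psd_disease_links(psd_to_pathways: Dict[str, Set[str]],
                      disease_pathways: Set[str]) -> Dict[str, Set[str]]:
    """For each PSD, keep the disease-associated pathways it is linked to.

    PSDs without any disease-associated pathway are dropped.
    """
    links: Dict[str, Set[str]] = {}
    for psd, pathways in psd_to_pathways.items():
        shared = pathways & disease_pathways
        if shared:
            links[psd] = shared
    return links


# Toy data: GO:B is disease-associated but too small to be considered.
pathway_genes = {
    "GO:A": {"g1", "g2", "g3", "g4", "g5"},
    "GO:B": {"g1", "g2"},
    "GO:C": {"g6", "g7", "g8", "g9", "g10", "g11"},
}
disease_pathways = {"GO:A", "GO:B"}  # assumed to have passed the overlap test
psd_to_pathways = {"dom1": {"GO:A", "GO:C"}, "dom2": {"GO:B"}, "dom3": {"GO:C"}}

usable = eligible_pathways(pathway_genes) & disease_pathways
links = psd_disease_links(psd_to_pathways, usable)
```

In this toy case only `dom1` ends up linked to the disease, through `GO:A`.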

Candidate genes at the moderate GWAS level were then prioritized by their number of CAD- or SCZ-associated PSDs. In total, 202 candidate genes for CAD and 934 candidate genes for SCZ contained at least one disease-associated PSD.

Among these, genes containing three or more disease-associated PSDs were selected as the GWAS ∩ PSD set: 38 genes for CAD and 157 genes for SCZ.

In addition, gene sets selected by the moderate GWAS level alone (GWAS set) or by disease-associated PSDs alone (PSD set) were prepared.
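The prioritization step can be sketched as counting disease-associated PSDs per candidate gene and applying a cutoff. This sketch assumes that the selection criterion is "three or more disease-associated PSDs per gene"; the function names and toy data are hypothetical.

```python
from typing import Dict, List, Set


def count_disease_psds(gene_domains: Dict[str, Set[str]],
                       disease_psds: Set[str]) -> Dict[str, int]:
    """Number of disease-associated PSDs contained in each candidate gene."""
    return {gene: len(domains & disease_psds)
            for gene, domains in gene_domains.items()}


def prioritize(gene_domains: Dict[str, Set[str]],
               disease_psds: Set[str],
               min_psds: int = 1) -> List[str]:
    """Genes with at least `min_psds` disease-associated PSDs, best first.

    Ties are broken alphabetically so that the ranking is deterministic.
    """
    counts = count_disease_psds(gene_domains, disease_psds)
    kept = [g for g, n in counts.items() if n >= min_psds]
    return sorted(kept, key=lambda g: (-counts[g], g))


# Toy candidate genes at the moderate GWAS level and their domains.
gene_domains = {
    "g1": {"d1", "d2", "d3", "d9"},
    "g2": {"d1"},
    "g3": {"d9"},
}
disease_psds = {"d1", "d2", "d3", "d4"}

with_any_psd = prioritize(gene_domains, disease_psds, min_psds=1)
gwas_and_psd = prioritize(gene_domains, disease_psds, min_psds=3)
```

With `min_psds=1` the sketch mirrors the "at least one disease-associated PSD" count, and with `min_psds=3` it mirrors the GWAS ∩ PSD set.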

To validate the gene sets, 604 CAD genes and 937 SCZ genes were collected from disease-specific databases (CADgene V2.0 and SZdatabase, respectively). For a conservative validation, the CAD and SCZ genes that had been used to associate biological pathways with each disease were excluded, leaving 466 CAD genes and 767 SCZ genes as the final validation sets.

To compare predictions based on the GWAS level with predictions based on PSDs, two sets were compared: a set of predictions ranked by the p-values of genes at the moderate GWAS level (GWAS set), and a set of predictions ranked by the number of disease-associated PSDs in genes at the moderate and lower GWAS levels (PSD set).

Comparing the GWAS set with the PSD set, over 30% more CAD genes were recovered by the PSD set (FIG. 3B), and the difference was even more pronounced for SCZ (FIG. 3C).

In addition, prediction accuracy increased markedly for both CAD and SCZ when GWAS and PSDs were combined, which suggests that GWAS and PSDs provide complementary information for disease gene prediction.

Example 5: Experimental verification of candidate genes in zebrafish

We experimentally verified the predictions of the GWAS ∩ PSD set through morpholino-based loss-of-function phenotype analysis in zebrafish.

Although the vast majority of human disease genes have zebrafish orthologs, some phenotypes, such as those of psychiatric disease, cannot be readily observed in zebrafish. Therefore, the predictions were tested for coronary artery disease (CAD) genes.

In the GWAS ∩ PSD set, we found 23 zebrafish orthologs for the 38 human CAD candidate genes. Four candidate genes (tram1, apod, cypna1, and slc22a2) were selected after excluding genes already known to be associated with CAD and genes ranked highly by GWAS.

A zebrafish model of CAD has not yet been established. However, the present inventors found that the human orthologs (207) of zebrafish genes involved in the development of the heart or blood vessels are significantly enriched among the genes annotated for CAD in OMIM or DO (p < 1.29e-4, Fisher's exact test) or in CADgeneDB (p < 3, Fisher's exact test), which confirms that CAD and cardiovascular development are closely related at the biological pathway level.

In other words, CAD genes can be identified based on abnormal cardiac and vascular phenotypes during the embryonic development of zebrafish.

To confirm that CAD genes can be identified based on cardiac and vascular phenotypes, atp2a2b, which is associated with CAD, was used as a positive control.

The morpholino for each test gene was microinjected into embryos, and the heart and blood vessels were examined using a fluorescence stereomicroscope (FIG. 4A).

In the majority of morpholino-injected embryos, abnormal heart or blood vessel phenotypes were observed not only for the CAD-associated atp2a2b but also for three of the four candidate genes (tram1, cypna1, and slc22a2), which strongly suggests an association between these genes and CAD (FIGS. 4B and 4C).

Those skilled in the art will understand that the foregoing description of the present invention is for illustrative purposes only, and that various changes and modifications may be made without departing from the spirit or essential characteristics of the present invention. It is therefore to be understood that the above-described embodiments are illustrative in all aspects and not restrictive. For example, each component described as a single entity may be implemented in a distributed manner, and components described as being distributed may also be implemented in a combined form.

The scope of the present invention is defined by the appended claims, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as being included within the scope of the present invention.

Claims (8)

A method for determining biological pathway specificity of a protein domain performed in a computer processor,
(a) analyzing the similarity of a protein inter-domain profile based on weighted mutual information (WMI) measurement;
(b) constructing a mutual co-pathway network of proteins based on the similarity of the protein domain profile; and
(c) assessing the association between the protein domain and the biological pathway,
wherein the weighted mutual information measurement is calculated by Equation 2 below after assigning to each protein domain a weight value that is calculated by Equation 1 below and defined according to the rarity of the domain, thereby quantitatively determining the biological pathway specificity of the protein domain.
[Equation 1]
Figure 112017126940396-pat00031

where c_kj has a value of 1 when protein k includes domain j, and a value of 0 otherwise.
[Equation 2]
Figure 112017126940396-pat00032

where H_ω(X) and H_ω(Y) are the weighted entropies of protein X and protein Y, and H_ω(X, Y) is the weighted joint entropy of proteins X and Y.
(deleted)
The method according to claim 1,
wherein step (c) comprises: evaluating, based on Bayesian statistics, the association between a protein belonging to the co-pathway network and a biological pathway;
evaluating the association between the domain and the biological pathway using the protein-pathway association and the domain profile; and
measuring the biological pathway information content of the domain.
The method of claim 3,
wherein, in step (c), the association between a protein belonging to the co-pathway network and a biological pathway is calculated by converting a Protein-Pathway Association score (PPA score) into a probability score according to Equations 5 to 7 below.
[Equation 5]
Figure 112016032793544-pat00022

[Equation 6]
Figure 112016032793544-pat00023

[Equation 7]
Figure 112016032793544-pat00024
The method of claim 3,
In the step (c), the association between the domain and the biological pathway is calculated as a Domain-Pathway Association score (DPA score) according to Equation (9).
[Equation 9]
Figure 112017126940396-pat00025

where K denotes the set of proteins including domain j, and c_ij denotes the cell of the domain profile matrix for the i-th protein and the j-th domain.
The method of claim 3,
wherein, in step (c), the biological pathway information content of the domain is calculated as a Domain Information Content Score (DOMICS) according to Equation 10.
[Equation 10]
Figure 112017126940396-pat00026

where DPA_j(f) denotes the association score of domain d_j for a specific biological pathway f, and GC_j denotes the Gini coefficient of the domain over all pathways.
The method according to claim 6,
further comprising determining a pathway-specific domain (PSD) according to the calculated domain information content score.
The method of claim 7,
further comprising selecting a disease-associated gene by applying the result of the pathway-specific domain (PSD) determination to a Genome-Wide Association Study (GWAS) result.

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020160041518A KR101853916B1 (en) 2016-04-05 2016-04-05 Method for determining pathway-specificity of protein domains, and its appication for identifying disease genes

Publications (2)

Publication Number Publication Date
KR20170114504A KR20170114504A (en) 2017-10-16
KR101853916B1 true KR101853916B1 (en) 2018-06-20

Family

ID=60295731

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020160041518A KR101853916B1 (en) 2016-04-05 2016-04-05 Method for determining pathway-specificity of protein domains, and its appication for identifying disease genes

Country Status (1)

Country Link
KR (1) KR101853916B1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200372972A1 (en) * 2017-11-13 2020-11-26 Industry-University Cooperation Foundation Hanyang University Sample data analysis method based on genomic module network

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060069512A1 (en) * 1999-04-15 2006-03-30 Andrey Rzhetsky Gene discovery through comparisons of networks of structural and functional relationships among known genes and proteins
KR20110035716A (en) * 2009-09-30 2011-04-06 이화여자대학교 산학협력단 The method for searching signaling pathway of protein using gene ontology and the system thereof, and the method for evaluating signaling pathway of protein
KR20120089034A (en) * 2011-02-01 2012-08-09 충북대학교 산학협력단 Method for Identifying Cancer Related Protein Domains
KR20150092780A (en) * 2014-02-05 2015-08-17 연세대학교 산학협력단 Improvement method of gene network using domain-specific phylogenetic profiles similarity
KR101568399B1 (en) * 2014-12-05 2015-11-12 연세대학교 산학협력단 Systems for Predicting Complex Traits associated genes in plants using a Arabidopsis gene network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS641469A (en) * 1987-06-23 1989-01-05 Mitsubishi Electric Corp Hydromagnetic actuator
KR101629773B1 (en) * 2014-07-16 2016-06-13 한국과학기술원 Device for selecting candidate of drug peptide and method for selecting candidate of drug peptide using the same

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Bioinformatics 32(18):2824-2830 (2016.05.20.) *
BioMed Research International, 2014:641469 (2014) *
FEBS Letters, 583:1703-1712 (2009) *

Legal Events

Date Code Title Description
E902 Notification of reason for refusal
E701 Decision to grant or registration of patent right
GRNT Written decision to grant