KR101810527B1 - Algorithm for the construction of a regulatory network for more than 10,000 genes and method for the identification of causal genes in drug responses using the same algorithm

Publication number
KR101810527B1
Authority
KR
South Korea
Prior art keywords
gene
network
networks
genes
link
Prior art date
Application number
KR1020150063824A
Other languages
Korean (ko)
Other versions
KR20160132223A (en
Inventor
최정균
양우진
김권일
Original Assignee
한국과학기술원
Priority date
Filing date
Publication date
Application filed by 한국과학기술원
Priority to KR1020150063824A
Publication of KR20160132223A
Application granted
Publication of KR101810527B1

Classifications

    • G06F19/12
    • G06F19/22

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

Disclosed is a method of generating a gene regulation model using gene regulation networks, each of which can be represented by an N × N square matrix whose (i, j)-th element is a link value indicating the regulatory relationship between gene i and gene j.

Description

The present invention relates to an algorithm for constructing a transcriptional regulatory network among more than 10,000 genes, and to a gene discovery method using the same.

The present invention relates to computer technology for generating a gene regulation network model that represents the regulatory relationships between genes.

The accumulation of biological big data produced by next-generation sequencing (NGS) technology is changing the way genome research is done. Previously, a hypothesis was established first and then tested by experiment; now it is common to analyze published biological big data first and conduct additional experiments afterward.

However, this strategy has been adopted by only a small number of researchers, because it requires information technology in addition to conventional biological experiments. For example, more than 1,000 terabytes of data are now available on the NCBI website. Reanalyzing NGS data requires a wide range of analytical knowledge: 1) finding the parts of the published data that are relevant to the hypothesis, 2) acquiring the data source and standardizing the file format for the purpose at hand, 3) checking the FASTQ file information and mapping it to a specific genome, and 4) visualizing the results and giving them meaning through statistical processing.

Each of these steps requires its own expertise, which makes the analysis difficult for general experimental researchers to carry out. A methodology or device for extracting intuitive and interpretable information from enormous amounts of genomic information is therefore needed.

Gene regulation networks are essential for revealing the causative genes that drive the disturbance of the many genes observed in each disease. Microarray and RNA-seq experiments, which have long been used in disease studies, have been successful in determining which genes are disrupted in certain diseases, but they are limited in their ability to identify causative genes. If the causative gene for a specific disease can be identified, the pathogenic mechanism can be studied at the genetic level.

In addition, a global gene regulation network established for a particular cell line can explain the genes that are differentially expressed between disease and normal states or before and after drug treatment, and can provide a basis for understanding the mechanism behind these differentially expressed genes. Demand for such networks in medicine and pharmacy is therefore expected to be high. In particular, in complex diseases such as cancer the gene regulation network itself is often disturbed, and the same phenotype can arise from different pathogenic mechanisms, so statistical inference is often difficult. For example, it is very difficult to predict the response of genes to a particular drug, or its adverse effects, through simple statistical inference. It is therefore necessary to use a network model that integrates the transcript data of multiple samples with genetic data such as the genetic factors and mutations discovered by GWAS.

The most important model for constructing gene expression networks is based on transcription factors. Transcription factors are among the most important elements of gene expression regulation because they regulate expression by binding directly to the promoter or enhancer of the target gene. To infer whether a transcription factor is involved in the regulation of a specific gene, the expression of the target gene must be modeled on the basis of genomic information.

With the development of next-generation sequencing (NGS) methods, large projects such as ENCODE have produced vast quantities of genome and transcript information for hundreds of cell lines. Large-scale observations of gene expression patterns obtained with microarrays and RNA-seq have been published in large databases. From this information it is possible to extract all possible gene regulatory pairs for a particular cell line. If only the regulatory pairs that best describe the actual observations are kept, a gene regulation network can be constructed. However, specific models and software for doing so have not yet been systematized. The present invention therefore provides a method for analyzing the mutual regulatory relationships between genes based on epigenomic data and a transcription factor model.

Meanwhile, data on differentially expressed genes (DEGs) have been extracted through experiments that compare disease with normal samples or that compare specific treatments such as drugs. However, DEG analysis alone cannot reveal which upstream genes are responsible, because of the complexity of the transcriptional regulatory network. In addition, existing methodologies for developing therapeutic drugs involve a very complicated process with excessive cost and time.

Two existing techniques that may be an alternative to the present invention are the Bayesian network methodology and the MCMC methodology using metabolome data as follows. Herein, the superiority of the present invention will be described by comparing the method according to the present invention with the following two methods.

1. Bayesian network methodology

The development of next-generation sequencing (NGS) technology, which sequences genomes at high speed, has enabled the study of the whole genome sequences of many species, and large-scale microarrays allow the expression of thousands of genes to be observed simultaneously, providing a large amount of data for gene expression studies. Based on these data, many researchers have sought to identify genetic factors associated with disease in large patient populations in order to find disease mechanisms, leading to the discovery of thousands of genes associated with hundreds of diseases and traits. However, the mechanism of action of most genetic variants found outside the 1% of the genome that encodes protein remains unknown, and further research is required.

Among these efforts, research has focused on finding the real 'cause' of disease by analyzing the location and state, on the regulatory network, of the target genes controlled by genetic variants. Recently, the Bayesian network model has been used as an appropriate methodology for this. Because the Bayesian network model is built on conditional probabilities that reflect cause and effect among various factors, it is suited to finding causal relationships among disease-related genes and is a necessary model for identifying the key genes that cause specific diseases. Over the last 10 years, a methodology has been studied and developed that uses Bayesian networks to construct a gene regulation network model from gene expression levels and from information extracted through expression QTL (eQTL) analysis for a specific disease; software tools have been built, and many results obtained with them have already been reported.

2. MCMC methodology using metabolome data

The existing MCMC methodology was implemented as software called RIMBANet. Jun Zhu, the developer of this software, has been developing it since the early 2000s and has conducted model studies to verify the methodology. In the late 2000s and early 2010s, papers applying this methodology to human disease were published by several groups. In 2012, Jun Zhu, the original author, presented a version of the model that integrates metabolic information as well as genomic information.

The authors of that paper showed that the presence or absence of metabolites can have a significant impact on gene expression and emphasized the use of metabolic information; this was the main motivation of the paper. The fact that simple life forms such as yeast show dramatic changes in gene expression in response to changes in metabolic processes suggests that the same kind of research is also required in the cells of higher animals, which are affected by various hormones, signaling substances, and the microenvironment.

The present invention presents a method for analyzing the mutual regulatory relationships between genes based on epigenomic data and a transcription factor model. In particular, it proposes an efficient analysis method in which a gene regulation network is expressed using the data representation proposed in the present invention. It also proposes an efficient analysis method that introduces a fitness score for each gene regulation network and uses it in the course of an evolutionary algorithm.

For the present invention, the collection and processing of public data has been systematized so that large amounts of data can be gathered for network integration research. Possible gene regulatory pairs were extracted from large-scale genomic data such as ChIP-seq experiments, which detect where a specific protein is bound on the genome, and DHS (DNase I hypersensitive site) experiments, which identify regions where binding can occur so that a gene can be expressed. Based on this information, the present invention can present a method of constructing cause-and-effect relationships between genes on a network.

For the present invention, in addition to the existing methods of estimating regulatory relationships, the three-dimensional structure of the genome is constructed as a mathematical model. From data derived from three-dimensional epigenome experiments such as Hi-C, the frequency of DNA interactions on the genome can be measured, and on this basis a model of the structure of DNA in actual cells is presented. Based on this model, all possible regulatory pairs can be predicted for every part of the genome that can induce gene expression.

In addition, the present invention proposes a method of constructing a more accurate cause-and-effect network by adding a machine learning algorithm based on an evolutionary algorithm to the methodology described above. This algorithm builds a conditional-probability-based model from gene expression data and extracts, from all possible gene regulatory pairs, the network that best describes the expression of all genes. An optimal gene regulation network is constructed by evaluating randomly selected sets of regulatory pairs with the conditional probability model and crossing the sets that score better. In particular, the process from data collection to construction of the cause-and-effect network has been structured, and a standardized software tool applicable to various experimental methods has been developed.

The software described above can be used to construct a gene regulation network for samples of several diseases. According to the present invention, algorithms for obtaining further biologically and medically meaningful information from the constructed network can also be presented. One of them is an algorithm that finds the causative genes that best explain the gene expression changes caused by a drug response, so that the changes occurring in each sample in response to various drugs or anticancer agents can be understood. This algorithm can predict which drugs will affect the treatment of a particular disease and can help in the search for candidate drugs.

A method according to an aspect of the present invention uses a genetic algorithm.

When the relationships between genes are analyzed using gene expression information, the network model generally has many parameters. This increases the complexity of the network structure, and learning the regulatory network takes too much time. Finding the optimal Bayesian network is known to be an NP-hard problem, which means that it cannot be solved in a practical amount of time for large networks. As an alternative, methodologies such as MCMC are used as heuristic approaches. However, for a network of more than 10,000 genes, as in actual research, it has been confirmed that learning a model takes more than several months in an ordinary PC environment. To overcome this slowness of MCMC, the present invention uses a genetic algorithm as an alternative.
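To give a sense of scale, the size of the search space can be counted directly. The count below is only an illustration: it assumes that every element of the N × N matrix is free to take one of the three link values (+1, -1, 0) given in the embodiments, and it ignores any restriction imposed by prior data.

```latex
\[
  |\mathcal{S}| = 3^{N^{2}}, \qquad
  N = 10^{4} \;\Rightarrow\; |\mathcal{S}| = 3^{10^{8}} \approx 10^{4.77 \times 10^{7}},
  \quad \text{since } \log_{10} 3 \approx 0.477 .
\]
```

Under these assumptions exhaustive enumeration is out of the question, which is consistent with the use of heuristic search in the text.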

A genetic algorithm imitates the process of evolution in nature and is used for problem solving and simulation. A solution to the problem is expressed as a chromosome, and the chromosomes are gradually evolved so that increasingly better solutions are produced as the generations pass. The method represents an answer to the problem in a gene-like form and finds the desired answer by evolving it with evolutionary operators such as crossover and mutation, which is why it is also called an evolutionary algorithm. An evolutionary algorithm performs a parallel search with multiple individuals in a large problem space and, through crossover and mutation, has the advantage of facilitating global search compared with other algorithms.

Genetic algorithms are mainly used for large-scale operations, and they are very efficient for parallel processing, and can be developed with parallel processing in mind. Generally, in order to use the parallel processing algorithm, a high-performance operation server using a multi-core CPU is introduced and an infrastructure is constructed so that performance can be improved. For higher performance parallelization, a general-purpose GPU computing method using a graphics processor is recently used, and this technique is also used in a supercomputer.

For a model that uses an evolutionary algorithm to construct a cause-and-effect network between genes, the present invention presents how to express a candidate network as a chromosome, how to perform chromosome initialization, natural selection, crossover, and mutation, and how to output the final result when a suitable network solution is found.

According to one aspect of the present invention, there is provided a gene regulation model generation method that generates a gene regulation model using gene regulation networks, each of which can be represented by an N × N square matrix whose (i, j)-th element (0 < i, j ≤ N) is a link value indicating the regulatory relationship from gene i to gene j. The method uses an evolution process that evolves a plurality of gene regulation networks represented by such square matrices, the evolution process including a step of crossing two of the gene regulation networks by exchanging elements of the two square matrices that represent them. The method further uses a seed network generation process comprising: repeating the evolution process a number of times according to a predetermined rule; and generating a seed network by selecting only those link values that appear in a number of the gene regulation networks determined by a predetermined rule.

The method may further use a consensus network generation process comprising: generating a plurality of seed networks by performing the seed network generation process a plurality of times; and generating a consensus network by selecting link values that appear repeatedly in a number of the seed networks determined by a predetermined rule.

According to another aspect of the present invention, there is provided a gene regulation model generation method that generates a gene regulation model using gene regulation networks defined by link values representing the regulatory relationship from gene i to gene j (0 < i, j ≤ N). The method uses an evolution process that evolves a plurality of gene regulation networks, the evolution process including a step of crossing a first gene regulation network and a second gene regulation network by replacing at least some of the link values in the second gene regulation network with the corresponding link values in the first gene regulation network. The method further uses a seed network generation process comprising: repeating the evolution process a number of times according to a predetermined rule; and generating a seed network by selecting only those link values that appear in a number of the gene regulation networks determined by a predetermined rule.

According to another aspect of the present invention, there is provided a gene regulation model generating apparatus that generates a gene regulation model using gene regulation networks, each of which can be represented by an N × N square matrix whose (i, j)-th element (0 < i, j ≤ N) is a link value indicating the regulatory relationship from gene i to gene j. The apparatus is configured to perform an evolution process that evolves a plurality of gene regulation networks represented by such square matrices, the evolution process including a step of crossing two of the gene regulation networks by exchanging elements of the two square matrices that represent them. The apparatus is further configured to perform a seed network generation process comprising: repeating the evolution process a number of times according to a predetermined rule; and generating a seed network by selecting only those link values that appear in a number of the gene regulation networks determined by a predetermined rule.

According to another aspect of the present invention, there is provided a gene regulation model generating apparatus that generates a gene regulation model using gene regulation networks defined by link values indicating the regulatory relationship from gene i to gene j (0 < i, j ≤ N). The apparatus is configured to perform an evolution process that evolves a plurality of gene regulation networks, the evolution process including a step of crossing a first gene regulation network and a second gene regulation network by replacing at least some of the link values in the second gene regulation network with the corresponding link values in the first gene regulation network. The apparatus is further configured to perform a seed network generation process comprising: repeating the evolution process a number of times according to a predetermined rule; and generating a seed network by selecting only those link values that appear in a number of the gene regulation networks determined by a predetermined rule.

According to another aspect of the present invention, there can be provided a gene regulation network evolution method that evolves gene regulation networks using crossover between them. Here, each gene regulation network is represented by an N × N square matrix whose (i, j)-th element is a link value indicating the regulatory relationship from gene i to gene j. The crossover can be performed using a step of copying elements of the square matrix representing the first gene regulation network to the corresponding positions of the square matrix representing the second gene regulation network.

According to the present invention, a global transcriptional regulatory network can be constructed by analyzing the mutual regulatory relationships between genes based on epigenomic data and a transcription factor model. As a result, meaningful information can be extracted from data on differentially expressed genes (DEGs) discovered by microarrays or RNA-seq.

In particular, gene regulation networks can be analyzed efficiently using the data representation and the fitness score proposed in the present invention.

Previous studies could not determine which genes act upstream, because of the complexity of transcriptional regulation networks. To overcome this disadvantage of conventional research, the present invention can provide a network model that integrates large-scale genomic and transcript information, such as the DNA interactions of transcription factors, with the three-dimensional structure of the genome.

A method can also be provided for estimating the change in gene expression caused by a drug, by using the network constructed according to the present invention to analyze how genes respond to various drugs. This can help identify candidate drugs for many diseases.

Furthermore, because the present invention constructs the gene expression regulatory network globally for all genes, drug reactions in the human body can be predicted in advance, and the cost and time of searching for drug candidates are expected to be reduced.

FIG. 1 is a flowchart showing a method of generating a gene regulation model according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of the inference of regulatory relationships between genes sharing the same mutation.
FIG. 3 is a diagram illustrating a logical representation of a gene regulation network according to an embodiment of the present invention.
FIG. 4 is a schematic diagram of the operation of a crossover operator on the logical representation according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings. However, the present invention is not limited to the embodiments described herein, but may be implemented in various other forms. The terminology used herein is for the purpose of understanding the embodiments and is not intended to limit the scope of the present invention. Also, the singular forms as used below include plural forms unless the phrases expressly have the opposite meaning.

<Example 1>

Hereinafter, a method for generating a gene regulation model according to an embodiment of the present invention will be described. Here, the term 'gene regulation model' refers to a gene regulation network in which the genes are nodes and the regulatory relations between the genes are expressed as links.

For this method, an N × N square matrix can be constructed with the N genes under consideration as both rows and columns. The (i, j)-th element of this square matrix (0 < i, j ≤ N) may hold a link value indicating the regulatory relationship from gene i to gene j. A plurality of such square matrices may be generated, each corresponding to one of the gene regulation networks described below.
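The matrix representation can be sketched in a few lines of Python. This is only an illustration: the helper names (`empty_network`, `set_link`, `regulators_of`) are hypothetical, the values +1, -1, and 0 are the example link values the description gives for positive, negative, and no regulation, and the dense list-of-lists layout is an assumption made for readability.

```python
def empty_network(n):
    """Return an n x n matrix of zeros, i.e. a network with no regulation."""
    return [[0 for _ in range(n)] for _ in range(n)]


def set_link(network, i, j, value):
    """Record the link from gene i to gene j (1-based indices, as in the text)."""
    if value not in (-1, 0, 1):
        raise ValueError("link value must be -1, 0, or +1")
    network[i - 1][j - 1] = value


def regulators_of(network, j):
    """List the 1-based indices of genes that have a non-zero link into gene j."""
    return [i + 1 for i, row in enumerate(network) if row[j - 1] != 0]


net = empty_network(4)
set_link(net, 1, 3, +1)  # gene 1 positively regulates gene 3
set_link(net, 2, 3, -1)  # gene 2 negatively regulates gene 3
```

For N above 10,000 a dense matrix has 10^8 elements, so a practical implementation would probably store only the non-zero links; the source does not say how the matrices are stored.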

In the method of generating a gene regulation model using these gene regulation networks, elements may be exchanged between two square matrices that represent two of the plurality of gene regulation networks. The exchange may be one-directional: only one of the two matrices transfers its elements to the other, and the other matrix need not transfer any of its elements back.

The two gene regulation networks are crossed through this exchange. The plurality of gene regulation networks can be evolved by a procedure that includes the crossing step; this procedure may be referred to as an 'evolution process'. In the evolution process, not only these two networks but also other pairs among the plurality of gene regulation networks can be crossed with each other. Each execution of the evolution process can be interpreted as producing a new generation.

In this embodiment, the above-described evolution process can be repeated a number of times according to a predetermined rule. The predetermined rule may be determined by experimental experience or may be determined by an equation.

Then, a new gene regulation network can be generated by selecting, from among the plurality of gene regulation networks, only those link values that appear in a number of the networks determined by a predetermined rule. This new gene regulation network can be referred to as a 'seed network', and the process of generating one seed network can be referred to as a 'seed network generation process'.

The seed network may be generated by writing the selected link values into the corresponding elements of a new N × N matrix.
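One way to read the selection rule is as a per-element vote across the evolved population. The sketch below assumes that reading; the function name `make_seed_network` and the threshold parameter `min_count` are hypothetical, and the source does not state the actual rule.

```python
from collections import Counter


def make_seed_network(networks, min_count):
    """Keep, for each element, the non-zero link value found in at least min_count networks."""
    n = len(networks[0])
    seed = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Count how often each non-zero link value occurs at position (i, j).
            votes = Counter(net[i][j] for net in networks if net[i][j] != 0)
            if votes:
                value, count = votes.most_common(1)[0]
                if count >= min_count:
                    seed[i][j] = value
    return seed


population = [
    [[0, 1], [0, 0]],
    [[0, 1], [-1, 0]],
    [[0, 1], [0, 0]],
]
seed = make_seed_network(population, min_count=2)
```

Choosing `min_count` above half the population size avoids ties between +1 and -1 at the same position.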

The seed network may be used as the gene regulation model described above. However, the seed network may be a local optimal solution of the gene regulation model sought in the present invention. Therefore, the above seed network generation process can be repeated a plurality of times in order to find a solution close to the global optimum solution.

That is, the method of generating a gene regulation model according to an embodiment of the present invention may further include generating a plurality of seed networks by performing the seed network generation process a plurality of times.

Next, another new gene regulation network may be created by selecting link values that appear repeatedly in more than a predetermined number of the generated seed networks. This new gene regulation network may be referred to as a 'consensus network', and the process of creating one consensus network may be referred to as a 'consensus network generation process'.

The consensus network can be finally regarded as a gene regulation model described above. That is, the consensus network can be regarded as the global optimal solution of the gene regulation model sought in the present invention.

FIG. 1 is a flowchart showing a method of generating a gene regulation model according to an embodiment of the present invention.

A method for generating a gene regulation model according to an embodiment of the present invention may use an evolution process (PE) that evolves the plurality of gene regulation networks and includes a step (S120) of crossing some or all of the plurality of gene regulation networks with each other.

Next, the gene regulation model generation method may use a seed network generation process (PS) comprising:

a step (S210) of repeating the evolution process (PE) a number of times according to a predetermined rule; and

a step (S220) of generating a seed network by selecting only those link values that appear in a number of the gene regulation networks determined by a predetermined rule.

Next, the gene regulation model generation method may use a consensus network generation process (PC) comprising:

a step (S310) of generating a plurality of seed networks by performing the seed network generation process (PS) a plurality of times; and

a step (S320) of generating a consensus network by selecting link values that appear repeatedly in a number of the seed networks determined by a predetermined rule.
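The nesting of the three processes (PE inside PS inside PC) can be sketched as a pair of loops. Everything here is illustrative: `generate_model` and its parameters are hypothetical names, the evolution, initialization, and link-selection steps are passed in as functions rather than implemented, and restarting from a fresh population for every seed network is an assumption that the source does not confirm.

```python
def generate_model(init_population, evolve, select_links,
                   generations, n_seeds, seed_min, consensus_min):
    """Run the nested PE / PS / PC loops and return the consensus network."""
    seeds = []
    for _ in range(n_seeds):                     # S310: repeat the seed process PS
        population = init_population()           # S10: initialize the networks
        for _ in range(generations):             # S210: repeat the evolution process PE
            population = evolve(population)      # crossing (S120) and other PE steps
        seeds.append(select_links(population, seed_min))  # S220: one seed network
    return select_links(seeds, consensus_min)    # S320: consensus network
```

The same selection function is applied twice, first across the evolved population and then across the seed networks, which mirrors the way steps S220 and S320 are worded.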

In the above-described embodiment, the link value of the (i, j)-th element may be 0 when gene i does not regulate gene j, a positive value (e.g., +1) when gene i positively regulates gene j, and a negative value (e.g., -1) when gene i negatively regulates gene j.

The crossing step (S120) may include a step (S121) of copying a part of the first square matrix, which represents the first of the two gene regulation networks, to the corresponding positions of the second square matrix, which represents the second gene regulation network.
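A minimal sketch of such a one-directional copy follows. It assumes that the copied part is a set of randomly chosen element positions and that the result is returned as a new matrix; the source specifies neither, and the name `one_way_crossover` is hypothetical.

```python
import random


def one_way_crossover(donor, recipient, n_elements, rng=random):
    """Copy n_elements randomly chosen elements of donor into a copy of recipient."""
    n = len(donor)
    positions = [(i, j) for i in range(n) for j in range(n)]
    child = [row[:] for row in recipient]       # the recipient itself is left unchanged
    for i, j in rng.sample(positions, n_elements):
        child[i][j] = donor[i][j]               # same (i, j) position in both matrices
    return child


donor = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
recipient = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
child = one_way_crossover(donor, recipient, 4, random.Random(0))
```

Only the donor contributes elements, which matches the one-directional exchange described earlier in this example.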

In this case, the evolution process (PE) may further include a step (S110), performed before the crossing step (S120), of classifying each gene regulation network into one of a plurality of groups according to a predetermined rule. The number of elements in the copied part of the first square matrix may vary according to the group to which the first gene regulation network belongs.

Specifically, the classifying step (S110) may comprise: a step (S111) of calculating a fitness score for each gene regulation network based on conditional probabilities obtained from the expression levels of the genes represented by the rows of the N × N square matrix; and a step (S112) of classifying each gene regulation network into one of the plurality of groups according to the calculated fitness score. In the copying step (S121), the number of elements copied from the first square matrix may increase as the fitness score of the first gene regulation network increases.
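The grouping and the fitness-dependent copy size might look like the sketch below. The fitness scores are taken as given numbers, because the conditional-probability scoring itself is not detailed here; the rank-based grouping, the linear mapping from group to copy size, and the names `classify_by_fitness` and `elements_to_copy` are all assumptions made for illustration.

```python
def classify_by_fitness(scores, n_groups):
    """Return a group index per network: 0 is the lowest-scoring group."""
    order = sorted(range(len(scores)), key=lambda k: scores[k])
    groups = [0] * len(scores)
    for rank, k in enumerate(order):
        # Networks are split into n_groups bands of (nearly) equal size by rank.
        groups[k] = rank * n_groups // len(scores)
    return groups


def elements_to_copy(group, n_groups, max_elements):
    """Donors in higher-fitness groups pass on more elements during crossing."""
    return max_elements * (group + 1) // n_groups


scores = [0.2, 0.9, 0.5, 0.7]           # hypothetical fitness scores of four networks
groups = classify_by_fitness(scores, n_groups=2)
```

With these example scores the two best networks fall into the upper group and would pass on twice as many elements as the other two.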

The evolution process (PE) may further comprise a mutation step (S130), performed after the crossing step (S120), of arbitrarily changing the values of at least some elements of at least some of the plurality of gene regulation networks. By randomly changing the links that may exist in a gene regulation network, this step simulates mutation in evolution.
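As a rough illustration, a mutation step could redraw each element with a small probability. The per-element mutation rate `rate` and the uniform choice among -1, 0, and +1 are assumptions; the text says only that the values are changed arbitrarily.

```python
import random


def mutate(network, rate, rng=random):
    """Return a copy in which each element is redrawn with probability rate."""
    return [[rng.choice((-1, 0, 1)) if rng.random() < rate else value
             for value in row]
            for row in network]


net = [[0, 1], [-1, 0]]
mutated = mutate(net, rate=0.05, rng=random.Random(0))
```

A redrawn element can land on its old value, so the effective change rate is somewhat below `rate`.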

The method of generating a gene regulation model may further include a step (S10) of initializing the plurality of gene regulation networks before the evolution process (PE) is performed. The initialization may include a step (S11) of initializing the plurality of gene regulation networks based on a prior probability derived from eQTL, TF-binding, TF-motif, and three-dimensional chromatin interaction structure data. In this case, when the prior probability includes a regulation probability that gene i regulates gene j, the probability that a non-zero value exists in the (i, j)-th element of each matrix representing a gene regulation network may depend on that regulation probability. For example, the probability that a non-zero value exists in the (i, j)-th element of each matrix may be proportional to the magnitude of the regulation probability. A non-zero value here means that positive or negative regulation is considered to be present.
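The prior-driven initialization can be sketched as one random draw per element. The sketch assumes that `prior[i][j]` is already a probability between 0 and 1 (a proportionality constant of 1) and that the sign of a new link is chosen at random; neither detail comes from the source, and `init_network` is a hypothetical name.

```python
import random


def init_network(prior, rng=random):
    """Place a non-zero link at (i, j) with probability prior[i][j]."""
    n = len(prior)
    return [[rng.choice((-1, 1)) if rng.random() < prior[i][j] else 0
             for j in range(n)]
            for i in range(n)]


prior = [[0.0, 1.0],
         [0.0, 0.0]]                   # hypothetical prior: gene 1 surely regulates gene 2
net = init_network(prior, random.Random(0))
```

Calling `init_network` repeatedly with the same prior yields the plurality of starting networks, each a different random sample.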

In addition, the initialization may further include a step (S12) of assigning a link value to arbitrary elements of each gene regulation network based on the Poisson distribution.
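The text does not say how the Poisson distribution enters. One possible reading, shown here purely as an assumption, is that the number of extra random links per network is Poisson-distributed with some mean `lam`. The sampler uses Knuth's multiplication method because the Python standard library has no Poisson generator; the function names are hypothetical.

```python
import math
import random


def poisson(lam, rng=random):
    """Draw a Poisson-distributed integer with mean lam (Knuth's method)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def add_random_links(network, lam, rng=random):
    """Return a copy with a Poisson-distributed number of random elements set to +1 or -1."""
    n = len(network)
    count = min(poisson(lam, rng), n * n)      # cannot exceed the number of elements
    result = [row[:] for row in network]
    for index in rng.sample(range(n * n), count):
        result[index // n][index % n] = rng.choice((-1, 1))
    return result


net = add_random_links([[0, 0], [0, 0]], lam=1.5, rng=random.Random(0))
```

Knuth's method takes time proportional to `lam`, so it is suited only to small means.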

<Example 2>

Hereinafter, a method for generating a gene regulation model according to another embodiment of the present invention will be described.

This method is a gene regulation model generation method that generates a gene regulation model using gene regulation networks defined by link values representing the regulatory relationship from gene i to gene j (0 < i, j ≤ N). The regulatory relationship may be positive regulation, negative regulation, or no regulation. The link value may be +1 for positive regulation, -1 for negative regulation, and 0 for no regulation.

The method may use an evolution process (PE2) that evolves the plurality of gene regulation networks and includes a step (S1120) of crossing a first gene regulation network and a second gene regulation network among the plurality of gene regulation networks. The crossing step (S1120) includes a step (S1121) of replacing at least some of the link values in the second gene regulation network with the corresponding link values in the first gene regulation network.

The method may further use a seed network generation process (PS2) including: repeating (S1210) the evolution process PE2 a number of times determined by a predetermined rule; and generating (S1220) a seed network by selecting the link values that appear in common in a number of gene regulation networks, determined by a predetermined rule, among the plurality of gene regulation networks.

The method may further use a consensus network generation process including: generating (S1310) a plurality of seed networks by performing the seed network generation process (PS2) a plurality of times; and generating (S1320) a consensus network by selecting the link values that appear repeatedly in a number of seed networks, determined by a predetermined rule, among the plurality of seed networks.

The evolution process PE2 may further include a step (S1110) of classifying each of the gene regulation networks into one of a plurality of groups according to a predetermined rule. In this case, the number of link values replaced in step S1121 may vary according to the group to which the first gene regulation network belongs. The first gene regulation network and the second gene regulation network may belong to different groups among the plurality of groups.

The evolution process (PE2) may include calculating (S1111) a fitness score of each gene regulation network based on conditional probabilities obtained from the expression level information of the genes included in each gene regulation network; and classifying (S1112) each of the gene regulation networks into the plurality of groups according to the calculated fitness score.

In the replacement step (S1121), the number of the replaced link values may increase as the fitness score of the first gene control network increases.

The evolution process PE2 may further include a mutation step (S1130) of randomly changing at least some of the link values of at least some of the plurality of gene regulation networks after the crossover step (S1120).

The method of generating a gene regulation model of this embodiment may further include initializing (S1010) the plurality of gene regulation networks before performing the evolution process (PE2). The initialization may include generating (S1011) the plurality of gene regulation networks based on prior probabilities derived from eQTL, TF-binding, TF-motif, and three-dimensional chromatin interaction structure data, and assigning (S1012) a link value to an arbitrary element of each of the gene regulation networks.

<Example 3>

Yet another embodiment of the present invention relates to an apparatus for generating a gene regulation model that performs the gene regulation model generation method according to the first or second embodiment. The gene regulation model generation apparatus may include a processing unit and a storage unit capable of performing the respective steps and processes described in the first or second embodiment.

<Example 4>

Yet another embodiment of the present invention relates to a computer-readable medium readable by a gene regulation model generation apparatus that performs the gene regulation model generation method according to the first or second embodiment. Program code that causes the gene regulation model generation apparatus to perform the steps and processes described in the first or second embodiment may be recorded on the computer-readable medium.

Hereinafter, to facilitate understanding of the embodiments of the present invention, actual experimental examples in which the techniques according to the above-described embodiments are implemented will be described. The specific cells, genes, chromosomes, and numbers given in the following experimental examples are for illustrative purposes only, and the scope of the present invention is not limited thereto. The experimental examples described below are divided into "Generation of integrated data combining prior information", "Implementation of the genetic algorithm", and "Performance comparison of one embodiment of the present invention with other methods".

<Generation of integrated data combining prior information>

In the present invention, genome, transcriptome, and epigenome information can be integrated to construct a gene regulatory network. As a sample for constructing the gene regulation network, genome, transcriptome, and epigenome information of specific cells (e.g., cells derived from breast cancer tissue) can be integrated. This integrated information is used as prior information for building the network model and as data for model learning. Herein, the prior information is referred to as the prior. TF binding data, three-dimensional structure information (chromatin interaction data), and eQTL information can be used for the prior. From this prior, an evolutionary algorithm can be used to extract the most probable network. For this purpose, each network can be evaluated using gene expression data of the specific cells (e.g., cells derived from breast cancer tissue). In the experimental examples of the present invention, breast cancer is specifically targeted. The data used are described below.

1. gene expression data

Gene expression data for breast cancer were obtained from the public database The Cancer Genome Atlas (TCGA, www.cancergenome.nih.gov); the breast invasive carcinoma (BRCA) data were used.

2. TF binding data

The transcription factor (TF) binding data was obtained from the TF binding data of the Encyclopedia of DNA Elements (ENCODE, http://genome.ucsc.edu/ENCODE/) project. This is the result of performing ChIP-seq on a specific TF. For breast cancer-related information, information on 26 TFs in the MCF-7 cell line and binding data for 6 TFs in the T47D cell line were used. In addition, a total of 159 TF binding data for a total of 104 cell lines were used.

In addition to the breast cancer TF binding data, TF binding (motif) prediction was performed on the regulatory region data of breast cancer-derived cell lines (MCF-7, T47D). For the regulatory regions, DNase-seq data, the open chromatin data of the ENCODE project, were used. First, experimentally determined DNA sequence motifs to which TFs bind were obtained in the form of position weight matrices (PWMs) from the TRANSFAC database. These PWM data were matched against the regulatory region sequences of the breast cancer cell lines to estimate the sequence-specific binding regions of TFs on the breast cancer genome. The estimation was performed using the MEME Suite package: the PWM data were converted appropriately with MEME, and TF binding was then predicted in the regulatory regions of the breast cancer cell lines with the FIMO tool of the MEME Suite package. In this way, binding predictions for 378 TFs were completed and used.

3. Three-dimensional information (Chromatin interaction data)

In the present invention, in addition to the conventional TF-binding-based estimation method, the three-dimensional structure of the genome is incorporated into the mathematical model. From data produced by three-dimensional genome structure experiments such as Hi-C, the structure of the DNA can be estimated by measuring the frequency of interactions between DNA regions on the genome, and genes that can influence each other through this physical structure were extracted.

Chromatin interaction data were obtained from the RNA Pol II and CTCF ChIA-PET data for the MCF-7 cell line provided by the ENCODE project. These data capture three-dimensional chromatin interactions centered on the factors responsible for transcription and insulation, respectively. In the system of the present invention, these data were used to identify three-dimensional relationships between promoters and enhancers acting as regulatory regions and, in combination with the eQTL and TF binding data, to annotate actual regulatory interactions.

4. Preliminary information based on eQTL

Data on eQTL of breast cancer were obtained from the data used in the following paper.

Curtis, C. et al. The genomic and transcriptomic architecture of 2,000 breast tumors reveals novel subgroups. Nature 486, 346-352 (2012).

SNVs, CNVs, and CNAs were identified in a total of 1,992 primary tumors, and the results of eQTL mapping between these variants and gene expression were used. An association was regarded as cis when the distance between the variant and the gene was within 3 Mb.

To establish a gene regulatory network of breast cancer, the regulatory interaction and its probability for each gene pair were estimated using genomic and epigenomic data. As the genomic data, the breast cancer eQTL data described above were used. The relationship between genes is inferred from the eQTL mapping results on the assumption that, among genes sharing the same variant, a gene that undergoes the trans-effect is regulated by the gene that undergoes the cis-effect. In the situation shown in FIG. 2, where one cis gene and two trans genes are affected by the same SNP, the two trans genes are assumed to be regulated by the cis gene. A trans gene may also be assumed to regulate another trans gene, but the case where a trans gene regulates the cis gene is not considered.

FIG. 2 is a schematic diagram of the inference of regulatory relationships between genes sharing the same variant.

Based on these assumptions, the regulatory interaction and probability value of each gene pair were calculated for the following four cases. eQTL-based prior information for a total of 10,425 genes was completed.

Case 1: Gene A and gene B share an SNP as an eQTL. If the SNP is cis-acting on gene A and trans-acting on gene B, A is assumed to regulate B with probability p = 1.

Case 2: When the eQTL is a CNV or CNA, n genes undergo a cis-effect from the CNV or CNA, and gene B undergoes a trans-effect, each of the n genes is assumed to regulate B with probability p = 1 / n. This is a special case: when the eQTL is a CNV or CNA, the expression of the cis-affected genes simply changes with the copy number, so it cannot be determined which of these genes exerts the trans-acting regulation.

Case 3: When genes share an SNP as an eQTL but the SNP is only trans-acting, with no cis-acting gene, the eQTL data alone cannot show which gene is the regulator and which is regulated. In this situation the chromatin interaction data are used: a trans-effect gene A that has a chromatin interaction is considered to regulate a gene B that undergoes the trans-effect without a chromatin interaction, and p = 1 is used as the probability value.

Case 4: If, in Case 3, the relationship cannot be inferred from the chromatin interaction data, the relationship between genes sharing the same SNP is estimated probabilistically by counting the number of eQTL SNPs of each gene that are not shared. This is based on the assumption that a gene associated with fewer regulating variants (SNPs) in the eQTL mapping sits higher in the regulatory hierarchy.
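As an illustration of Cases 1 and 2 only, the following minimal Python sketch assigns a probability to each (regulator, target) pair for one shared eQTL variant. The function name `eqtl_prior` and the dictionary layout are assumptions made for this example and are not taken from the implementation described here; Cases 3 and 4 need chromatin interaction data or SNP counts and are not covered.

```python
def eqtl_prior(cis_genes, trans_genes):
    """Return {(regulator, target): probability} for one shared eQTL variant.

    Hypothetical helper: each of the n cis-affected genes is taken to regulate
    every trans-affected gene with probability 1/n. With n = 1 this reproduces
    Case 1 (p = 1); with n > 1 it reproduces Case 2 (p = 1/n).
    """
    n = len(cis_genes)
    prior = {}
    if n == 0:
        # No cis gene: Cases 3 and 4 apply, which this sketch does not model.
        return prior
    for regulator in cis_genes:
        for target in trans_genes:
            if regulator != target:
                prior[(regulator, target)] = 1.0 / n
    return prior


# One cis gene and two trans genes sharing an SNP (the FIG. 2 situation).
print(eqtl_prior(["A"], ["B", "C"]))
```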

5. Construction of the TF-binding-based prior

As mentioned above, TF binding data and chromatin interaction data of the ENCODE project were used as the base data for establishing the breast cancer gene regulatory network. Basically, when a TF binds to the promoter or an enhancer of a gene, the TF is assumed to regulate that gene. The promoter was defined as the region within ±1.5 kb of the TSS of the gene, and open chromatin data and chromatin interaction data were used for enhancer annotation. On this basis, regulatory interactions and probability values were estimated for the following three cases, and TF-binding-based prior data for a total of 12,765 genes and 127 TFs were completed.

Case 1: When a TF (gene A) binds to the promoter of gene B, A is assumed to regulate B with probability p = 1. In addition, when the promoter of B shows promoter-promoter interactions with n genes through chromatin interactions, the TF A is assumed to regulate each of the n genes with p = 1 / n. The reasoning is that, although the TF binds directly only to the promoter of gene B, if the promoters of n other genes are found to be three-dimensionally adjacent to the promoter of B through chromatin interactions, all of them may be regulated by the same TF.

Case 2: When a TF (gene A) binds to an enhancer of gene B, A is assumed to regulate B with probability p = 1. If TF binding data and chromatin interaction data overlap in a region outside the promoter of B (more than 1.5 kb from the TSS), and the other end of the chromatin interaction overlaps the promoter of B, the TF is assumed to bind to an enhancer of B. When promoter-promoter interactions are present, as in Case 1, p = 1 / n is assumed.

Case 3: Cases 1 and 2 use the ENCODE TF binding data of breast cancer-derived cell lines (MCF-7, T47D), but regulatory effects of TFs that were not assayed in those cell lines may be found in experiments on other cell lines. Therefore, all TF binding data available across all cell lines, not only the breast cancer cell lines, can be used. However, because it is not known which of these TF binding events actually occur in breast cancer, the probability values are scaled down so that they are lower than those of Case 1 and Case 2.

6. TF-motif-based prior information

All of the above TF binding data are based on the ChIP-seq experiments of the ENCODE project. In addition, in the system of the present invention, TF binding regions are predicted based on binding motifs, and the same analysis as in item 5 is performed on the result. In other words, the construction of the prior based on TF binding across all cell lines was repeated using the TF motif predictions. This rests on the assumption that, just as data from all cell lines can reveal regulatory effects missing from the breast cancer cell lines, TF-binding-motif-based predictions can detect regulatory effects not revealed in ChIP-seq experiments. Through this analysis, TF-motif-based prior data for a total of 14,552 genes and 473 TFs were completed.

7. Data integration for network model creation

The eQTL-, TF-binding-, TF-motif-, and three-dimensional-structure-based prior information completed in this way was used to build the transcriptional regulatory network. It is finally summarized as a prior information matrix consisting of gene A (regulator), gene B (target), and the prior probability. If the same pair of genes A and B appears in more than one prior data source, the higher probability is used. This is because prior data from different sources are assumed to complement one another, each supplying interactions not found in the others. In this way, final data on 15,749 genes and 473 TFs were completed.

To actually construct a network from the above prior data, learning data (observations) are needed so that machine learning techniques can be applied. In the learning algorithm of the present invention, the learning data were constructed by observing whether the expression of a specific gene affects other genes, using the breast cancer gene expression data mentioned in item 1. The following steps are performed to calculate the conditional probability that the expression of another gene increases when the expression of a specific gene increases.
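The rule of keeping the higher probability when the same regulator-target pair occurs in several sources can be sketched in a few lines of Python. The name `merge_priors` and the representation of each source as a `{(regulator, target): probability}` dictionary are assumptions made for illustration.

```python
def merge_priors(*sources):
    """Merge prior tables, keeping the highest probability for each gene pair.

    Each source is assumed to be a dict {(regulator, target): probability}.
    """
    merged = {}
    for source in sources:
        for pair, p in source.items():
            if pair not in merged or p > merged[pair]:
                merged[pair] = p
    return merged


eqtl = {("A", "B"): 0.5}
tf_binding = {("A", "B"): 1.0, ("C", "D"): 0.25}
print(merge_priors(eqtl, tf_binding))
```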

Step 1: Rank the expression of gene A across the samples.

Step 2: Sort the samples in descending order of gene A expression.

Step 3: Select the sample pairs between which the expression of gene A increases significantly (as judged by the change in rank).

Step 4: Calculate the probability that the expression of another gene B also increases across the sample pairs of Step 3.

Step 5: Perform Steps 1 to 4 for all gene pairs (A, B).

In existing methods, clustering techniques can be used to determine the expression state of genes. That is, the data are divided into three clusters, which are regarded as the three groups of decreased expression, unchanged expression, and increased expression. Typically, k-means clustering or the Partitioning Around Medoids (PAM) algorithm, which is more robust to noise, can be used.

The integrated prior data and the learning data cover different sets of genes. Therefore, for construction of the network model, a new data set consisting only of the genes common to the prior data and the learning data was built and used. The data actually used covered 13,047 genes, including 436 TFs.
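One possible reading of Steps 1 to 4 for a single gene pair is sketched below in Python. The text does not state how a "significant" increase is decided, so the `min_rank_gap` threshold is an assumption of this example, as is the function name. The samples are sorted in ascending rather than descending order, which selects the same sample pairs.

```python
def conditional_increase_probability(expr_a, expr_b, min_rank_gap):
    """Estimate P(B increases | A increases significantly) for one gene pair.

    expr_a, expr_b: expression values of genes A and B, one entry per sample.
    min_rank_gap: assumed threshold; a sample pair counts as a significant
    increase of A when the two samples are at least this many ranks apart.
    """
    # Steps 1-2: rank the samples by the expression of gene A.
    order = sorted(range(len(expr_a)), key=lambda s: expr_a[s])
    b_sorted = [expr_b[s] for s in order]
    hits = 0
    total = 0
    for low in range(len(order)):
        # Step 3: pairs (low, high) whose A-rank differs by at least the gap.
        for high in range(low + min_rank_gap, len(order)):
            total += 1
            # Step 4: does the expression of B increase over the same pair?
            if b_sorted[high] > b_sorted[low]:
                hits += 1
    return hits / total if total else None
```

Step 5 would call this function for every ordered pair (A, B). The quadratic loop over sample pairs is kept for readability and is not meant to scale to all 13,047 x 13,046 pairs.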

<Implementation of genetic algorithm>

1. Genetic Algorithm Overview

In the present invention, a genetic algorithm is used to learn the network model. Existing methodologies construct a Bayesian network based on MCMC, but because of performance problems they are limited in constructing networks of more than 1,000 genes. A genetic algorithm can show greater search capability than MCMC, which is based on a greedy algorithm. A network describing the relationships among N genes has N * (N-1) possible links, and finding the optimal combination of these links is known to be NP-hard. Therefore, the aim was to find a solution close to the optimum using, for example, 1,000 networks obtained from the genetic algorithm.

The overall sequence of the genetic algorithm is as follows.

Step 1. Create the initial population of gene regulation networks

a. Add links based on the prior probabilities

b. Add randomly generated links based on the Poisson distribution

Step 2. Run the evolution loop

Step a. Evaluate each chromosome (network) with the fitness score function

Step b. Classify the chromosomes into four groups according to their fitness scores

Step c. Cross over chromosomes based on their group identity

Step d. Mutate chromosomes according to the mutation rate defined for each group

Step e. Repeat steps a to d for the predefined maximum number of generations (stopping criterion)

Step 3. Combine the chromosomes and output the optimal network

Step 4. Repeat steps 1 to 3 to generate a predefined number (m) of different networks

Step 5. Combine the m networks

2. Data representation

The genetic algorithm according to the present invention aims to find the most likely gene regulation network given the gene expression results and all possible gene regulation relationships. Candidate solutions, that is, possible networks, can be logically represented as an N * N two-dimensional matrix as shown in FIG. 3. If there is a regulatory relationship from gene A to gene B, that is, a link A->B, the element at the intersection of the A-th row and the B-th column of the matrix is 1. Whether the regulation from A to B is positive or negative is also considered: for a negative relationship, that is, repression (A-|B), the element at the A-th row and the B-th column is -1. Therefore, every cell of the matrix can take one of three values. For example, for the 13,047 genes with gene expression data, the number of possible regulatory links is 13,047 x 13,046, and each can have a value of 1, 0, or -1, so the size of the solution space is 3^(13,047 x 13,046).

FIG. 3 is a diagram illustrating the logical representation of a gene regulation network according to an embodiment of the present invention.

To reduce the execution time of the algorithm, the actual physical implementation can use N one-dimensional arrays of dynamic length. In other words, if there is a positive regulatory relationship from A to B, the A-th array contains an element (B, +).
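To give a feel for the size of this search space, the short Python sketch below counts the possible links for N = 13,047 and expresses 3^(number of links) as a power of ten. The function name is an assumption for this example. For N = 13,047 there are 170,211,162 possible links, and the number of candidate networks has roughly 8.1 x 10^7 decimal digits.

```python
import math


def search_space_log10(n_genes):
    """Return (number of possible links, log10 of the number of networks)."""
    n_links = n_genes * (n_genes - 1)      # self-regulation is excluded
    return n_links, n_links * math.log10(3)  # each link is 1, 0, or -1


n_links, log10_size = search_space_log10(13047)
print(n_links, log10_size)
```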
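A sparse layout of this kind could look like the following Python sketch. The class name `SparseNetwork` and its methods are illustrative assumptions. A dictionary per regulator is used here in place of a dynamic array so that a link can be looked up directly; a list of (target, sign) tuples would match the description more literally.

```python
class SparseNetwork:
    """Gene regulation network stored as one dynamic container per regulator.

    Hypothetical sketch: targets[a] maps each target b of regulator a to its
    sign (+1 for activation, -1 for repression). Absent entries mean 0.
    """

    def __init__(self, n_genes):
        self.n_genes = n_genes
        self.targets = [dict() for _ in range(n_genes)]

    def set_link(self, a, b, sign):
        if a == b:
            raise ValueError("self-regulation is excluded (N * (N-1) links)")
        if sign not in (1, -1):
            raise ValueError("sign must be +1 or -1")
        self.targets[a][b] = sign

    def remove_link(self, a, b):
        self.targets[a].pop(b, None)

    def link(self, a, b):
        """Return +1, -1, or 0, as in the logical N * N matrix."""
        return self.targets[a].get(b, 0)

    def n_links(self):
        return sum(len(t) for t in self.targets)


net = SparseNetwork(3)
net.set_link(0, 1, +1)   # gene 0 activates gene 1
net.set_link(1, 2, -1)   # gene 1 represses gene 2
```

Memory use then grows with the number of links actually present rather than with N * N.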

3. Genetic algorithm initialization

In the genetic algorithm, one candidate solution, that is, one N * N matrix, is called a chromosome. Evolution starts after an initial population of 128 chromosomes is generated. The initial population was generated from the prior information introduced in Section 1. If the prior contains a probability that gene A regulates gene B and this value is 1, each chromosome of the initial population has the link A->B with a probability of 30%. That is, about 38 of the 128 chromosomes contain the A->B link. In addition, because a link that is present in the prior with only a small probability value might not be included in any of the 128 chromosomes, a correction was applied so that such a link exists in at least four chromosomes of the initial population. The prior information alone does not indicate whether a regulatory relationship between genes is positive or negative. Therefore, each candidate link in the prior appears in about 4 to 38 chromosomes of the initial population, half as positive links and half as negative links.
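The initialization described above can be sketched in Python as follows. This is a sketch under stated assumptions: the inclusion probability is taken to scale linearly with the prior probability (30% at p = 1), the sign is drawn at random so that positive and negative links are balanced only on average, and the additional random links drawn from a Poisson distribution are omitted. The function and parameter names are hypothetical.

```python
import random


def init_population(prior, pop_size=128, max_rate=0.30, min_copies=4, rng=None):
    """Build an initial population from {(regulator, target): probability}.

    Each chromosome is a dict {(regulator, target): sign} with sign +1 or -1.
    """
    rng = rng or random.Random()
    population = [dict() for _ in range(pop_size)]
    for pair, p in prior.items():
        # Include the link in each chromosome with probability 0.30 * p.
        chosen = [k for k in range(pop_size) if rng.random() < max_rate * p]
        # Correction: every prior link must exist in at least min_copies.
        if len(chosen) < min_copies:
            rest = [k for k in range(pop_size) if k not in chosen]
            chosen += rng.sample(rest, min_copies - len(chosen))
        for k in chosen:
            # The prior does not give the sign, so it is drawn at random.
            population[k][pair] = rng.choice((1, -1))
    return population


prior = {("A", "B"): 1.0, ("C", "D"): 0.01}
population = init_population(prior, rng=random.Random(0))
```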

4. Chromosome evaluation procedure

Genetic algorithms are based on survival of the fittest and define the degree of adaptation to the environment as the fitness score of a chromosome. The fitness score is obtained from a probability table derived from gene expression level information. This probability table is composed of conditional probabilities. For example, P(B|A) is the conditional probability that gene B is over-expressed given that gene A is over-expressed. The fitness score of each link is calculated from the probability table. When gene A has no effect on gene B, the fitness score is 0, and when gene B is activated in every case in which gene A is activated, the score is 100 points. In probabilistic terms, if the expression of gene A has no effect on gene B, P(B|A) equals P(B). Therefore, the score is 100 points if P(B|A) = 1 and 0 points if P(B|A) = P(B). Inhibition of expression was scored in a similar way. The formulas are as follows.

Positive regulation score from A to B:

$$\mathrm{PRS}[A][B] = P(B \text{ is over-expressed} \mid A \text{ is over-expressed}) - P(B \text{ is over-expressed}) + P(B \text{ is under-expressed} \mid A \text{ is under-expressed}) - P(B \text{ is under-expressed})$$

Negative regulation score from A to B:

$$\mathrm{NRS}[A][B] = P(B \text{ is under-expressed} \mid A \text{ is over-expressed}) - P(B \text{ is under-expressed}) + P(B \text{ is over-expressed} \mid A \text{ is under-expressed}) - P(B \text{ is over-expressed})$$

Overall regulation score from A to B:

$$\mathrm{ORS}[A][B] = \mathrm{PRS}[A][B] + \mathrm{NRS}[A][B]$$

If a link E predicts that gene A regulates gene B positively:

$$\mathrm{Score}(E) = \mathrm{ORS}[A][B] + \mathrm{PRS}[A][B]$$

If a link E predicts that gene A regulates gene B negatively:

$$\mathrm{Score}(E) = \mathrm{ORS}[A][B] + \mathrm{NRS}[A][B]$$
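The score definitions above can be evaluated directly from per-sample over- and under-expression calls, as in the Python sketch below. Estimating the conditional probabilities by simple counting over samples is an assumption of this example, as are the function names; the scores are left on the raw probability scale and are not rescaled to points.

```python
def prob(event):
    """P(event), estimated as the fraction of samples where it holds."""
    return sum(event) / len(event)


def cond_prob(event, given):
    """P(event | given), estimated from paired per-sample booleans."""
    n_given = sum(given)
    if n_given == 0:
        return 0.0  # assumed convention when the condition never occurs
    return sum(1 for e, g in zip(event, given) if e and g) / n_given


def link_scores(a_over, a_under, b_over, b_under):
    """Return PRS, NRS, ORS and both link scores for the pair (A, B)."""
    prs = (cond_prob(b_over, a_over) - prob(b_over)
           + cond_prob(b_under, a_under) - prob(b_under))
    nrs = (cond_prob(b_under, a_over) - prob(b_under)
           + cond_prob(b_over, a_under) - prob(b_over))
    ors = prs + nrs
    return {"PRS": prs, "NRS": nrs, "ORS": ors,
            "score_positive": ors + prs, "score_negative": ors + nrs}


# Four samples in which B always follows A.
a_over = [True, True, False, False]
a_under = [False, False, True, True]
scores = link_scores(a_over, a_under, a_over, a_under)
```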

For example, to verify the present invention, the expression levels of 13,047 genes measured simultaneously in 1,385 samples were used, and the fitness score of each link was calculated for all 13,047 x 13,046 cases. Because each link has one score for activation and one for repression, there is a pair of score tables. These tables are referred to as the landscape.

The fitness score of a chromosome is calculated using this landscape. A chromosome represents one candidate solution. The scores of all links present on the chromosome (referred to here as 1-hop links) and of all gene pairs connected through two hops (2-hop links) are summed, and the sum is multiplied by a correction value determined by the number of 1-hop links to give the final fitness score. For example, if the link A->B exists in the network, A->B is a 1-hop link. If the links A->B and B->C exist at the same time, C can be reached from A in two hops, so A->C is a 2-hop link. A 2-hop link may have multiple paths, in which case the fitness scores of all paths are added. The final score is obtained by multiplying this sum by the correction value.

The fitness score of chromosome C (C score)

Figure 112015043908936-pat00001

The correction value used for the final score is determined by the number of links. It is necessary because the score would otherwise be biased by the number of links: with simple addition and no correction, a larger number of links is always advantageous, so the population evolves toward ever more links. The correction value is therefore set so that the number of links evolves to an appropriate level. It was not derived analytically but was obtained empirically through repeated experiments.
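A minimal Python sketch of such a chromosome score follows. Because the exact formula and the correction value are not reproduced in the text, several points are assumptions of this example: the sign of a 2-hop link is taken as the product of the signs of its two links, each path contributes the landscape score of its end-to-end pair once, and the correction is passed in as an arbitrary function of the number of 1-hop links.

```python
def chromosome_fitness(links, landscape, correction):
    """Score one chromosome from its 1-hop and 2-hop links.

    links:      {(a, b): sign} with sign +1 or -1 (the 1-hop links)
    landscape:  {(a, b, sign): score} per-link fitness table
    correction: function mapping the number of 1-hop links to a multiplier
                (hypothetical; the real value was tuned empirically)
    """
    outgoing = {}
    for (a, b), sign in links.items():
        outgoing.setdefault(a, []).append((b, sign))

    total = 0.0
    for (a, b), sign in links.items():
        total += landscape.get((a, b, sign), 0.0)            # 1-hop link
        for c, sign2 in outgoing.get(b, []):
            if c != a:                                       # skip A->B->A
                # Assumed sign of the 2-hop link: product of both signs.
                total += landscape.get((a, c, sign * sign2), 0.0)
    return total * correction(len(links))


links = {("A", "B"): 1, ("B", "C"): -1}
landscape = {("A", "B", 1): 2.0, ("B", "C", -1): 3.0, ("A", "C", -1): 1.5}
fitness = chromosome_fitness(links, landscape, lambda n: 1.0 / n)
```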

5. Evolution Operator

In the genetic algorithm, all chromosomes of the population are ranked from 1st to 128th according to their fitness scores. The chromosomes are then divided into four groups: 'elite', 'good', 'normal', and 'bad'. The algorithm is designed so that the 'elite' group passes 150% of its genes to the next generation, the 'good' and 'normal' groups 100%, and the 'bad' group 50%.

First, one chromosome is selected from the 'elite' group and one from the 'bad' group, they are crossed, and the parent 'bad' chromosome is replaced with the offspring. The 'elite' chromosome thus passes 50% of its genes to the offspring and itself survives into the next generation, delivering a total of 150% of its genes. The 'bad' chromosome passes on only 50% of its genes, through the offspring. Next, one chromosome is selected from the 'good' group and one from the 'normal' group, and both parents are replaced with their two offspring. Each parent in these two groups passes 50% of its genes to each of the two offspring, thus preserving a total of 100% of its genes.
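The grouping and pairing can be sketched as follows in Python. Two points are assumptions of this example and are not stated in the text: the four groups are taken to be equal-sized quartiles of the ranking, and every chromosome is paired exactly once per generation. The function names are hypothetical.

```python
import random

GROUPS = ("elite", "good", "normal", "bad")


def classify(fitness):
    """Return {group: [chromosome indices]}, best fitness first, in quartiles."""
    ranked = sorted(range(len(fitness)), key=lambda k: fitness[k], reverse=True)
    size = len(ranked) // 4
    groups = {g: ranked[i * size:(i + 1) * size] for i, g in enumerate(GROUPS[:3])}
    groups["bad"] = ranked[3 * size:]
    return groups


def mating_pairs(groups, rng=None):
    """Pair elite with bad chromosomes, and good with normal chromosomes."""
    rng = rng or random.Random()

    def shuffled(indices):
        indices = list(indices)
        rng.shuffle(indices)
        return indices

    # Elite x bad: the single offspring replaces the bad parent.
    elite_bad = list(zip(shuffled(groups["elite"]), shuffled(groups["bad"])))
    # Good x normal: two offspring replace both parents.
    good_normal = list(zip(shuffled(groups["good"]), shuffled(groups["normal"])))
    return elite_bad, good_normal


groups = classify([8, 7, 6, 5, 4, 3, 2, 1])
elite_bad, good_normal = mating_pairs(groups, random.Random(0))
```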

The crossover operation between two chromosomes is performed by exchanging blocks at specific positions of the two two-dimensional matrices. First, two numbers a and b between 1 and 13,047 are chosen at random, and the a-th through b-th columns of the two matrices are exchanged. Next, two numbers c and d between 1 and 13,047 are chosen at random, and the c-th through d-th columns of the two matrices are exchanged. FIG. 4 is a schematic diagram of the operation of the crossover operator on the logical representation according to an embodiment of the present invention.

After crossover, each chromosome is mutated. The 'elite' group is not mutated, and the 'good', 'normal', and 'bad' groups are mutated at 0.1%, 0.2%, and 0.3% of the average number of links, respectively. There are three types of mutation: removing a link, creating a new random link, and changing the sign of a link. A sign change turns an activation link (+) into a repression link, and vice versa.

After mutation, each chromosome is evaluated and ranked, and the groups are formed anew. This completes one generation of the genetic algorithm.
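On a sparse representation, exchanging a range of columns amounts to exchanging every link whose target gene falls in that range. The Python sketch below shows this for networks stored as `{(regulator, target): sign}` dictionaries with genes numbered from 1; that storage format and the function names are assumptions of this example.

```python
import random


def swap_columns(net1, net2, lo, hi):
    """Exchange all links whose target column lies in [lo, hi]."""
    moved1 = {k: s for k, s in net1.items() if lo <= k[1] <= hi}
    moved2 = {k: s for k, s in net2.items() if lo <= k[1] <= hi}
    child1 = {k: s for k, s in net1.items() if k not in moved1}
    child2 = {k: s for k, s in net2.items() if k not in moved2}
    child1.update(moved2)
    child2.update(moved1)
    return child1, child2


def crossover(net1, net2, n_genes, rng=None):
    """Apply two random column-range exchanges: (a, b) and then (c, d)."""
    rng = rng or random.Random()
    for _ in range(2):
        lo, hi = sorted(rng.randint(1, n_genes) for _ in range(2))
        net1, net2 = swap_columns(net1, net2, lo, hi)
    return net1, net2


parent1 = {(1, 2): 1, (1, 5): -1}
parent2 = {(3, 2): -1}
child1, child2 = swap_columns(parent1, parent2, 2, 3)
```

The total number of links across the two networks is unchanged by the exchange; only their assignment to the two offspring changes.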
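The group-dependent mutation can be sketched in Python as shown below. The number of mutations is taken as the group rate times the average number of links, and the three mutation types are chosen with equal probability; both readings, as well as the names used, are assumptions of this example.

```python
import random

# Mutation rates per group, as fractions of the average number of links.
MUTATION_RATE = {"elite": 0.0, "good": 0.001, "normal": 0.002, "bad": 0.003}


def mutate(net, group, n_genes, avg_links, rng=None):
    """Return a mutated copy of net ({(regulator, target): sign})."""
    rng = rng or random.Random()
    net = dict(net)  # leave the input chromosome untouched
    n_mutations = round(MUTATION_RATE[group] * avg_links)
    for _ in range(n_mutations):
        kind = rng.choice(("remove", "add", "flip"))
        if kind == "add" or not net:
            # Two distinct genes, so no self-regulation link is created.
            a, b = rng.sample(range(1, n_genes + 1), 2)
            net[(a, b)] = rng.choice((1, -1))
        else:
            key = rng.choice(list(net))
            if kind == "remove":
                del net[key]
            else:
                net[key] = -net[key]  # activation <-> repression
    return net


original = {(1, 2): 1}
mutated = mutate(original, "bad", 10, 1000, random.Random(0))
```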

6. Genetic algorithm termination and output

The genetic algorithm was terminated after 20,000 generations, and only links that appeared in more than 40% of the population were output. The termination parameter of 20,000 generations was derived empirically from repeated experiments. Observation of the progress of the genetic algorithm showed that only about 1% of links changed between generation 10,000 and generation 20,000. Moreover, when the algorithm was run for up to 100,000 generations, fewer than 0.1% of links changed after generation 20,000. The evolution of the population is thus almost complete at about 20,000 generations, by which point most chromosomes have converged to a locally optimal solution.

Because the genetic algorithm is stochastic, the converged network is only a locally optimal solution. Therefore, to obtain a more stable solution, several different network solutions are obtained by running the genetic algorithm repeatedly, and a consensus network is constructed by integrating them. The genetic algorithm was run repeatedly to obtain 1,000 seed networks. The consensus network was built from the links that appeared in more than 30% of the seed networks, and this was output as the final result.
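Both thresholding steps, more than 40% of the population for a seed network and more than 30% of the seed networks for the consensus network, follow the same pattern, sketched below in Python. Counting a link together with its sign, and keeping the more frequent sign if both pass the threshold, are assumptions of this example, as is the function name.

```python
from collections import Counter


def frequent_links(networks, min_fraction):
    """Keep links present in more than min_fraction of the networks.

    networks: list of {(regulator, target): sign} dictionaries.
    """
    counts = Counter()
    for net in networks:
        counts.update(net.items())  # counts ((regulator, target), sign)
    cutoff = min_fraction * len(networks)
    result = {}
    for (link, sign), count in counts.most_common():
        if count > cutoff:
            result.setdefault(link, sign)  # most frequent sign wins
    return result


population = [
    {("A", "B"): 1},
    {("A", "B"): 1, ("C", "D"): -1},
    {("A", "B"): -1},
]
seed = frequent_links(population, 0.40)
```

A consensus network would then be obtained by calling the same function on the list of seed networks with a threshold of 0.30.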

<Performance comparison of one embodiment of the present invention with other methods>

According to an embodiment of the present invention, the gene expression information can be learned through the above-described genetic algorithm using the constructed prior data. The resulting network may be referred to as a posterior network.

In one embodiment of the present invention, for 3,328,575 lines of prior information and 13,047 genes, the genetic algorithm generated 210,000 links in the initial population. As the chromosomes evolve, the number of links decreases, leaving about 130,000 links on average. The execution time of the algorithm depends on the prior information, and the time required to evaluate the chromosomes was dominant. Measured as CPU time on an i5-3570 CPU (3.40 GHz), evolving 128 chromosomes for 20,000 generations took about 292,300 seconds.

For comparison, the MCMC algorithm was run under the same conditions. To verify the present invention, only three types of information were used: TF binding, eQTL, and gene expression information. However, six types of data can be integrated by further adding metabolic data, protein-metabolite interactions, and protein-protein interactions.

Generating the Bayesian network with the MCMC algorithm takes 61,350 seconds. On the surface, the genetic algorithm takes about 5 times longer, but because it evolves 128 chromosomes at once, it has about 24 times the search capacity per unit time.

This comparison assumes the same CPU. In the present invention, because an evolutionary algorithm is well suited to parallel processing, software that runs on a GPU with about 2,500 cores was additionally created, which gave a speed-up of more than 10 times. With existing methodologies it was difficult to learn models for more than 10,000 genes, but with the new methodology a network model can be constructed without a large computing facility such as a supercomputer. It is expected that a gene regulatory network for a new cell line can be constructed easily on a laboratory PC, or that a separate gene regulatory network can be generated for each cell when cell-specific mutations are likely, as in cancer.

As described above, the genetic algorithm of the present invention shows better solution search performance than the existing MCMC. In addition, a comparative experiment was conducted to check whether its results are consistent with those of the existing MCMC methodology. The genetic algorithm and MCMC were given the same 3,328,575 lines of prior information, the same 13,047 genes, and the same gene expression data as input. Because the genetic algorithm evolves 128 chromosomes in a single run, it was assumed that about 100 MCMC runs and 1 genetic algorithm run would yield comparable results, and the network derived from 10 genetic algorithm runs was compared with the network derived from 1,000 MCMC runs.

A direct comparison of the results was difficult, because the networks derived by MCMC and by the genetic algorithm differ greatly. In addition, because MCMC is based on a Bayesian network, loops are eliminated, whereas the genetic algorithm prohibits only self-loops and allows all other loops, so the shapes of the networks also differ. Therefore, MCMC was run 10 times using the networks derived from the 10 genetic algorithm runs as initial networks, that is, seeds. When these results were compared with the 1,000 MCMC results, 90% or more of the links in the network matched.

Further, software can be implemented that analyzes changes in response to various drugs through the regulatory network constructed according to the present invention. After obtaining the differentially expressed genes (DEGs) for each drug, the group of genes whose expression changed most can be isolated, and the most important regulatory factors responsible for those changes can be identified. To determine the top regulatory factors, the degree of connectivity in the regulatory network was quantified as a connectivity score. This connectivity score can be used to estimate the regulatory factors (such as TFs) involved in the response to the drug. A TF here refers to a protein, or the gene encoding it, that regulates the expression of other genes. Which genes are TFs may already be known.

The present invention may also be used in a method of calculating a connectivity score between a transcription factor (TF) and a drug. The group of genes whose expression is greatly altered by a drug is called the response genes or 'drug response genes', and the group of genes regulated by a TF is called its downstream genes. The connectivity score between a drug and a TF is then calculated as the degree of overlap between the response genes and the downstream genes. The detailed formula is given below, where ∧ is the logical AND operator. In the present invention, the TF with the highest connectivity score can be referred to as the 'causal gene of the drug response'.

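Connectivity score between TF A and drug D (Figure 112015043908936-pat00002):

$$\text{Connectivity score}(A, D) = \frac{\left|\{\, x \mid x \text{ responds to drug } D \;\wedge\; x \text{ is a downstream gene regulated by TF } A \,\}\right|}{\left|\{\, x \mid x \text{ responds to drug } D \,\}\right|}$$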

The numerator in the above formula is the number of genes x that both respond to drug D and are downstream genes regulated by TF A. The denominator is the number of genes x that respond to drug D.
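The ratio described above can be written as a short function. The sketch below assumes that both gene groups are available as Python sets; the function name `connectivity_score` and the gene identifiers are hypothetical, and returning 0.0 for a drug with no response genes is a choice made for the example.

```python
def connectivity_score(drug_response_genes, downstream_genes):
    """Share of the drug response genes that are also downstream genes of the TF."""
    responsive = set(drug_response_genes)
    if not responsive:
        # No response genes: the ratio is undefined, so 0.0 is returned by assumption.
        return 0.0
    return len(responsive & set(downstream_genes)) / len(responsive)


# Toy data with hypothetical gene identifiers.
response_to_drug_d = {"g1", "g2", "g3", "g4"}
downstream_of_tf_a = {"g1", "g2", "g3", "g7"}

# Three of the four response genes are regulated by TF A.
score = connectivity_score(response_to_drug_d, downstream_of_tf_a)
```

Because the denominator counts only the response genes, the score is not symmetric: a TF with a very large downstream set can reach a high score for many drugs.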

To verify the utility of this invention, a gene regulatory network was constructed for breast cancer samples, and a study was then conducted to understand the changes that occur in these samples in response to anticancer drugs. From existing data, gene expression data measured after treating MCF7 breast cancer cells with 150 small-molecule drugs (including anticancer drugs) were used.

Using the present invention, a TF-by-small-molecule connectivity score matrix can be constructed by calculating, for each TF, the connectivity score that indicates how strongly the TF is connected through the regulatory network to the DEG group obtained when each drug is applied. In this matrix, the x-axis can represent the drugs and the y-axis the TFs. Estrogens and their receptor agonists/antagonists showed high connectivity with ESR1, and anticancer agents were connected to luminal cancer regulators and cancer driver genes. The links between the predicted core regulatory genes (TFs) and the anticancer drugs (small molecules) obtained through the constructed regulatory network therefore agree well with known results.
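The construction of such a matrix, and the lookup of the top-scoring TF for each drug, can be sketched as follows. The nested-dictionary layout, the function names `connectivity_matrix` and `top_tf_per_drug`, and the toy gene sets are assumptions made for illustration and are not the MCF7 data.

```python
def connectivity_matrix(downstream_by_tf, deg_by_drug):
    """Return matrix[tf][drug] = |DEG(drug) & downstream(tf)| / |DEG(drug)|."""
    matrix = {}
    for tf, downstream in downstream_by_tf.items():
        matrix[tf] = {
            drug: (len(degs & downstream) / len(degs) if degs else 0.0)
            for drug, degs in deg_by_drug.items()
        }
    return matrix


def top_tf_per_drug(matrix):
    """For each drug, pick the TF with the highest connectivity score."""
    if not matrix:
        return {}
    drugs = next(iter(matrix.values())).keys()
    return {drug: max(matrix, key=lambda tf: matrix[tf][drug]) for drug in drugs}


# Hypothetical downstream gene sets per TF and DEG sets per drug.
downstream_by_tf = {"TF_A": {"g1", "g2", "g3"}, "TF_B": {"g3", "g4"}}
deg_by_drug = {"drug_1": {"g1", "g2"}, "drug_2": {"g3", "g4", "g5"}}

matrix = connectivity_matrix(downstream_by_tf, deg_by_drug)
best = top_tf_per_drug(matrix)
```

In this toy case `best` maps `drug_1` to `TF_A` and `drug_2` to `TF_B`. Ties are resolved by dictionary order, which a real analysis would probably want to handle explicitly.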

The regulatory network constructed in the present invention can help determine treatments, such as which drug to administer, in diseases such as cancer. Cancer is caused by mutations in genes, and because it is genetically very complex, there are many variants of the same cancer. In this case, by identifying the causative gene using differentially expressed genes and the regulatory network, it is possible to determine which known cancer cell type a given cancer cell resembles, so that a similar therapeutic method can be applied.

To investigate the effect of the present invention, the present inventors used the constructed regulatory network to identify a causative gene for each known cancer cell type. Using the differentially expressed genes that are characteristic of cancer patients together with the regulatory network constructed in the present invention, the most probable causative gene can be identified for each cancer patient subclass. The identified data can be used to explain, in part, why certain types of cancer patients respond favorably to certain anticancer drugs but not to others. For example, if analysis of the genes specifically expressed in a cancer patient gives a result similar to that of Luminal A, it is highly likely that a treatment for Luminal A can be applied to that cancer tissue.

It will be apparent to those skilled in the art that various modifications and variations can be made in the present invention without departing from the essential characteristics thereof. The contents of each claim in the claims may be combined with other claims without departing from the scope of the claims.

Claims (22)

A method for generating a gene regulatory model, in which a gene regulatory model generation device uses a gene regulatory network that can be represented by an N × N square matrix having, as its (i, j)-th element, a link value representing the regulatory relationship from gene i to gene j (0 < i, j ≤ N), the method comprising:
performing, by the gene regulatory model generation device, an evolution process that evolves a plurality of gene regulatory networks represented by such square matrices, the evolution process including crossing two gene regulatory networks among the plurality of gene regulatory networks by exchanging elements of the two square matrices representing the two gene regulatory networks; and
generating, by the gene regulatory model generation device, a seed network by repeating the evolution process a number of times according to a predetermined rule and then selecting, according to a predetermined rule, only link values that appear in multiple gene regulatory networks among the plurality of gene regulatory networks.
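For illustration, the crossing step recited above (exchanging elements of two square matrices) might look like the following minimal sketch. The per-element swap probability `swap_prob`, the function name `crossover`, and the list-of-lists matrix representation are assumptions; the source does not specify how the exchanged elements are chosen.

```python
import random


def crossover(net_a, net_b, swap_prob=0.5, rng=None):
    """Exchange elements at the same (i, j) positions of two N x N link matrices.

    Each position is swapped independently with probability swap_prob
    (an assumed rule). The parents are left unchanged.
    """
    rng = rng or random.Random()
    child_a = [row[:] for row in net_a]
    child_b = [row[:] for row in net_b]
    for i in range(len(net_a)):
        for j in range(len(net_a)):
            if rng.random() < swap_prob:
                child_a[i][j], child_b[i][j] = child_b[i][j], child_a[i][j]
    return child_a, child_b


# Toy 2 x 2 parents: 0 = no link, +1 = positive link, -1 = negative link.
parent_a = [[0, 1], [-1, 0]]
parent_b = [[0, 0], [1, 1]]
child_a, child_b = crossover(parent_a, parent_b, rng=random.Random(1))
```

Whatever positions are swapped, each (i, j) position of the two children holds the same pair of values as the same position of the two parents, so no link value is created or lost by the exchange.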
The method according to claim 1, further comprising:
generating, by the gene regulatory model generation device, a plurality of seed networks by performing the seed network generation process a plurality of times; and
generating, by the gene regulatory model generation device, a consensus network by selecting, according to a predetermined rule, link values that appear repeatedly in multiple seed networks among the plurality of seed networks.
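One possible form of the selection rule for link values that appear repeatedly across networks is a simple vote, sketched below. The threshold parameter `min_count` and the function name `consensus_network` are assumptions standing in for the "predetermined rule", which the source does not spell out.

```python
from collections import Counter


def consensus_network(networks, min_count):
    """Keep a non-zero link value at (i, j) only if at least min_count networks agree on it."""
    n = len(networks[0])
    consensus = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            votes = Counter(net[i][j] for net in networks if net[i][j] != 0)
            if votes:
                value, count = votes.most_common(1)[0]
                if count >= min_count:
                    consensus[i][j] = value
    return consensus


# Three toy seed networks (2 x 2). The link 0 -> 1 is positive in all three,
# while the link 1 -> 0 is contested (negative in one, positive in another).
seeds = [
    [[0, 1], [0, 0]],
    [[0, 1], [-1, 0]],
    [[0, 1], [1, 0]],
]
consensus = consensus_network(seeds, min_count=2)
```

With `min_count=2`, only the link from gene 0 to gene 1 survives. The same voting idea could be applied one level down, when a seed network is selected from the evolved population.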
The method according to claim 1,
wherein the evolution process further comprises calculating a fitness score of each of the gene regulatory networks based on conditional probabilities obtained from information on the expression levels of the genes represented by the respective rows of the N × N square matrix, and
wherein a gene regulatory network with a higher fitness score is given more chances of the crossing.
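Giving fitter networks more chances of crossing is commonly implemented as fitness-proportional (roulette-wheel) selection. The sketch below uses that technique as an assumed stand-in; the source does not say which selection scheme is used, and the function name `select_parents` and the non-negative fitness values are hypothetical.

```python
import random


def select_parents(fitness_scores, rng=None):
    """Pick two distinct network indices with probability proportional to fitness.

    Assumes non-negative scores with at least two positive entries.
    """
    rng = rng or random.Random()
    indices = list(range(len(fitness_scores)))
    first = rng.choices(indices, weights=fitness_scores, k=1)[0]
    remaining = [i for i in indices if i != first]
    remaining_weights = [fitness_scores[i] for i in remaining]
    second = rng.choices(remaining, weights=remaining_weights, k=1)[0]
    return first, second


# Hypothetical fitness scores for four networks; networks 0 and 2 can never be chosen.
scores = [0.0, 5.0, 0.0, 1.0]
pair = select_parents(scores, rng=random.Random(0))
```

If the fitness is a log-probability, and therefore negative, it would have to be shifted or exponentiated before being used as a weight.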
The method according to claim 3,
wherein the number of elements of the first square matrix representing the first gene regulatory network among the two gene regulatory networks increases as the fitness score of the first gene regulatory network increases.
The method according to claim 1,
wherein the (i, j)-th element is determined by the regulation probability.
The method according to claim 1, further comprising initializing, by the gene regulatory model generation device, the plurality of gene regulatory networks before performing the evolution process,
wherein the initializing comprises assigning a link value to an arbitrary element of each of the gene regulatory networks based on a Poisson distribution.
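A Poisson-based initialization could be sketched as follows. How the Poisson distribution enters is not detailed in the source, so this example assumes that the number of regulators of each target gene is Poisson-distributed, that signs are chosen at random, and that self-links are excluded. The names `sample_poisson`, `init_network`, and `mean_regulators` are hypothetical.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) sample with Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def init_network(n_genes, mean_regulators=2.0, rng=None):
    """Create an N x N link matrix; net[i][j] is the link value from gene i to gene j."""
    rng = rng or random.Random()
    net = [[0] * n_genes for _ in range(n_genes)]
    for target in range(n_genes):
        candidates = [g for g in range(n_genes) if g != target]  # no self-links (assumption)
        k = min(sample_poisson(mean_regulators, rng), len(candidates))
        for regulator in rng.sample(candidates, k):
            net[regulator][target] = rng.choice((1, -1))
    return net


network = init_network(6, rng=random.Random(0))
```

A population would be initialized by calling `init_network` once per chromosome, for example 128 times for a single run.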
The method according to claim 2, further comprising, in order to identify the gene responsible for the drug response to a specific drug:
identifying, by the gene regulatory model generation device, a set of genes that respond to the specific drug among the genes included in the consensus network;
calculating, for each of one or more transcriptional regulatory genes involved in transcriptional regulation among the genes included in the consensus network, a connectivity score between the specific drug and the transcriptional regulatory gene; and
finding, by the gene regulatory model generation device, the transcriptional regulatory gene having the highest connectivity score among the one or more transcriptional regulatory genes.
The method according to claim 7,
wherein the connectivity score for a transcriptional regulatory gene is calculated as the number of genes that both respond to the specific drug and are regulated by the transcriptional regulatory gene, divided by the number of genes that respond to the specific drug.
A method for generating a gene regulatory model, in which a gene regulatory model generation device uses a gene regulatory network defined by link values representing the regulatory relationship from gene i to gene j (0 < i, j ≤ N), the method comprising:
performing, by the gene regulatory model generation device, an evolution process that evolves a plurality of gene regulatory networks, the evolution process including crossing a first gene regulatory network and a second gene regulatory network among the plurality of gene regulatory networks by replacing at least some of the link values included in the second gene regulatory network with the corresponding link values included in the first gene regulatory network; and
generating, by the gene regulatory model generation device, a seed network by repeating the evolution process a number of times according to a predetermined rule and then selecting, according to a predetermined rule, only link values that appear in multiple gene regulatory networks among the plurality of gene regulatory networks.
The method according to claim 9, further comprising:
generating, by the gene regulatory model generation device, a plurality of seed networks by performing the seed network generation process a plurality of times; and
generating, by the gene regulatory model generation device, a consensus network by selecting, according to a predetermined rule, link values that appear repeatedly in multiple seed networks among the plurality of seed networks.
The method according to claim 9,
wherein the evolution process further comprises classifying each of the gene regulatory networks into one of a plurality of groups according to a predetermined rule, and
wherein the number of link values replaced varies depending on the group to which the first gene regulatory network belongs.
The method according to claim 11, wherein the first gene regulatory network and the second gene regulatory network belong to different groups among the plurality of groups.
The method according to claim 9,
wherein the link value
has a value of 0 when the gene i does not regulate the gene j,
has a positive value when the gene i positively regulates the gene j, and
has a negative value when the gene i negatively regulates the gene j.
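The sign convention for link values recited above can be illustrated with a small decoder that turns a link matrix into a readable list of regulatory links. The function name `describe_links` and the gene names are hypothetical.

```python
def describe_links(net, gene_names):
    """List (regulator, target, kind) for every non-zero link value in the matrix.

    net[i][j] > 0 means gene i positively regulates gene j, net[i][j] < 0 means
    negative regulation, and 0 means gene i does not regulate gene j.
    """
    links = []
    for i, row in enumerate(net):
        for j, value in enumerate(row):
            if value > 0:
                links.append((gene_names[i], gene_names[j], "positive"))
            elif value < 0:
                links.append((gene_names[i], gene_names[j], "negative"))
    return links


# Toy network: A positively regulates B, and B negatively regulates C.
toy_net = [
    [0, 1, 0],
    [0, 0, -1],
    [0, 0, 0],
]
links = describe_links(toy_net, ["A", "B", "C"])
```

Only the sign is interpreted here; whether the magnitude of a link value carries additional meaning is not stated in this claim.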
The method according to claim 9,
wherein the evolution process further comprises calculating a fitness score of each of the gene regulatory networks based on conditional probabilities obtained from information on the expression levels of the genes included in each gene regulatory network.
The method according to claim 14, wherein the fitness score is calculated using a sum of link scores for the links included in the gene regulatory network, the link scores including a link score for a one-hop link and a link score for a two-hop link.
The method according to claim 15, wherein, among the link scores, the link score for the link from the gene i to the gene j is calculated from a positive regulation score and a negative regulation score from the gene i to the gene j.
The method according to claim 9, further comprising initializing, by the gene regulatory model generation device, the plurality of gene regulatory networks before performing the evolution process,
wherein the initializing comprises generating the plurality of gene regulatory networks based on prior probabilities derived from eQTL, TF binding, TF motifs, and the three-dimensional chromatin interaction structure.
The method according to claim 10, further comprising:
identifying, by the gene regulatory model generation device, a set of genes that respond to a specific drug among the genes included in the consensus network;
calculating, for each of one or more transcriptional regulatory genes involved in transcriptional regulation among the genes included in the consensus network, a connectivity score between the specific drug and the transcriptional regulatory gene; and
finding, by the gene regulatory model generation device, the transcriptional regulatory gene having the highest connectivity score among the one or more transcriptional regulatory genes.
A gene regulatory model generation device that generates a gene regulatory model using a gene regulatory network that can be represented by an N × N square matrix having, as its (i, j)-th element, a link value representing the regulatory relationship from gene i to gene j (0 < i, j ≤ N), the device being configured to:
perform an evolution process that evolves a plurality of gene regulatory networks represented by such square matrices, the evolution process including crossing two gene regulatory networks among the plurality of gene regulatory networks by exchanging elements of the two square matrices representing the two gene regulatory networks; and
generate a seed network by repeating the evolution process a number of times according to a predetermined rule and then selecting, according to a predetermined rule, only link values that appear in multiple gene regulatory networks among the plurality of gene regulatory networks.
A gene regulatory model generation device that generates a gene regulatory model using a gene regulatory network defined by link values representing the regulatory relationship from gene i to gene j (0 < i, j ≤ N), the device being configured to:
perform an evolution process that evolves a plurality of gene regulatory networks, the evolution process including crossing a first gene regulatory network and a second gene regulatory network among the plurality of gene regulatory networks by replacing at least some of the link values included in the second gene regulatory network with the corresponding link values included in the first gene regulatory network; and
generate a seed network by repeating the evolution process a number of times according to a predetermined rule and then selecting, according to a predetermined rule, only link values that appear in multiple gene regulatory networks among the plurality of gene regulatory networks.
A gene regulatory network evolution method for evolving gene regulatory networks using crossing between gene regulatory networks,
wherein a gene regulatory model generation device represents each gene regulatory network as an N × N square matrix having, as its (i, j)-th element, a link value indicating the regulatory relationship from gene i to gene j, and
wherein the crossing is performed by copying an element of the square matrix representing a first gene regulatory network to the corresponding location of a second gene regulatory network.
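The copy-based crossing recited above differs from a symmetric exchange in that only the second network is modified. A minimal sketch, assuming the copied positions are chosen uniformly at random and that the number of copied elements `n_copy` is supplied by the caller (for example, made larger for fitter donors); the function name `copy_crossover` is hypothetical:

```python
import random


def copy_crossover(donor, recipient, n_copy, rng=None):
    """Copy n_copy randomly chosen elements of the donor matrix into the same
    positions of a copy of the recipient matrix. Both inputs stay unchanged."""
    rng = rng or random.Random()
    n = len(donor)
    positions = [(i, j) for i in range(n) for j in range(n)]
    child = [row[:] for row in recipient]
    for i, j in rng.sample(positions, min(n_copy, len(positions))):
        child[i][j] = donor[i][j]
    return child


# Toy 2 x 2 example: the donor has a link everywhere, the recipient has none.
donor = [[1, 1], [1, 1]]
recipient = [[0, 0], [0, 0]]
child = copy_crossover(donor, recipient, n_copy=3, rng=random.Random(0))
```

Enumerating every (i, j) position is only practical for toy sizes. For more than 10,000 genes, the positions would have to be sampled without materializing the full list.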

delete
KR1020150063824A 2015-05-07 2015-05-07 Algorithm for the construction of a regulatory network for more than 10,000 genes and method for the identification of causal genes in drug responses using the same algorithm KR101810527B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020150063824A KR101810527B1 (en) 2015-05-07 2015-05-07 Algorithm for the construction of a regulatory network for more than 10,000 genes and method for the identification of causal genes in drug responses using the same algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020150063824A KR101810527B1 (en) 2015-05-07 2015-05-07 Algorithm for the construction of a regulatory network for more than 10,000 genes and method for the identification of causal genes in drug responses using the same algorithm

Publications (2)

Publication Number Publication Date
KR20160132223A KR20160132223A (en) 2016-11-17
KR101810527B1 true KR101810527B1 (en) 2017-12-21

Family

ID=57542343

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020150063824A KR101810527B1 (en) 2015-05-07 2015-05-07 Algorithm for the construction of a regulatory network for more than 10,000 genes and method for the identification of causal genes in drug responses using the same algorithm

Country Status (1)

Country Link
KR (1) KR101810527B1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102228701B1 (en) * 2018-09-11 2021-03-16 가천대학교 산학협력단 Method and system for knowledge-based evaluation of dependency differentiation
CN112992267B (en) * 2021-04-13 2024-02-09 中国人民解放军军事科学院军事医学研究院 Single-cell transcription factor regulation network prediction method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
H. Iba, 'Inference of a gene regulatory network by means of interactive evolutionary computing' Information Sciences 145 (2002) 225-236

Also Published As

Publication number Publication date
KR20160132223A (en) 2016-11-17

Legal Events

Date Code Title Description
A201 Request for examination
E902 Notification of reason for refusal
E701 Decision to grant or registration of patent right