KR100463840B1

KR100463840B1 - An automatic template generation method for constructing protein interaction networks

Info

Publication number: KR100463840B1
Application number: KR20020082340A
Authority: KR
Inventors: 최재훈; 박수준; 박선희
Original assignee: 한국전자통신연구원
Priority date: 2002-12-23
Filing date: 2002-12-23
Publication date: 2004-12-29
Also published as: KR20040055893A

Abstract

PURPOSE: A method for automatically generating a template for constructing a protein interaction network is provided, which reduces network construction cost by automatically generating a basic template for the interaction network of a target protein from the interaction networks of similar proteins present in other species. CONSTITUTION: Proteins with a function similar to that of the target protein are searched for in multiple species (S202). Interaction networks of the retrieved proteins are generated from a database of known interaction relationships (S204). Similarity relationships are established between the proteins in the generated interaction networks (S206). A network template for the target protein is generated by merging the protein nodes of the different networks between which similarity relationships have been established (S207).

Description

Automatic template generation for building protein interaction networks {AN AUTOMATIC TEMPLATE GENERATION METHOD FOR CONSTRUCTING PROTEIN INTERACTION NETWORKS}

The present invention relates to techniques for constructing protein interaction networks in bioinformatics, and more particularly to a method for automatically generating a template for constructing a protein interaction network, suited to building the interaction network of a specific protein in one species by making use of the interaction networks of functionally similar proteins in other species.

In general, protein interaction networks serve as important information for elucidating protein function from a global perspective. That is, the function of a protein whose role has not yet been identified can be inferred from the other proteins it interacts with in the network. In addition, the effects that propagate through an organism when the function of a specific protein is inhibited or activated can be predicted.

Such protein interaction networks are therefore very important for selecting target proteins in drug development. To activate or inhibit the function of a specific protein, one can search for the proteins that interact with it, select from among them a protein that has already been characterized in detail, and develop a substance that modulates its function. Such substances can be developed into pharmaceuticals of very high added value. Developing high-value-added drug substances therefore requires systematically constructing interaction networks for a vast number of proteins.

In prior work, networks representing the interaction relationships of a specific protein have been constructed by the following methods.

First, the yeast two-hybrid method uses the properties of a particular gene (the Gal4 gene) to reveal an interaction between two proteins. The DNA-binding domain of the Gal4 gene and a gene expressing the target protein are inserted into one vector and expressed. The activation domain of the Gal4 gene and the gene of a candidate protein are inserted into another vector and expressed. If the target protein and the candidate protein interact, the target protein binds to the DNA and the activation domain attached to the candidate protein drives expression of the reporter gene. However, this method requires an experiment for every one of a vast number of proteins and is therefore costly.

Second, microarray data analysis uses the expression patterns of genes under specific conditions. It assumes that two genes interact if their expression patterns depend on each other. For example, if the expression level of gene G2 increases or decreases when that of gene G1 increases, G1 and G2 are said to have an activating or inhibiting interaction. However, when two genes do not actually interact under the given conditions but their expression patterns happen to be correlated, this method cannot reliably establish whether the two proteins interact.

Third, protein structure analysis extracts motifs from protein sequences and represents those regions in three dimensions. Whether two proteins interact is then determined by analyzing the binding relationships between known motifs and their three-dimensional structures. Motif binding depends strongly on the three-dimensional structure of the protein. However, predicting the three-dimensional structure of a protein is technically very difficult.

Fourth, text mining applies various language processing techniques to extract the passages of the biological literature that describe interactions between proteins and builds a database from them. However, this method does not discover new protein interactions; it is used only to extract already-known interactions and compile them into a database.

The present invention has been devised to solve these problems of the prior art. Its object is to provide a method for automatically generating a template for constructing a protein interaction network that reduces the cost of network construction by automatically generating a basic template for the interaction network of a target protein from the interaction networks of similar proteins present in other species.

To achieve this object, the present invention provides a method for automatically generating a template for constructing a protein interaction network, the method comprising: searching other species for proteins whose function is similar to that of a target protein; generating interaction networks for the retrieved proteins from a database of known interaction relationships; establishing similarity relationships between the proteins in the generated interaction networks; and generating a network template for the target protein by merging the protein nodes of the different networks between which similarity relationships have been established.

FIG. 1 is a hardware configuration diagram of a system for automatically generating a template for constructing a protein interaction network according to the present invention;

FIG. 2 is an overall flowchart of automatic template generation according to a preferred embodiment of the present invention;

FIG. 3 is a flowchart of a sub-process of FIG. 2 that searches for similar proteins which are functionally similar to the target protein but exist in other species;

FIG. 4 is a flowchart of a sub-process of FIG. 2 that generates interaction networks for the retrieved proteins from an existing interaction relationship database;

FIG. 5 is a flowchart of a sub-process of FIG. 2 that establishes similarity relationships between similar protein nodes in the generated networks; and

FIG. 6 is a flowchart of a sub-process of FIG. 2 that merges the networks with established similarity relationships into a single network template.

<Explanation of reference numerals for the main parts of the drawings>

100: main memory 101: central processing unit

102: input/output unit 103: protein DB

104: interaction relationship DB 105: template generation unit

106: system bus

Hereinafter, a preferred embodiment of the present invention is described in detail with reference to the accompanying drawings.

The present invention is characterized by automatically generating a basic template for the as-yet-unknown interaction network of a protein of a particular species from a database of known protein interaction relationships across many species. By allowing interaction networks for a vast number of proteins to be built incrementally from such basic templates with a global view, it offers a way to markedly reduce the cost of constructing interaction networks. To this end, it provides a method that uses a database of known protein interaction relationships to automatically generate a basic template of the interaction network for a target protein whose network is unknown.

According to the present invention, a basic template of the interaction network for target proteins of a particular species is generated using a database of interaction relationships among the vast number of proteins present in many species.

To this end, the present invention provides a method for searching for similar proteins of other species that perform a function similar to that of the target protein.

It also provides a method for generating interaction networks for the retrieved similar proteins of other species from an existing interaction relationship database.

It also provides a method for establishing similarity relationships between protein nodes in different networks.

It also provides a method for generating a template for the target protein by merging the networks with established similarity relationships into a single network.

FIG. 1 is a hardware configuration diagram of a system for automatically generating a template for constructing a protein interaction network according to the present invention. The system comprises a main memory 100, a central processing unit 101, an input/output unit 102, a protein DB 103, an interaction relationship DB 104, a template generation unit 105, and a system bus 106.

As shown in FIG. 1, the main memory 100 is loaded with the template generation system according to the present invention and with the information from the protein DB 103 and the interaction relationship DB 104 required at each step. For example, "SWISS-PROT" may be used for the protein DB 103, and "KEGG" or "INTERACT" for the interaction relationship DB 104.

The central processing unit 101 executes the template generation system loaded in the main memory 100 step by step. The input/output unit 102 receives the information the system needs from the user and displays the content of the automatically generated network template on the screen. Messages and information are exchanged between the devices over the system bus 106.

FIG. 2 is a flowchart of a process for automatically generating a basic template of the interaction network for a target protein present in one species, according to a preferred embodiment of the present invention.

First, in step S201, the names of the target protein and the species for which a network is to be constructed are received. In step S202, similar proteins that perform a function similar to that of the target protein but exist in other species are retrieved from the protein DB 103, using the species name, the protein name, and the protein sequence.

In step S204, networks for the retrieved similar proteins of other species are generated from the interaction relationship DB 104. The interaction relationship DB 104 records interaction relationships between proteins that have been verified by various bioinformatics techniques and biological experiments.

In step S206, similarity relationships are established between the protein nodes of one network and those of another.

In step S207, protein nodes linked by similarity relationships across the networks are merged into virtual network nodes, thereby generating an interaction network template for the target protein.

The detailed steps of the method for automatically generating a template for constructing a protein interaction network according to the present invention are described below with reference to FIGS. 3 to 6.

(A) Searching for similar proteins in other species

There are many ways to search for proteins of other species that perform a function very similar to that of a specific protein of one species. The search method of the present invention uses the protein's name and sequence to retrieve such proteins from a database. FIG. 3 shows the search process step by step.

As shown in FIG. 3, in step S301 the target protein name (P.name) and the species name (P.organism) are received. The interaction network of the target protein is assumed to be unknown. Proteins that perform the same function are often given the same name even in different species. A representative example is the protein AGRN (agrin), which has the same name in mouse and human.

Accordingly, in step S302, the protein DB 103 is searched for proteins P_i that have the same name as the target protein P but belong to a different species; that is, all proteins P_i (i = 1, ..., n) for which P_i.name = P.name and P_i.organism ≠ P.organism.
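By way of illustration only, the name-based search of step S302 might be sketched as follows. The `Protein` record, the `search_by_name` function, and the in-memory list standing in for the protein DB 103 are hypothetical names introduced for this sketch; the species entries are toy data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Protein:
    """Hypothetical record with the fields the description calls P.name, P.organism, P.seq."""
    name: str
    organism: str
    seq: str = ""


def search_by_name(target, protein_db):
    """Step S302: proteins with the target's name that belong to a different species."""
    return [p for p in protein_db
            if p.name == target.name and p.organism != target.organism]


# Toy stand-in for the protein DB 103.
protein_db = [
    Protein("AGRN", "Homo sapiens"),
    Protein("AGRN", "Mus musculus"),
    Protein("AGRN", "Rattus norvegicus"),
    Protein("SEPT2", "Mus musculus"),
]
target = Protein("AGRN", "Homo sapiens")
hits = search_by_name(target, protein_db)
print([p.organism for p in hits])
```

The target's own entry is excluded by the species test, so only the mouse and rat entries named AGRN remain.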

However, because there is no standardized method for naming proteins, a protein in another species whose function is very similar to that of the target protein may carry a different name. For example, the human (Homo sapiens) protein NEDD5 (neural precursor cell expressed, developmentally down-regulated 5) and the mouse (Mus musculus) protein SEPT2 (septin 5) have different names but exactly the same function. Two proteins with very similar functions generally have nearly identical protein sequences; the two proteins above, which perform the same function in vivo, likewise have the same sequence.

Accordingly, in step S303, the protein DB 103 is searched for proteins of other species whose names differ from that of the target protein P but whose sequences match it very closely; that is, all proteins P_j (j = 1, ..., m) for which P_j.seq ≈ P.seq and P_j.organism ≠ P.organism. Sequence similarity is evaluated with a local alignment algorithm that uses a scoring matrix, such as BLAST. The sequence similarity of two proteins is computed as a value between 0 and 1, and two proteins with a similarity of 0.9 or higher are generally known to perform very similar functions.
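A minimal sketch of the sequence-based search of step S303 is given below under loudly stated assumptions. It does not run BLAST or any local alignment with a scoring matrix; Python's `difflib.SequenceMatcher.ratio()` is swapped in purely as a stand-in that also yields a value between 0 and 1. The `Protein` record, the function names, and the 20-residue sequences are hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Protein:
    name: str
    organism: str
    seq: str


def seq_similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; NOT a BLAST local-alignment score."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def search_by_sequence(target, protein_db, threshold=0.9):
    """Step S303: other-species proteins with a different name but a similar sequence."""
    hits = []
    for p in protein_db:
        if p.organism == target.organism or p.name == target.name:
            continue  # same species, or already covered by the name search
        score = seq_similarity(target.seq, p.seq)
        if score >= threshold:
            hits.append((p, score))
    return hits


# Made-up sequences, for illustration only.
target = Protein("NEDD5", "Homo sapiens", "MKTAYIAKQRQISFVKSHFS")
protein_db = [
    Protein("SEPT2", "Mus musculus", "MKTAYIAKQRQISFVKSHFT"),  # one residue differs
    Protein("OTHER", "Mus musculus", "GGGGGGGGGGGGGGGGGGGG"),  # unrelated
    Protein("COPY", "Homo sapiens", "MKTAYIAKQRQISFVKSHFS"),   # same species, skipped
]
hits = search_by_sequence(target, protein_db)
print([(p.name, round(score, 2)) for p, score in hits])
```

In practice the stand-in scorer would be replaced by a normalized alignment score; only the 0.9 threshold and the species and name filters are taken from the description.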

In step S304, when several proteins similar to the target protein have been retrieved for a single species, the similar proteins are selected according to the following rules. First, if proteins of a species were retrieved both by name and by sequence, only those retrieved by name are selected. Second, if no protein of a species was retrieved by name but several were retrieved by sequence, the protein whose sequence matches the target protein most closely is selected as the similar protein. Third, because interaction networks must be generated for the similar proteins in the next step, only those retrieved proteins whose interaction networks are already known are finally selected.
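The three selection rules of step S304 might be sketched as follows. This is one possible reading, in which the rules are applied in the order given, so a species whose best sequence match has no known network is dropped rather than replaced by its second-best match. The function name, the tuple layouts, and the protein identifiers are hypothetical.

```python
from collections import defaultdict


def select_similar(name_hits, seq_hits, has_network):
    """Apply the three rules of step S304.

    name_hits:   list of (organism, protein_id) found by name (S302)
    seq_hits:    list of (organism, protein_id, score) found by sequence (S303)
    has_network: set of protein ids whose interaction network is already known
    """
    by_name = defaultdict(list)
    for organism, pid in name_hits:
        by_name[organism].append(pid)

    best_by_seq = {}
    for organism, pid, score in seq_hits:
        if organism in by_name:
            continue  # rule 1: name hits take precedence for this species
        if organism not in best_by_seq or score > best_by_seq[organism][1]:
            best_by_seq[organism] = (pid, score)  # rule 2: keep the best match

    candidates = [pid for pids in by_name.values() for pid in pids]
    candidates += [pid for pid, _ in best_by_seq.values()]
    # Rule 3: keep only proteins with a known interaction network.
    return [pid for pid in candidates if pid in has_network]


name_hits = [("mouse", "m_AGRN")]
seq_hits = [("mouse", "m_X", 0.95), ("rat", "r_A", 0.92),
            ("rat", "r_B", 0.97), ("fly", "f_A", 0.93)]
selected = select_similar(name_hits, seq_hits, has_network={"m_AGRN", "r_A", "r_B"})
print(selected)
```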

(B) Generating networks from interaction relationships

The interaction networks of the proteins of other species that are similar to the target protein have already been partly or fully identified and are stored in the interaction relationship DB 104. The present invention extracts the relationships involving the similar proteins retrieved in the previous process and generates an interaction network for each similar protein by breadth-first search, using a single first-in, first-out queue. FIG. 4 shows how the interaction network that grows out of a single protein is generated.

As shown in FIG. 4, in step S401 one protein P among the similar proteins of other species retrieved in step S202 of FIG. 2 is received. A network for this protein is generated from the interaction relationship DB 104.

In step S402, a visit flag is set on P so that P is not visited again.

In step S403, the proteins P_i (i = 1, ..., n) that interact directly with P are extracted from the interaction relationship DB 104.

In step S404, the network is built up incrementally by connecting P to each retrieved protein P_i with an interaction relationship iLink.

In step S405, every protein P_k (1 ≤ k ≤ n) among the proteins P_i (i = 1, ..., n) that has not yet been visited is inserted into the queue, to be used in the breadth-first search.

In step S406, a visit flag is set on each protein P_k inserted into the queue so that it is not visited again.

In step S407, the process ends if the queue is empty. Otherwise, in step S408 a protein P is taken from the queue and steps S403 to S407 are repeated. When the process ends, the interaction network for the protein received in step S401 is complete.

Every protein retrieved in step S202 of FIG. 2 is given an interaction network through this process.
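The breadth-first procedure of steps S401 to S408 can be sketched as below. The `interactions` dictionary is a hypothetical in-memory stand-in for the interaction relationship DB, and each iLink is represented, as an assumption of this sketch, by an unordered pair of protein names.

```python
from collections import deque


def build_network(start, interactions):
    """Grow the interaction network of `start` breadth-first.

    interactions: dict mapping a protein to its direct interaction partners.
    Returns (nodes, edges), where each edge is a frozenset of two names (an iLink).
    """
    visited = {start}                                # S402: flag the start protein
    queue = deque()
    edges = set()
    current = start
    while True:
        for partner in interactions.get(current, ()):    # S403: direct partners
            edges.add(frozenset((current, partner)))     # S404: connect with an iLink
            if partner not in visited:
                queue.append(partner)                    # S405: enqueue unvisited
                visited.add(partner)                     # S406: flag as visited
        if not queue:                                    # S407: stop when queue is empty
            break
        current = queue.popleft()                        # S408: next protein (FIFO)
    return visited, edges


interactions = {
    "A": ["B", "C"], "B": ["A", "D"], "C": ["A"], "D": ["B"],
    "E": ["F"], "F": ["E"],   # separate component, never reached from "A"
}
nodes, edges = build_network("A", interactions)
print(sorted(nodes), len(edges))
```

Because proteins are flagged when they are enqueued rather than when they are dequeued, no protein enters the queue twice, which matches the order of steps S405 and S406.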

(C) Establishing similarity relationships between interaction networks

The template for the interaction network of the target protein is expressed in a merged form that encompasses the interaction networks of the similar proteins of other species. FIG. 5 shows how the similarity relationships "hLink" are established between protein nodes in the interaction networks generated in FIG. 4 for the proteins similar to the target protein.

As shown in FIG. 5, in step S501 the n interaction networks N⁽¹⁾, N⁽²⁾, ..., N⁽ⁿ⁾ generated in step S204 of FIG. 2 from the similar proteins P⁽¹⁾, P⁽²⁾, ..., P⁽ⁿ⁾ of the target protein are received.

In step S502, two of these networks, N⁽ˢ⁾ and N⁽ᵏ⁾, are selected in order to establish similarity relationships between the proteins in them. Establishing similarity relationships among all n networks therefore involves n(n-1)/2 pairs in total. Steps S503 and S504 are performed for each pair.
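As a brief illustration of the pair count in step S502, the unordered pairs of networks can be enumerated with `itertools.combinations`; the network labels below are placeholders.

```python
from itertools import combinations

networks = ["N1", "N2", "N3", "N4"]   # placeholder labels for n = 4 networks
pairs = list(combinations(networks, 2))

n = len(networks)
print(len(pairs), n * (n - 1) // 2)   # both counts agree
print(pairs)
```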

First, in step S503, among all protein nodes of the two networks from different species, a similarity relationship hLink is established between nodes P⁽ˢ⁾_i and P⁽ᵏ⁾_j that have the same name. The similarity of such an hLink is set to 1.0.

In step S504, hLinks are established on the basis of sequence for all protein nodes that were not linked by name. An hLink between two protein nodes is established only when their sequence similarity is 0.9 or higher, and its similarity value is the sequence similarity.
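For one pair of networks, steps S503 and S504 might be sketched as follows. Each network is reduced here to a hypothetical dictionary from node name to sequence, and `difflib.SequenceMatcher.ratio()` is again only a stand-in for an alignment-based similarity, not the scoring the description refers to. The sequences are made up.

```python
from difflib import SequenceMatcher


def set_hlinks(net_s, net_k, threshold=0.9):
    """Return {(node_in_s, node_in_k): similarity} for one pair of networks.

    net_s, net_k: dict mapping a protein node name to its sequence.
    """
    hlinks = {}
    shared = net_s.keys() & net_k.keys()
    for name in shared:                       # S503: same name, similarity 1.0
        hlinks[(name, name)] = 1.0
    for a, seq_a in net_s.items():            # S504: remaining nodes, by sequence
        if a in shared:
            continue
        for b, seq_b in net_k.items():
            if b in shared:
                continue
            sim = SequenceMatcher(None, seq_a, seq_b, autojunk=False).ratio()
            if sim >= threshold:
                hlinks[(a, b)] = sim
    return hlinks


net_s = {"AGRN": "MKTAYIAKQR", "NEDD5": "MKTAYIAKQRQISFVKSHFS"}
net_k = {"AGRN": "MKTAYIAKQK", "SEPT2": "MKTAYIAKQRQISFVKSHFT", "OTHER": "GGGGGGGGGG"}
hlinks = set_hlinks(net_s, net_k)
print(hlinks)
```

Name matches are given similarity 1.0 regardless of sequence, as in step S503; only nodes left unlinked by name are compared by sequence.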

(D) Generating the interaction network template

This process receives the networks of the proteins of other species that are similar to the target protein and merges similar protein nodes linked by an hLink into a single virtual node, thereby generating the interaction network template for the target protein. FIG. 6 shows how the template network is generated by traversing the interaction relationships iLink of each network breadth-first while merging similar protein nodes linked by an hLink into a single virtual protein node.

As shown in FIG. 6, in step S601 the networks N⁽¹⁾, N⁽²⁾, ..., N⁽ⁿ⁾, in which hLinks have been established between similar protein nodes of different species, are received.

In step S602, the network template NT is initialized to the first network N⁽¹⁾.

In step S603, steps S604 to S610 are repeated for each of the networks N⁽²⁾ to N⁽ⁿ⁾, merging each network into the template NT.

First, in step S604, an initial protein node P to visit is set in N⁽ⁱ⁾ for the breadth-first search.

In step S605, all protein nodes P_j (j = 1, ..., s) that have an interaction relationship iLink with P in N⁽ⁱ⁾ are extracted.

In step S606, only those of the protein nodes extracted in step S605 that are being visited for the first time are inserted into the queue for traversal, and in step S607 they are marked as visited.

In step S608, it is determined whether the current protein node P has an hLink to a protein node P_NT of the network template. If it does, the process proceeds to step S609 to merge the two protein nodes; otherwise it proceeds to step S610.

In step S609, the information of the two protein nodes P and P_NT is merged into the single protein node P_NT. That is, every interaction relationship connected to P is changed into a relationship with P_NT. The node name and sequence information of P are also added to those of P_NT, after which node P is deleted.

In step S610, it is determined whether the queue is empty. If it is not, the process proceeds to step S611, where one protein node is taken from the queue, and steps S605 to S610 are repeated until the queue is empty.

If step S610 determines that the queue is empty, the process proceeds to step S612 and the generated network template NT is output.
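The merge of one network N⁽ⁱ⁾ into the template NT (steps S604 to S611) might be sketched as below, with several simplifications that are assumptions of this sketch rather than part of the description: networks are adjacency dictionaries, node names are unique across species (for example by a species prefix), and the hLinks are pre-resolved into a dictionary from a node of N⁽ⁱ⁾ to its similar template node. Instead of adding node P and then deleting it, the sketch redirects P's iLinks to P_NT as they are encountered.

```python
from collections import deque


def merge_network(template, aliases, network, start, hlink):
    """Merge `network` into `template` in place, breadth-first from `start`.

    template, network: dict mapping a node to the set of its iLink neighbours.
    aliases: dict mapping a template node to the set of names merged into it.
    hlink:   dict mapping a node of `network` to its similar template node.
    """
    def rep(p):
        return hlink.get(p, p)            # template node that stands for p

    visited = {start}                     # S604: initial node
    queue = deque([start])
    while queue:                          # S610: loop until the queue is empty
        p = queue.popleft()               # S611: take one node from the queue
        p_nt = rep(p)
        template.setdefault(p_nt, set())
        if p_nt != p:                     # S608/S609: hLink found, merge names
            aliases.setdefault(p_nt, {p_nt}).add(p)
        for q in network.get(p, ()):      # S605: iLink neighbours of p
            template[p_nt].add(rep(q))    # S609: iLinks of p now belong to p_nt
            template.setdefault(rep(q), set()).add(p_nt)
            if q not in visited:          # S606/S607: enqueue and flag
                visited.add(q)
                queue.append(q)
    return template


# NT starts as N(1) (S602); one further network N(2) is merged in.
template = {"h:A": {"h:B"}, "h:B": {"h:A"}}
n2 = {"m:A": {"m:C"}, "m:C": {"m:A", "m:D"}, "m:D": {"m:C"}}
aliases = {}
merge_network(template, aliases, n2, start="m:A", hlink={"m:A": "h:A"})
print(template)
print(aliases)
```

For n networks, step S603 corresponds to calling `merge_network` once for each of N⁽²⁾ to N⁽ⁿ⁾. The sketch assumes every hLink target is already a template node; an hLink that points to a node merged away in an earlier pass would need to be resolved transitively first.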

According to the present invention, a basic template for the as-yet-unknown interaction network of a specific protein can be generated automatically from a database of known interaction relationships for similar proteins of other species. The user can thus build the interaction network, which is important information about the target protein, incrementally and with a global view, so that the cost of network construction is markedly reduced.

Although the present invention has been described in detail with reference to an embodiment, it is not limited to that embodiment, and various modifications are of course possible without departing from the scope of the claims below.

Claims

A method for automatically generating a template for constructing a protein interaction network, which uses interaction relationships among a plurality of proteins to automatically generate a template of the unknown interaction network of a specific target protein, the method comprising:

a first step of searching a plurality of species for proteins having a function similar to that of the target protein;

a second step of generating interaction networks of the proteins retrieved in the first step, based on a database of known interaction relationships;

a third step of establishing similarity relationships between the proteins present in the interaction networks generated in the second step; and

a fourth step of generating a network template for the target protein by merging the protein nodes of the different networks between which similarity relationships were established in the third step.

The method of claim 1,

wherein the first step comprises:

when the name of the target protein and the species name are input, searching for all proteins that have the same name as the target protein but belong to a different species;

searching the protein DB for proteins of other species whose names differ from that of the target protein but whose sequences closely match it; and

selecting, from the search results, the proteins similar to the target protein.

The method of claim 1,

wherein the second step comprises:

when one of the similar proteins of other species retrieved in the first step is input, generating a network for the input protein from the interaction relationship DB;

extracting from the interaction relationship DB the proteins that interact directly with the input protein;

generating the network incrementally by connecting the interaction relationships between the input protein and the proteins that interact with it;

inserting into a queue all proteins not yet visited among the proteins that interact with the input protein, and setting a visit flag on them; and

repeating the protein extraction process until the queue is empty, thereby generating the interaction network for the input protein.

The method of claim 1,

wherein the third step comprises:

when n interaction networks generated from the proteins similar to the target protein are input, selecting an arbitrary number of the input networks in order to establish similarity relationships between the proteins present in the selected networks;

establishing similarity relationships between nodes of the same name among all protein nodes of the selected networks of different species; and

establishing sequence-based similarity relationships for all protein nodes not linked on the basis of name.

The method of claim 1,

wherein the fourth step comprises:

when networks in which similarity relationships have been established between similar protein nodes of different species are input, initializing a network template;

setting an initial protein node to visit for a breadth-first search;

extracting all protein nodes that have an interaction relationship with the protein node in the network;

inserting into a queue those of the extracted protein nodes that are visited for the first time, and marking them as visited;

determining whether the current protein node has a similarity relationship with a protein node of the network template, and merging the two protein nodes if it does;

if the current protein node has no similarity relationship with a protein node of the network template, determining whether the queue is empty and, if it is not, taking one protein node from the queue, repeatedly until the queue is empty; and

outputting the generated network template when the queue is determined to be empty.