KR20040051748A

KR20040051748A - Apparatus and method for performing genome sequence analysis and data management

Info

Publication number: KR20040051748A
Application number: KR1020020078715A
Authority: KR
Inventors: 인용호; 이근우; 이기용; 전지훈; 채수진
Original assignee: 바이오인포메틱스 주식회사
Priority date: 2002-12-11
Filing date: 2002-12-11
Publication date: 2004-06-19
Also published as: KR100650203B1

Abstract

PURPOSE: A system and method for analyzing genome sequences and managing data are provided that reduce the time needed for gene prediction and the time needed for function/pathway analysis through sequence similarity search, by integrating microbial gene prediction and sequence similarity search programs with a database. CONSTITUTION: A reference database (21) stores a large quantity of verified gene sequence data. A sequence analysis result database (22) stores genome sequence analysis results obtained from experiments. An interface (26) receives any one of user information, genome sequence information, and a search clue from a user (10). A sequence analyzer (24) analyzes the genome sequence of a genome based on the genome sequence information and the data in the reference database. A gene searcher (25) searches the data stored in the sequence analysis result database in response to the search clue and extracts detailed information on the search result from the reference database. A data manager (27) restricts user access rights to the sequence analysis result database and manages the data input/output of the databases.

Description

Apparatus and method for performing genome sequence analysis and data management

The present invention relates to a genome analysis system, and more particularly to a system and method that automatically execute the series of processes of analyzing microbial genome sequences and building the analysis results into a database.

"Genome" is a compound of "gene" and "-ome" (whole) and refers to the totality of all genes in an organism. In general, eukaryotic cells have genes of complex structure, whereas prokaryotes have genes of relatively simple structure compared with multicellular organisms. Prokaryotes have therefore been studied extensively, and predicting their genes or confirming gene function has been relatively easy.

Functional genomics refers to identifying the location and function of genes within a genome and elucidating their pathways. Functional genomics research as currently practiced can be divided broadly into two techniques. The first identifies the position of genes within the genome. As noted above, prokaryotic genes have a relatively simple structure and have been studied for a long time, so many programs for predicting prokaryotic genes have been developed. Among them, Glimmer (Gene Locator and Interpolated Markov Modeler) 2.02 is known to predict prokaryotic genes most accurately, using a hidden Markov model. The second technique analyzes the function and pathway of the identified genes. Gene function is currently determined through similarity search, most widely through BLAST search. This method reveals the similarity between previously studied sequences and a target sequence by searching a reference database for sequences similar to the sequence under comparison. It runs the similarity search between two sequences very efficiently, and comparisons can be performed in bulk while switching the database being compared against.

However, because these gene prediction and sequence similarity search programs and the databases used for data analysis are provided separately from one another, gene prediction, function analysis, pathway identification, and data storage and retrieval can only be performed individually. This makes it difficult to automate microbial sequence analysis and difficult for users to work with the tools.

The technical problem addressed by the present invention is to provide a genome sequence analysis and database management system, and a method thereof, that integrate existing microbial gene prediction and sequence similarity search programs with existing databases and thereby efficiently reduce the time required for gene prediction and the time required for function and pathway analysis through sequence similarity search.

Another technical problem addressed by the present invention is to provide a computer-readable recording medium on which a program for executing the method on a computer is recorded.

FIG. 1 is a block diagram showing the overall configuration of a genome sequence analysis and data management system according to a preferred embodiment of the present invention.

FIG. 2 schematically shows the functions performed in the genome sequence analysis and data management system shown in FIG. 1.

FIG. 3 shows the initial sequence input screen for performing the sequence input function shown in FIG. 2.

FIG. 4 shows a state in which the genome sequence analysis function shown in FIG. 2 is being performed.

FIG. 5 shows the gene position information obtained by performing the gene prediction function shown in FIG. 2, together with the corresponding functions.

FIGS. 6 to 8 show the process and results of performing the similarity search function shown in FIG. 2.

FIG. 9 shows the result of performing the domain search function shown in FIG. 2.

FIGS. 10 to 12 show the search results obtained by the function-based search function, the pathway-based search function, and the keyword search function shown in FIG. 2, respectively.

FIGS. 13 and 14 show example configurations of the per-database user rights required to perform the data management function shown in FIG. 2.

FIG. 15 shows the information stored in the reference sequence database and the sequence analysis result database shown in FIG. 1, and the relationships among that information.

FIG. 16 is a flowchart of the sequence analysis method performed by the sequence analysis unit shown in FIG. 1.

FIG. 17 is a flowchart of the gene search method performed by the gene search unit shown in FIG. 1.

<Description of Reference Numerals for Main Parts of the Drawings>

10: client
20: server
21: reference sequence DB
22: sequence analysis result DB
23: sequence input unit
24: sequence analysis unit
25: gene search unit
26: interface
27: data management unit
28: user management unit

To achieve the above object, a genome sequence analysis and data management system according to the present invention comprises: a first database in which genome sequence analysis results obtained by experiment are stored; a second database in which a large quantity of verified gene sequence data is stored; an interface that receives any one of user information, genome sequence information, and a search clue from a user; a sequence analysis unit that performs genome sequence analysis on the genome based on the genome sequence information and the data stored in the second database; a gene search unit that searches the data stored in the first database in response to the search clue and extracts detailed information on the search result from the second database; and a data management unit that restricts user access rights to the first database in response to the user information and manages data input and output for the first and second databases in response to the genome sequence information, the sequence analysis result, and the search result.

To achieve the above object, a genome sequence analysis and data management method according to the present invention comprises the steps of: (a) receiving any one of user information, genome sequence information, and a search clue from a user; (b) performing genome sequence analysis on the genome based on the genome sequence information and the data of a second database in which a large quantity of verified gene sequence data is stored; (c) determining, in response to the user information, the user's access rights to a first database in which genome sequence analysis results obtained by experiment are stored, and storing the sequence analysis result in the first database if the user is permitted to access it; and (d) searching the data stored in the first database in response to the search clue, and extracting detailed information on the search result from the second database.

To achieve the above object, a genome sequence analysis method according to the present invention comprises the steps of: (a) receiving genome sequence information from a user; (b) predicting gene regions from the genome sequence information; (c) searching a first database, in which a large quantity of verified gene sequence data is stored, for genes whose sequences are similar to the regions predicted to be genes; (d) performing a domain search for the gene against the first database; (e) if the searches in steps (c) and (d) show that the gene has at least a certain level of similarity, extracting detailed information on the gene from the first database; and (f) storing the result extracted in step (e) in a second database as a genome analysis result.
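Steps (a) to (f) of the sequence analysis method can be sketched as a small pipeline. This is only an illustration under stated assumptions: the names `Hit`, `Annotation`, `analyze_genome`, and the `min_similarity` threshold are hypothetical, the three tool calls are passed in as plain functions rather than real Glimmer, BLAST, or HMMER invocations, and both databases are modeled as dictionaries. Note that in this method the "first database" is the reference database and the "second database" is the result database.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Hit:
    reference_id: str
    similarity: float  # hypothetical scale: fraction in [0, 1]


@dataclass
class Annotation:
    gene_id: str
    start: int
    end: int
    best_hit: Optional[str] = None
    domains: List[str] = field(default_factory=list)
    details: Dict[str, str] = field(default_factory=dict)


def analyze_genome(
    genome: str,                                              # step (a): user input
    predict_genes: Callable[[str], List[Tuple[str, int, int]]],
    search_similar: Callable[[str], List[Hit]],
    search_domains: Callable[[str], List[str]],
    reference_db: Dict[str, Dict[str, str]],                  # "first database"
    result_db: Dict[str, Annotation],                         # "second database"
    min_similarity: float = 0.5,                              # assumed threshold
) -> Dict[str, Annotation]:
    for gene_id, start, end in predict_genes(genome):         # step (b)
        gene_seq = genome[start:end]
        hits = search_similar(gene_seq)                       # step (c)
        domains = search_domains(gene_seq)                    # step (d)
        annotation = Annotation(gene_id, start, end, domains=domains)
        best = max(hits, key=lambda h: h.similarity, default=None)
        if best is not None and best.similarity >= min_similarity:
            # step (e): pull details only for sufficiently similar genes
            annotation.best_hit = best.reference_id
            annotation.details = reference_db.get(best.reference_id, {})
        # step (f): this sketch stores every predicted gene, annotated or not
        result_db[gene_id] = annotation
    return result_db
```

Passing the tools in as callables keeps the control flow visible without committing to any particular command-line interface.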

To achieve the above object, a gene search method according to the present invention comprises the steps of: (a) a user selecting one search mode from among function-based, pathway-based, and keyword search; (b) if function-based search is selected in step (a), performing a gene search for a given function category, among a plurality of function categories, in a first database in which genome sequence analysis results obtained by experiment are stored; (c) if pathway-based search is selected in step (a), performing a gene search for a given pathway category, among a plurality of pathway categories, in the first database; (d) if keyword search is selected in step (a), performing a gene search in the first database for the keyword entered by the user; (e) displaying the search results obtained in steps (b), (c), and (d); and (f) when a gene is selected from the search results displayed in step (e), retrieving and displaying detailed information on that gene from a second database in which a large quantity of verified gene sequence data is stored.
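The three search modes and the detail lookup of steps (a) to (f) amount to a dispatch over the result database followed by a lookup in the reference database. The sketch below assumes a hypothetical `GeneRecord` layout and models the first database as a list and the second as a dictionary keyed by gene identifier; none of these names come from the patent.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class GeneRecord:
    gene_id: str
    function_category: str
    pathway_category: str
    description: str


def search_genes(result_db: List[GeneRecord], mode: str, value: str) -> List[GeneRecord]:
    """Steps (b) to (d): search the result database by the selected mode."""
    if mode == "function":
        return [g for g in result_db if g.function_category == value]
    if mode == "pathway":
        return [g for g in result_db if g.pathway_category == value]
    if mode == "keyword":
        needle = value.lower()
        return [g for g in result_db if needle in g.description.lower()]
    raise ValueError(f"unknown search mode: {mode!r}")


def gene_details(reference_db: Dict[str, Dict[str, str]], gene: GeneRecord) -> Dict[str, str]:
    """Step (f): fetch detailed information for a selected gene."""
    return reference_db.get(gene.gene_id, {})
```

For example, `search_genes(records, "keyword", "kinase")` would return every record whose description mentions "kinase", and `gene_details` would then be called only for the record the user clicks.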

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram showing the overall configuration of a genome sequence analysis and data management system 100 according to a preferred embodiment of the present invention. Referring to FIG. 1, the system 100 includes at least one client 10 connected to a network 1, and a genome sequence analysis and data management server 20 that provides genome sequence analysis and data management services to the client 10 through the network 1.

The server 20 includes a reference sequence database 21, a sequence analysis result database 22, a sequence input unit 23, a sequence analysis unit 24, a gene search unit 25, an interface 26, and a data management unit 27.

The reference sequence database 21 is a gene sequence database storing a large quantity of verified gene sequence data; it holds various gene-related information such as gene sequence information, gene function information, gene pathway information, and domain information. The sequence analysis result database 22 stores the genome sequence analysis results obtained by experiment. Depending on the analysis target and other factors, several sequence analysis result databases 22 may be set up, each stored under a different name.

The interface 26 receives user information and genome sequence information from the client 10 and passes them to the blocks that require them. It also delivers the genome sequence analysis results and gene search results produced by the genome sequence analysis and data management server 20 to the corresponding client 10.

When a user enters genome sequence information through the interface 26, the sequence input unit 23 converts it into a fixed format (FASTA format) and passes it to the sequence analysis unit 24. It also registers the genome sequence information with the data management unit 27, so that a sequence analysis result database 22 is created under the name the user chooses.
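The conversion to FASTA format can be illustrated with a few lines of code. This is a minimal sketch: the function name `to_fasta`, the cleaning rule (keep letters only, upper-case them), and the 60-character line width are assumptions, since the patent does not say how the input is normalized.

```python
def to_fasta(name: str, raw_sequence: str, width: int = 60) -> str:
    """Return a FASTA record: a '>' header line followed by wrapped sequence lines."""
    # Drop whitespace, digits, and punctuation that often accompany pasted sequences.
    cleaned = "".join(ch for ch in raw_sequence if ch.isalpha()).upper()
    if not cleaned:
        raise ValueError("no sequence characters found")
    lines = [cleaned[i:i + width] for i in range(0, len(cleaned), width)]
    return f">{name}\n" + "\n".join(lines) + "\n"
```

Calling `to_fasta("sample", "atg c\n12 gta", width=4)` yields the header `>sample` followed by the lines `ATGC` and `GTA`.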

The sequence analysis unit 24 uses Glimmer 2.02 and the EMBOSS (European Molecular Biology Open Software Suite) programs to predict the gene regions in the genome received from the sequence input unit 23, then runs a similarity search on the predicted gene regions to analyze the function and pathway of each gene. After the similarity search finishes, it uses HMMER (hidden Markov model software) to search the gene regions for domains. When this series of analyses is complete, the results are stored automatically in the sequence analysis result database 22 so that the user can search for genes of interest.

The gene search unit 25 receives a function-based, pathway-based, or keyword search clue from the user and performs the corresponding search on the data stored in the sequence analysis result database 22. For example, when a user enters a search clue, the gene search unit 25 retrieves the matching gene data from the sequence analysis result database 22 and then extracts detailed information on the retrieved genes from the reference sequence database 21.

The data management unit 27 includes a user management unit 28 and, in response to user information entered through the interface 26, restricts user access rights to the sequence analysis result database 22. In response to the gene sequence analysis results received from the sequence analysis unit 24 and the search results received from the gene search unit 25, it also manages data input and output for the reference sequence database 21 and the sequence analysis result database 22.

Because the system 100 configured in this way is a client/server system, communication between client and server works only once information about the server 20 has been entered into the program running on the client 10. The basic server-side information to be entered on the client 10 includes the server IP (Internet Protocol) address and the port number.
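As a small illustration of the client-side settings mentioned here, the sketch below validates an IP address and port before the client tries to connect. The `ServerConfig` class and its field names are hypothetical; the patent names only the two pieces of information, not how they are stored or checked.

```python
import ipaddress
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ServerConfig:
    ip: str
    port: int

    def __post_init__(self) -> None:
        # ip_address raises ValueError for malformed IPv4/IPv6 strings.
        ipaddress.ip_address(self.ip)
        if not 0 < self.port < 65536:
            raise ValueError(f"port out of range: {self.port}")

    def address(self) -> Tuple[str, int]:
        """Return the (host, port) pair in the form socket functions expect."""
        return (self.ip, self.port)
```

The tuple returned by `address()` has the shape accepted by `socket.create_connection`, so a client could pass it straight through once the values are validated.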

FIG. 2 schematically shows the functions performed in the genome sequence analysis and data management system 100 shown in FIG. 1. Referring to FIGS. 1 and 2, the genome sequence analysis and data management functions 200 performed in the system 100 are broadly divided into a sequence input function 230, a genome sequence analysis function 240, a gene search function 250, and a data management function 270.

The sequence input function 230, performed by the sequence input unit 23, converts the genome sequence information that a user enters through the interface 26 into a predetermined format and supplies it to the blocks that require it.

FIG. 3 shows the initial sequence input screen for performing the sequence input function 230 shown in FIG. 2. Referring to FIG. 3, to enter the genome sequence information to be analyzed by the genome sequence analysis and data management system 100, the user first selects one of the sequence analysis result databases provided on the server 20 and then enters the genome sequence data to be analyzed. The genome sequence analysis function 240 and the gene search function 250 are performed on the genome sequence information entered in this way.

Referring again to FIGS. 1 and 2, the genome sequence analysis function 240 performed by the sequence analysis unit 24 is subdivided into a gene prediction function 242, a similarity search function 244, and a domain search function 246. The genome sequence analysis function performed by the sequence analysis unit 24 is described in detail below.

FIG. 4 shows a state in which the genome sequence analysis function 240 shown in FIG. 2 is being performed. Referring to FIG. 4, once the user enters a sequence analysis result database name and the genome sequence data and presses the run button, all steps of genome sequence analysis (gene prediction, similarity search, and domain search) run automatically on the server 20. If the user quits the client-side program in the meantime, the server 20 continues the genome sequence analysis regardless. When the user starts the client program again, the current progress of the analysis is shown on the client 10, as in FIG. 4.
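One way to picture this behavior is a server-side registry that records how far each analysis has progressed, so that a client reconnecting later can simply ask for the status. The sketch below is an assumption-laden illustration: `JobRegistry`, its methods, and the status strings are hypothetical, and the actual execution of the three stages is left out.

```python
from typing import Dict

# The three stages named in the description, in execution order.
STAGES = ["gene prediction", "similarity search", "domain search"]


class JobRegistry:
    """Server-side progress record, keyed by result database name."""

    def __init__(self) -> None:
        self._done: Dict[str, int] = {}

    def start(self, db_name: str) -> None:
        self._done[db_name] = 0

    def complete_stage(self, db_name: str) -> None:
        # Called by the server after each stage, whether or not a client is connected.
        self._done[db_name] = min(self._done[db_name] + 1, len(STAGES))

    def status(self, db_name: str) -> str:
        # Called when a client (re)connects and asks for current progress.
        done = self._done[db_name]
        if done == len(STAGES):
            return "finished"
        return f"running: {STAGES[done]} ({done}/{len(STAGES)} stages done)"
```

Because the registry lives on the server, its state does not depend on whether the client program is still running.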

FIG. 5 shows the gene position information obtained by performing the gene prediction function 242 shown in FIG. 2, together with the corresponding functions. Referring to FIGS. 2 and 5, the gene prediction function 242 predicts the gene regions in the genome sequence received from the sequence input unit 23, and is carried out by gene prediction programs such as Glimmer 2.02 (2421) and EMBOSS (2422). As its prediction result it produces gene position information and the like, as shown in FIG. 5. Each gene occupies a specific position in the genome, and that positional information is an important clue for analyzing the gene's sequence. In the display, the length of each bar indicates the length of the gene, its placement above or below indicates the gene's orientation, and its color indicates the gene's function. The function assigned to each color is listed on the right side of FIG. 5. The user can jump to a location by entering a sequence position or a page number, and can enlarge or shrink the view with the zoom-in and zoom-out functions.

FIGS. 6 to 8 show the process and results of performing the similarity search function 244 shown in FIG. 2. Referring to FIG. 2 and FIGS. 6 to 8, the similarity search function 244 works with the reference sequence database 21, as shown in FIG. 2, to search for similar gene sequences. It has its own interface so that, like a BLAST search, it can also be run as a standalone function.

As shown in FIG. 6, the BLAST search performs one of the following searches depending on the option selected. BLASTn compares nucleotide sequences with nucleotide sequences. BLASTp compares protein sequences with protein sequences. BLASTx translates the input nucleotide sequence in six frames and compares the translations with the protein sequence database 21. tBLASTn translates the nucleotide sequence database in six frames and compares it with the input protein sequence. tBLASTx translates both the input nucleotide sequence and the nucleotide sequence database in six frames and compares them.
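The six-frame translation used by BLASTx, tBLASTn, and tBLASTx can be shown concretely: three reading frames on the given strand and three on its reverse complement. The sketch below uses the standard genetic code and is an illustration of the idea only, not how the BLAST programs implement it; the function names and the `+1` to `-3` frame labels are choices made here, and unknown codons are rendered as `X`.

```python
from itertools import product
from typing import Dict

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons only; a trailing partial codon is ignored."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna: str) -> Dict[str, str]:
    """Return the protein translation of all six reading frames."""
    forward = dna.upper()
    reverse = reverse_complement(forward)
    frames: Dict[str, str] = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(forward[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames
```

For the input `ATGGCCTAA`, frame `+1` reads `ATG GCC TAA` and gives `MA*` (methionine, alanine, stop), while frame `-1` reads the reverse complement `TTA GGC CAT` and gives `LGH`.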

On the screen of FIG. 6, the user either types in a sequence directly or browses for a sequence file in a fixed format (for example, FASTA format), selects suitable parameters, and presses the run button. A similarity search is then performed on the entered sequence data, and a result screen like that of FIG. 7 is displayed.

Referring to FIG. 7, the BLAST search results are displayed in HSP (high-scoring segment pair) order, that is, in descending order of partial similarity. Because this partial similarity is not corrected for sequence length, however, the expect value is generally regarded as the more important figure.
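The difference between the two orderings can be made concrete with a toy example. In the sketch below, `BlastHit` and its fields are hypothetical stand-ins for parsed BLAST output, and the numbers used in the usage note are invented purely for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BlastHit:
    subject_id: str
    hsp_score: float  # score of the best local alignment segment
    e_value: float    # expected number of chance hits at least this good


def rank_by_score(hits: List[BlastHit]) -> List[BlastHit]:
    """Order hits as on the result screen: highest HSP score first."""
    return sorted(hits, key=lambda h: h.hsp_score, reverse=True)


def rank_by_e_value(hits: List[BlastHit]) -> List[BlastHit]:
    """Order hits by statistical significance: smallest expect value first."""
    return sorted(hits, key=lambda h: h.e_value)
```

With made-up hits such as `BlastHit("long_weak", 120.0, 1e-3)` and `BlastHit("short_strong", 95.0, 1e-20)`, the score ordering puts `long_weak` first while the expect-value ordering puts `short_strong` first, which is the situation the paragraph above describes.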

If the user wants a graphical view of the results, pressing the Graphical result button shown in FIG. 7 displays a screen like that of FIG. 8, which makes the search results easier to understand.

FIG. 9 shows the result of performing the domain search function 246 shown in FIG. 2. Referring to FIGS. 2 and 9, the domain search function 246 searches for domains within a gene and is carried out using the HMMER program. HMMER was developed on the basis of a mathematical method that applies hidden Markov models; it takes in all the sequences and then finds the sequences that can be compared. Like the BLAST search, it searches against the reference sequence database 21, but its search method differs from that of BLAST, which looks for similarity using a single model sequence.

Referring again to FIGS. 1 and 2, the gene search function 250 performed by the gene search unit 25 is subdivided into a function-based search function 252, a pathway-based search function 254, and a keyword search function 256. The function-based search function 252 searches by group, using 18 groups classified in advance by function category; the pathway-based search function 254 searches by group, using 51 groups classified in advance by pathway category. The keyword search function 256 searches for data matching a keyword entered by the user.

FIGS. 10 to 12 show the search results obtained by the function-based search function 252, the pathway-based search function 254, and the keyword search function 256 shown in FIG. 2, respectively. As FIGS. 10 to 12 show, the search is carried out differently depending on the search method the user specifies. When the user specifies a search method and a function category, pathway category, or keyword on the search screen, a list is displayed giving brief information on the genes in the selected sequence analysis result database that belong to that function category or pathway category. When the user clicks a gene in the displayed list, gene-related information such as gene name, gene position, gene length, gene function, related pathways, and domain information is retrieved from the reference sequence database 21 and displayed.

Referring again to FIGS. 1 and 2, the data management function (270) performed by the data management unit (27) is subdivided into a reference sequence database management function (210), a genome sequence analysis result management function (220), and a user management function (280).

FIGS. 13 and 14 show example configurations of the per-database user permissions required to perform the data management function (270) of FIG. 2.

FIG. 13 lists each user name together with the names of the sequence analysis result databases (22) available to that user. The data management unit (27) can restrict each user's access rights to each sequence analysis result database (22), which lets the administrator maintain security between users. This is referred to as the user management function (280).

FIG. 14 lists the name of each sequence analysis result database (22) together with the names of the users authorized to delete it. The data management unit (27) lets a user who holds deletion rights delete a sequence analysis result database (22) that is no longer needed, which reduces the load on the server's disk capacity. In addition, during a gene search the data management unit (27) links each table of the sequence analysis result database (22) to the reference sequence database (21), thereby controlling data input and output for the databases (21, 22). These are referred to as the reference sequence database management function (210) and the genome sequence analysis result management function (220).
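The two permission tables of FIGS. 13 and 14 can be pictured as a pair of lookups. The following is only a minimal sketch under stated assumptions: the user names, database names, and the functions `can_access` and `delete_database` are hypothetical and do not come from the patent, which does not disclose how the tables are implemented.

```python
# Hypothetical permission tables modelled on FIGS. 13 and 14.
# All names and sample data below are illustrative assumptions.

# FIG. 13 style: user name -> result databases the user may access
ACCESS = {
    "alice": {"ecoli_run1", "bsub_run2"},
    "bob": {"bsub_run2"},
}

# FIG. 14 style: result database name -> users allowed to delete it
DELETE = {
    "ecoli_run1": {"alice"},
    "bsub_run2": {"alice"},
}

# Stand-in for the stored sequence analysis result databases (22)
databases = {"ecoli_run1": ["geneA"], "bsub_run2": ["geneB"]}


def can_access(user: str, db_name: str) -> bool:
    """Return True if the user may use the named result database."""
    return db_name in ACCESS.get(user, set())


def delete_database(user: str, db_name: str) -> bool:
    """Delete a result database only if the user holds deletion rights."""
    if user not in DELETE.get(db_name, set()):
        return False
    databases.pop(db_name, None)
    DELETE.pop(db_name, None)
    # Remove the deleted database from every user's access list.
    for allowed in ACCESS.values():
        allowed.discard(db_name)
    return True
```

Keeping access rights and deletion rights in separate tables mirrors the split between FIG. 13 and FIG. 14: a user may be allowed to query a result database without being allowed to remove it.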

Here, the sequence analysis result database (22) does not store the detailed gene information extracted from the reference sequence database (21) itself. Instead, it stores the entry information of the reference sequence database (21) under which that data is held. The data models of the sequence analysis result database (22) and the reference sequence database (21) configured in this way are as follows.

FIG. 15 shows the information stored in the reference sequence database (21) and the sequence analysis result database (22) of FIG. 1 and the relationships among that information. The direction of each arrow attached to a block points to the table that holds the data the block references.

Referring to FIG. 15, the reference sequence database (21) includes a COG (Clusters of Orthologous Groups of proteins) function information table (212), a COG pathway information table (214), a KEGG (Kyoto Encyclopedia of Genes and Genomes) information table (216), and a SwissPfam information table (218). The data stored in the tables (212-218) is the large body of verified data held in the reference sequence database (21), divided by category, and includes gene names, species, gene functions, gene pathways, keyword information, and the like.

The COG function information table (212) stores function information for each gene, and the function of each gene is predicted on the basis of this information. The COG pathway information table (214) stores pathway information for each gene and is used to predict the pathways associated with a gene. The KEGG information table (216) stores gene names, species, KEGG pathway information, and the like, and the SwissPfam information table (218) stores gene names, domain names, Pfam database accession numbers, position information, and the like.

The sequence analysis result database (22) includes a similarity search result table (222) and a domain search result table (224). The similarity search result table (222) is linked to the COG function information table (212) and the COG pathway information table (214) of the reference sequence database (21) and is used to analyze the function and pathway of each gene. The domain search result table (224) is linked to the KEGG information table (216) and the SwissPfam information table (218) of the reference sequence database (21), and obtains from the SwissPfam information table (218) domain information such as gene name, domain name, Pfam database accession number, and position information.
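The linkage between result tables and reference tables can be illustrated with a small relational sketch. It assumes an in-memory SQLite database; the table names, column names, and sample rows are hypothetical, and the KEGG information table (216) is left out for brevity. The point it illustrates is that the result tables hold only entry identifiers, and the details are fetched from the reference tables by a join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Reference sequence database (21): verified data, keyed by entry id
CREATE TABLE cog_function (entry_id TEXT PRIMARY KEY, gene_name TEXT, func_desc TEXT);
CREATE TABLE cog_pathway  (entry_id TEXT PRIMARY KEY, pathway TEXT);
CREATE TABLE swisspfam    (entry_id TEXT PRIMARY KEY, gene_name TEXT,
                           domain_name TEXT, pfam_accession TEXT, location TEXT);

-- Sequence analysis result database (22): stores entry references only
CREATE TABLE similarity_result (gene_id TEXT, cog_entry TEXT);
CREATE TABLE domain_result     (gene_id TEXT, pfam_entry TEXT);
""")

# Placeholder rows; identifiers are made up for illustration.
conn.execute("INSERT INTO cog_function VALUES ('COG_X1', 'geneX', 'example function')")
conn.execute("INSERT INTO cog_pathway VALUES ('COG_X1', 'example pathway')")
conn.execute(
    "INSERT INTO swisspfam VALUES ('PF_X1', 'geneX', 'example domain', 'PF_X1', '10-120')"
)
conn.execute("INSERT INTO similarity_result VALUES ('orf_0001', 'COG_X1')")
conn.execute("INSERT INTO domain_result VALUES ('orf_0001', 'PF_X1')")


def gene_details(gene_id):
    """Resolve a predicted gene's stored entry references into full details."""
    return conn.execute(
        """
        SELECT f.gene_name, f.func_desc, p.pathway, d.domain_name, d.pfam_accession
        FROM similarity_result s
        JOIN cog_function f ON f.entry_id = s.cog_entry
        JOIN cog_pathway  p ON p.entry_id = s.cog_entry
        LEFT JOIN domain_result r ON r.gene_id = s.gene_id
        LEFT JOIN swisspfam d ON d.entry_id = r.pfam_entry
        WHERE s.gene_id = ?
        """,
        (gene_id,),
    ).fetchone()
```

Because the result tables carry only references, correcting a record in the reference tables is immediately reflected in every stored analysis result that points to it.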

The data management unit (27) is provided with a user information table (282), which is used to manage information on each user. The user information table (282) contains information such as a user identifier, a password, a user name, and the user's access rights to the databases.

With reference to FIG. 15, the reference sequence database management function (210), the genome sequence analysis result database management function (220), and the user management function (280) performed by the data management unit (27) are described below.

First, when user information is entered through the interface (26), the data management unit (27) determines, based on the user information stored in the user information table (282), whether to permit access to the data in the databases (21, 22). If the user is permitted to access the databases (21, 22), the genome sequence analysis result received from the sequence analysis unit (24) is stored in the sequence analysis result database (22).

Meanwhile, when the user enters a genome sequence through the interface (26), the sequence input unit (23) converts the genome sequence into a predetermined format and passes it to the sequence analysis unit (24). The sequence analysis unit (24) performs gene prediction and generates gene position information and gene image information. A similarity search and a domain search are then performed in turn on the basis of the gene position information produced by the gene prediction. The domain search result and the similarity search result produced by the sequence analysis unit (24) are each passed to the data management unit (27), which stores them in the domain search result table (224) and the similarity search result table (222) of the sequence analysis result database (22), respectively. The similarity search result table (222) is linked to the COG function information table (212) and the COG pathway information table (214) of the reference sequence database (21) and is used to analyze the function and pathway of each gene. The domain search result table (224) is linked to the KEGG information table (216) and the SwissPfam information table (218) of the reference sequence database (21), and obtains from the SwissPfam information table (218) domain information such as gene name, domain name, Pfam database accession number, and position information.

FIG. 16 is a flowchart of the sequence analysis method performed by the sequence analysis unit (24) of FIG. 1. Referring to FIG. 16, the sequence analysis unit (24) first receives a genome sequence from the user (step 2410) and predicts the gene portions of the genome sequence (step 2420).

Next, a similarity search such as a BLAST search is performed to analyze the function and pathway of each portion predicted to be a gene (step 2440). If the similarity search finds that the sequence of the retrieved gene is similar at or above a certain level, the similarity search result table (222) of the sequence analysis result database (22) is linked to the COG function information table (212) and the COG pathway information table (214) of the reference sequence database (21) to identify the function and pathway of the gene (step 2480). The identified function and pathway information of the gene is then stored in the sequence analysis result database (22), thereby building the sequence analysis result database (22) (step 2490).

Meanwhile, for each portion predicted to be a gene in step 2420, a domain search is performed while the similarity search is running (step 2460). If the domain search result is found to be similar at or above a certain level, the domain search result table (224) of the sequence analysis result database (22) is linked to the KEGG information table (216) and the SwissPfam information table (218) of the reference sequence database (21) to identify the name, Pfam database accession number, position information, and the like of each domain within the gene (step 2480). The identified information is then stored in the sequence analysis result database (22), thereby building the sequence analysis result database (22) (step 2490).
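The flow of FIG. 16 can be summarized as a short orchestration sketch. It rests on several assumptions: the gene predictor and the two search tools are passed in as callables because the patent relies on external programs for them, the function name `analyze_genome`, the `Hit` type, and the 0.5 threshold are hypothetical, and the two searches run one after the other here although the description has the domain search proceed while the similarity search is running.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# (reference entry id, similarity score); the score scale is an assumption
Hit = Tuple[str, float]


def analyze_genome(
    genome: str,
    predict_genes: Callable[[str], Iterable[Tuple[str, str]]],
    similarity_search: Callable[[str], Optional[Hit]],
    domain_search: Callable[[str], Optional[Hit]],
    threshold: float = 0.5,  # assumed value for the "certain level" of similarity
) -> Dict[str, List[Tuple[str, str]]]:
    """Run prediction, then both searches, keeping only hits above the threshold."""
    results: Dict[str, List[Tuple[str, str]]] = {"similarity": [], "domain": []}
    for gene_id, gene_seq in predict_genes(genome):      # step 2420
        sim = similarity_search(gene_seq)                # step 2440
        if sim is not None and sim[1] >= threshold:
            # Store only the reference entry id (steps 2480 and 2490).
            results["similarity"].append((gene_id, sim[0]))
        dom = domain_search(gene_seq)                    # step 2460
        if dom is not None and dom[1] >= threshold:
            results["domain"].append((gene_id, dom[0]))
    return results
```

The returned lists correspond to rows for the similarity search result table (222) and the domain search result table (224); each row pairs a predicted gene with a reference entry rather than with copied details.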

The sequence analysis result database (22) built in this way allows the information the user wants to be analyzed and provided through the function search, pathway search, and keyword search performed by the gene search unit (25).

FIG. 17 is a flowchart of the gene search method performed by the gene search unit (25) of FIG. 1. Referring to FIG. 17, the user first selects one search method from among function search, pathway search, and keyword search (step 2510).

If function search is selected in step 2510, one of the function categories, which are classified into 18 items, is first selected (step 2522). The genes stored in the sequence analysis result database (22) that belong to the selected function group are then searched (step 2524), and the search result is displayed (step 2570). The user can obtain detailed information on each gene from the search result displayed in step 2570 (step 2580). For example, when the genes belonging to the selected function group are displayed and the user clicks one of them, detailed information on that gene is retrieved from the reference sequence database (21) and displayed.

If pathway search is selected in step 2510, one of the pathway categories, which are classified into 51 items, is first selected (step 2542). The genes stored in the sequence analysis result database (22) that belong to the selected pathway group are then searched (step 2544), and the search result is displayed (step 2570). By selecting any gene from the search result displayed in step 2570, the user can obtain detailed information on that gene from the reference sequence database (21) (step 2580).

If keyword search is selected in step 2510, the user first enters a keyword for the gene to be searched (step 2562). The genes matching the entered keyword are retrieved from the sequence analysis result database (22) (step 2564), and the search result is displayed (step 2570). The user then obtains detailed information on each gene from the search result displayed in step 2570 (step 2580).
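The three branches of FIG. 17 share the same display and detail steps, which the following sketch tries to make visible. Everything in it is an assumption made for illustration: the in-memory tables `RESULTS` and `REFERENCE`, their field names, the group codes, and the functions `search_genes` and `gene_detail`. In particular, the keyword branch here matches against the linked reference entry, which is one possible reading of how the linked tables are consulted.

```python
# Hypothetical in-memory stand-ins; field names and sample rows are assumptions.
RESULTS = [  # sequence analysis result database (22): entry references only
    {"gene_id": "orf1", "function_group": "F01", "pathway_group": "P07", "ref_entry": "E1"},
    {"gene_id": "orf2", "function_group": "F02", "pathway_group": "P07", "ref_entry": "E2"},
]
REFERENCE = {  # reference sequence database (21): detailed information
    "E1": {"gene_name": "geneX", "keywords": "kinase transfer"},
    "E2": {"gene_name": "geneY", "keywords": "transport membrane"},
}


def search_genes(method: str, value: str) -> list:
    """Return the gene list shown at step 2570 for the chosen search method."""
    if method == "function":        # steps 2522 and 2524
        hits = [r for r in RESULTS if r["function_group"] == value]
    elif method == "pathway":       # steps 2542 and 2544
        hits = [r for r in RESULTS if r["pathway_group"] == value]
    elif method == "keyword":       # steps 2562 and 2564
        hits = [
            r for r in RESULTS
            if value.lower() in REFERENCE[r["ref_entry"]]["keywords"].lower()
        ]
    else:
        raise ValueError(f"unknown search method: {method}")
    return [r["gene_id"] for r in hits]


def gene_detail(gene_id: str):
    """Step 2580: follow the stored entry reference into the reference database."""
    for r in RESULTS:
        if r["gene_id"] == gene_id:
            return REFERENCE[r["ref_entry"]]
    return None
```

A call such as `search_genes("pathway", "P07")` produces the brief list, and `gene_detail` plays the role of the click on a list item that pulls the full record from the reference sequence database.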

As described above, the genome sequence analysis and data management system according to the present invention provides function search, pathway search, and keyword search, and thereby provides detailed information on the genes the user is looking for.

The above has described in detail, as an embodiment of the present invention, a genome sequence analysis and data management system that builds a sequence analysis result database by analyzing genome sequences against an arbitrary reference sequence database and that integrates the search and analysis functions of the two databases. Databases for various other kinds of genomes can also be applied to the present invention, and the invention can be applied in a web environment or a local environment as well as in a client/server environment.

The present invention can also be embodied as computer-readable code on a computer-readable recording medium. The computer-readable recording medium includes every kind of recording device that stores data readable by a computer system. Examples include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices, and also implementations in the form of a carrier wave (for example, transmission over the Internet). The computer-readable recording medium can also be distributed over network-connected computer systems so that the computer-readable code is stored and executed in a distributed manner.

As described above, in the genome sequence analysis and data management system according to the present invention, the genome sequence analysis program, the sequence analysis result database, and the reference sequence database, which previously existed separately, are integrated into one, so that genome sequence analysis, storage of the analysis results, and searching of the stored data can be performed efficiently. Genome sequence analysis and data management of the analysis results can therefore be carried out easily in a client/server environment.

Claims

A genome sequence analysis and data management system comprising:

a first database storing genome sequence analysis results obtained by experiment;

a second database storing a large amount of verified gene sequence related data;

an interface that receives one of user information, genome sequence information, and a search clue from a user;

a sequence analysis unit that performs genome sequence analysis on a genome based on the genome sequence information and the data stored in the second database;

a gene search unit that searches the data stored in the first database in response to the search clue and extracts detailed information about a search result from the second database; and

a data management unit that restricts user access to the first database in response to the user information, and manages data input and output of the first database and the second database in response to the genome sequence information, the sequence analysis result, and the search result.

The system of claim 1,

wherein the interface receives one of the user information, the genome sequence information, and the search clue from at least one client connected through a network, delivers it to one of the data management unit, the sequence analysis unit, and the gene search unit, and delivers the sequence analysis result and the search result to the client through the network.

The system of claim 1, wherein the second database comprises:

a COG (Clusters of Orthologous Groups) function information table storing function information for each gene;

a COG pathway information table storing pathway information for each gene; and

a domain information table storing domain information for each gene.

The system of claim 3, wherein the first database comprises:

a similarity search result table linked to the COG function information table and the COG pathway information table and receiving the function information and the pathway information for each gene; and

a domain search result table linked to the domain information table and receiving the domain information for each gene.

The system of claim 1,

wherein the sequence analysis unit performs a gene prediction function that predicts gene portions from the genome sequence information, a similarity search function that searches for genes similar in sequence to a portion predicted to be a gene, and a domain search function that searches for domains of the gene.

The system of claim 1,

wherein the gene search unit performs any one of a function-based search that searches genes classified into predetermined groups by function, a pathway-based search that searches genes classified into predetermined groups by pathway, and a keyword search that searches for genes corresponding to a keyword entered by a user.

The system of claim 1,

wherein the data management unit comprises a user information table including a user identifier, a password, a user name, and user access authority information for the first database.

The system of claim 7,

wherein the data management unit manages, for each user, data access rights and database deletion rights for the first database based on the information stored in the user information table.

The system of claim 7,

wherein the data management unit stores the genome sequence analysis result in the first database when the user has access rights to the first database, and links the first database to the second database to perform a detailed search of the data stored in the first database.

The system of claim 9,

wherein the first database stores, instead of the detailed gene information extracted from the second database itself, the entry information of the second database in which the corresponding data is stored.

A genome sequence analysis and data management method comprising:

(a) receiving one of user information, genome sequence information, and a search clue from a user;

(b) performing genome sequence analysis on a genome based on the genome sequence information and the data of a second database storing a large amount of verified gene sequence related data;

(c) determining, in response to the user information, the user's access rights to a first database storing genome sequence analysis results obtained by experiment, and, if the user has access rights to the first database, storing the sequence analysis result in the first database; and

(d) searching the data stored in the first database in response to the search clue, and extracting detailed information about the search result from the second database.

The method of claim 11,

wherein the user information, the genome sequence information, and the search clue are input from at least one client connected through a network.

The method of claim 11,

further comprising (e) providing the sequence analysis result and the search result to the client through the network.

The method of claim 11, wherein the second database

comprises a domain information table storing domain information for each gene.

The method of claim 14, wherein the first database,

The method of claim 11, wherein step (b) comprises:

(b-1) predicting a gene portion from the genomic sequence information;

(b-2) searching the second database for genes similar in sequence to the portion predicted to be a gene; and

(b-3) performing a domain search for the gene from the second database.

The method of claim 11,

wherein step (d) performs any one of a function-based search that searches genes classified into predetermined groups by function, a pathway-based search that searches genes classified into predetermined groups by pathway, and a keyword search that searches for genes corresponding to a keyword entered by a user.

The method of claim 11,

wherein the first database stores, instead of the detailed information of the gene retrieved in step (d) itself, the entry information of the second database in which the corresponding data is stored.

A genome sequence analysis method comprising:

(a) receiving genome sequence information from a user;

(b) predicting gene portions from the genome sequence information;

(c) searching a first database, in which a large amount of verified gene sequence related data is stored, for genes similar in sequence to a portion predicted to be a gene;

(d) performing a domain search for the gene on the first database;

(e) extracting detailed information about the gene from the first database when the searches performed in steps (c) and (d) show that the gene has similarity at or above a predetermined level; and

(f) storing the result extracted in step (e) in a second database as a genome analysis result.

The method of claim 19,

wherein the genome sequence information is input from at least one client connected through a network.

The method of claim 19,

further comprising (g) providing the analysis result of the genome sequence information to the client through the network.

The method of claim 19,

further comprising managing information on the users permitted to access the second database so as to maintain security between users.

A gene search method comprising:

(a) selecting, by a user, one of a function-based search, a pathway-based search, and a keyword search;

(b) if the function-based search is selected in step (a), performing a gene search for any one of a plurality of function categories on a first database in which genome sequence analysis results obtained by experiment are stored;

(c) if the pathway-based search is selected in step (a), performing a gene search for any one of a plurality of pathway categories on the first database;

(d) if the keyword search is selected in step (a), performing a gene search on the first database for a keyword entered by the user;

(e) displaying the search results of steps (b), (c), and (d); and

(f) if any gene is selected from the search results displayed in step (e), retrieving detailed information about the gene from a second database, in which a large amount of verified gene sequence related data is stored, and displaying it.

The method of claim 23,

wherein the search clue for performing the function-based, pathway-based, and keyword searches is input from at least one client connected through a network.

The method of claim 23,

wherein the search results of steps (e) and (f) are displayed on the client side through the network.

A computer-readable recording medium having recorded thereon a program for executing the method of any one of claims 11 to 25 on a computer.