KR20040064763A

KR20040064763A - Client/server based workbench system and method for expressed sequence tag analysis

Info

Publication number: KR20040064763A
Application number: KR1020030001600A
Authority: KR
Inventors: 인용호; 김형용; 노미라; 채수진
Original assignee: 바이오인포메틱스 주식회사
Priority date: 2003-01-10
Filing date: 2003-01-10
Publication date: 2004-07-21
Also published as: KR100513266B1

Abstract

PURPOSE: A client/server-based system and method for analyzing EST (Expressed Sequence Tag) sequences are provided. They efficiently integrate gene sequence databases and sequence analysis tools, and infer the function of an EST sequence by extracting similar candidate gene and protein domain information from sequence search results. CONSTITUTION: An I/O (Input/Output) manager (240) receives EST sequence data produced by a user's experiment. A sequence input part (250) converts the EST sequence data into a predetermined format and stores it in a first database (220). Annotation analyzers (260, 270) perform similarity and protein domain searches between the EST sequence data stored in the first database and the sequence data stored in a second database (230), which holds verified gene and protein sequence data, and store the results in the first database. An analysis result searcher (280) searches the data stored in the first database in response to a search key entered by the user.

Description

Client/server based workbench system and method for expressed sequence tag analysis

The present invention relates to a genome analysis system, and more particularly to a system and method for storing, analyzing, and retrieving sequences and sequence analysis information obtained by EST (Expressed Sequence Tag) sequencing, one of the methods used in biological research.

EST stands for Expressed Sequence Tag and refers to a fragment of a gene that is expressed as mRNA (messenger RNA) from the original genome sequence in order to function in an organism. In general, in a prokaryotic genome, introns and exons are not distinguished, and the genome sequence is transcribed directly into mRNA, from which protein is formed. In contrast, the cells of a eukaryote share the same genome but have different proteomes. That is, in a eukaryote, RNA is transcribed from the genome sequence differently across time and space, and the final mRNA is produced after post-transcriptional modification such as the excision of introns. Such mRNA can be recovered experimentally in large quantities as a cDNA (complementary DNA) library using reverse transcription, and a fragment of such a cDNA sequence is called an EST. Studying the ESTs of a eukaryote is therefore a more effective experimental technique for identifying functional genes than studying the entire genome.

ESTs mass-produced in this way are stored and analyzed by the experimenter. Because the volume of data to be analyzed is very large, effective analysis of EST results requires that the results be organized into a database (DB) and that an integrated search across that database and existing sequence data be possible. However, existing sequence information is provided separately at different locations by species and tissue, and the tools for analyzing it are likewise developed and maintained separately at different sites according to the purpose of analysis. Because the storage, analysis, and retrieval functions for EST data thus operate individually in different environments, they are difficult to use.

The technical problem to be solved by the present invention is to provide a client/server-based EST analysis system that efficiently integrates gene sequence databases and sequence analysis tools that currently exist separately.

Another technical problem to be solved by the present invention is to provide a client/server-based EST analysis system that extracts similar candidate gene and protein domain information from sequence search result data obtained through sequence ID search, keyword search, and function category keyword search, and that can infer the function of an EST sequence from the extracted data.

Another technical problem to be solved by the present invention is to provide a system that makes verification of large volumes of data convenient and allows the progress of bulk analysis to be checked, through a hit list and a history map, which are overall result views for effectively reviewing the results of bulk data analysis.

Another technical problem to be solved by the present invention is to provide a data management system consisting of project management, database management, user management, and password management, so that large volumes of data can be managed per user project, per database, and per user.

Another technical problem to be solved by the present invention is to provide a computer-readable recording medium on which a program for executing the above method on a computer is recorded.

FIG. 1 is a block diagram of a client/server-based EST analysis system according to a preferred embodiment of the present invention.

FIGS. 2 and 3 are block diagrams showing the information stored in the EST sequence and analysis result database shown in FIG. 1 and the relationships among that information.

FIG. 4 shows the menus and submenus of the client interface used to run the EST sequence analysis program executed on the EST sequence analysis server shown in FIG. 1.

FIG. 5 shows the database creation screen displayed when the database creation menu shown in FIG. 4 is selected.

FIG. 6 shows the data input screen displayed when the data input menu shown in FIG. 4 is selected.

FIG. 7 shows the BLAST search screen displayed when the BLAST search menu shown in FIG. 4 is selected.

FIGS. 8 and 9 show, respectively, the BLAST search results obtained from the screen of FIG. 7 and their alignment results.

FIG. 10 shows the translation results produced when the translation (TRANSLATION) menu shown in FIG. 4 is selected.

FIGS. 11 and 12 show the results of the PROSITE search performed when the PROSITE menu shown in FIG. 4 is selected, and the detailed information for those results.

FIG. 13 shows the results of the PRINTS search performed when the PRINTS menu shown in FIG. 4 is selected.

FIG. 14 shows the RPS-BLAST search screen displayed when the RPS-BLAST menu shown in FIG. 4 is selected.

FIGS. 15 and 16 show the RPS-BLAST search results obtained from the screen of FIG. 14 and their alignment results.

FIG. 17 is a flowchart showing a method for EST sequence analysis and annotation database construction according to a preferred embodiment of the present invention.

FIGS. 18 and 19 show, respectively, the annotation data search screen displayed when the annotation data search menu, a submenu of the ID search menu shown in FIG. 4, is selected, and its search results.

FIGS. 20 and 21 show the keyword search screen displayed when the keyword search menu shown in FIG. 4 is selected, and its search results.

FIGS. 22 and 23 show the function category keyword search screen displayed when the function category keyword search menu shown in FIG. 4 is selected, and its search results.

FIG. 24 shows the hit list screen displayed when the hit list search menu shown in FIG. 4 is selected.

FIG. 25 shows the Remarkable Hit search screen displayed when the Remarkable Hit search menu shown in FIG. 4 is selected.

FIG. 26 shows the history map screen displayed when the history map menu shown in FIG. 4 is selected.

FIG. 27 is a flowchart showing a sequence search method according to a preferred embodiment of the present invention.

To achieve the above objects, an EST analysis system according to the present invention comprises: an input/output manager that receives EST (Expressed Sequence Tag) sequence data produced by experiment from a user; a sequence input unit that converts the EST sequence data into a predetermined format and stores it in a first database; an annotation analysis unit that performs a similarity search and a protein domain search between the EST sequence data stored in the first database and sequence data stored in a second database holding a large volume of verified gene and protein sequence data, and stores the search results in the first database; and an analysis result search unit that searches the data stored in the first database in response to a search key entered by the user.

To achieve the above objects, an EST sequence analysis method according to the present invention comprises the steps of: (a) receiving EST (Expressed Sequence Tag) sequence data produced by experiment from a user; (b) converting the EST sequence data into a predetermined format and storing it in a first database; (c) performing a similarity search and a protein domain search between the EST sequence data stored in the first database and data stored in a second database holding a large volume of verified gene and protein sequence data, and storing the search results in the first database; and (d) searching the data stored in the first database in response to a search key entered by the user.

To achieve the above objects, an EST sequence analysis and database construction method according to the present invention comprises the steps of: (a) receiving EST (Expressed Sequence Tag) sequence data produced by experiment from a user and constructing a first database; (b) performing a similarity search between the EST sequence data stored in the first database and data stored in a second database holding a large volume of verified gene and protein sequence data, and a protein domain search on the EST sequence; (c) determining, based on the sequence similarity search results and the protein domain search results obtained in step (b), whether the EST sequence has been identified with any EST sequence stored in the second database; (d) if the EST sequence has been identified, parsing the required information from the gene entry of the second database corresponding to the EST sequence, and storing the parsed result in the first database; and (e) storing, in a history table of the first database, whether steps (a) and (b) have been performed.

To achieve the above objects, an EST sequence search method according to the present invention comprises the steps of: (a) performing either an ID search or a keyword search on a first database storing EST sequence analysis results, in response to a search key entered by a user, and extracting the gene information and protein domain information corresponding to the retrieved EST sequence data; (b) selecting either a Remarkable Hit search or a function category keyword search as an advanced search mode; (c) if the Remarkable Hit search is selected in step (b), extracting and displaying the top-ranked results from among the results extracted in step (a); and (d) if the function category keyword search is selected in step (b), performing a search by function on the categories to which the results extracted in step (a) belong, and displaying the search results.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram of a client/server-based EST analysis system according to a preferred embodiment of the present invention. Referring to FIG. 1, the EST analysis system according to the present invention consists of at least one client (100) connected to a network (10) and an EST analysis server (200) that provides an EST analysis service. The EST analysis server (200) includes an EST analysis unit (210), an EST sequence and analysis result database (220), and a reference database (230).

The EST sequence and analysis result database (220) stores EST sequence data produced by experiment together with sequence similarity search results and protein domain search results. It consists of an EST sequence database (221) that stores the EST sequence data produced by experiment, an annotation database (222) that stores the sequence similarity search results and protein domain search results for each EST sequence, and a management database (223) that stores the information needed for project management, user management, and database management.

The reference database (230) is a gene sequence database storing a large volume of verified gene and protein sequences. It is broadly divided into a BLAST (Basic Local Alignment Search Tool) search database (231) and a domain search database (232).

The BLAST search database (231) is a database built in a fixed format (for example, formatDB format). Databases that carry gene sequence information and gene function information, such as UniGene, StackDB, RefSeq, and TIGR, can be used, and BLAST search databases matching the species of the input sequences can be added.

The domain search database (232) stores information on individual protein domains and consists of a PROSITE (Database of protein families and domains) database (233), a PRINTS (Protein Motif Fingerprint) database (234), and a PFAM (Protein families)/SMART (Simple Modular Architecture Research Tool) database (235). The PROSITE database (233) is a protein database provided by the ExPASy WWW server of the SIB (Swiss Institute of Bioinformatics). It was built by finding biologically meaningful patterns in the protein sequences of the existing Swiss-Prot database and collecting them by pattern, and it is used to predict the function of new protein sequences. The PRINTS database (234) is a database of 'fingerprint' multiple alignments (ordered blocks) derived from the OWL protein database. The PFAM/SMART database (235) is a database of families and modules built on the Swiss-Prot/TrEMBL protein databases; it is used to align new protein sequences and confirm their function using HMM (hidden Markov model) PSSMs obtained from seed alignments. These databases (233, 234, 235) differ in the form in which the data is represented: plain text patterns, multiple alignments, profiles, or hidden Markov models (HMMs).

The EST analysis unit (210) consists of an input/output manager (240), a sequence input unit (250), a sequence annotation analysis unit (260), a protein annotation analysis unit (270), and an analysis result search unit (280). The EST analysis unit (210) performs a similarity search (that is, a BLAST search) and a protein domain search on the EST sequence data received from the input/output manager (240) and, based on the search results, determines whether that EST sequence data has been identified. If the EST sequence data has been identified, the information on the similar gene obtained by the search is parsed and stored in the EST sequence and analysis result database (220). The data stored in the EST sequence and analysis result database (220) is then searched through the analysis result search unit (280). The detailed configuration and operation of the EST analysis unit (210) are as follows.

The input/output manager (240) receives EST sequence data from the client (100) so that the EST analysis unit (210) can perform EST analysis, and outputs the progress and results of the EST sequence analysis performed by the EST analysis unit (210) to the client (100). It also receives the search keys needed to search the EST sequence analysis results, and outputs the EST sequence analysis results matching each search key to the client (100).

EST sequence data entered through the input/output manager (240) may be in the ABI (Application Binary Interface) file format or the FASTA file format. The FASTA file format is the most commonly used input format among bioinformatics analysis programs and is used to represent the name (or description) of a particular sequence and the content of the sequence itself.
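As a rough illustration of the FASTA format described above, the following Python sketch reads records consisting of a `>` header line followed by one or more sequence lines. The function name `parse_fasta` and the choice to key records by the first word of the header are assumptions made for this example; the patent does not specify how the sequence input unit parses its input.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {sequence name: sequence}.

    Hypothetical helper for illustration only. A header line starts
    with '>' and its first word is used as the record name; the lines
    that follow, up to the next header, are joined into one sequence.
    """
    records: Dict[str, List[str]] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            words = line[1:].split()
            # Fall back to a generated name if the header is empty.
            name = words[0] if words else f"seq{len(records) + 1}"
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {key: "".join(parts) for key, parts in records.items()}


example = [">est1 clone A", "ATGC", "ggta", ">est2", "TTAA"]
print(parse_fasta(example))
```

Keying by the first header word keeps the rest of the header free for a description, which matches the "name (or description)" role the text gives to the header line.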

The sequence input unit (250) reads an ABI file (251) entered through the input/output manager (240) and converts it into a sequence, then converts that sequence or a FASTA sequence (252) into a predetermined data format and stores it in the EST sequence database (221). It also checks the EST sequences for redundancy and clusters EST sequences that can be joined, using the PHRAP program. In addition, the sequence input unit (250) stores the project management information and user management information entered through the input/output manager (240) in the management database (223).

The sequence annotation analysis unit (260) receives the sequence information (Seq) stored in the EST sequence database (221) and performs similarity searches such as BLASTn and BLASTx on the EST sequence.

BLASTn is a BLAST search that compares nucleotide sequences with each other, and BLASTx is a BLAST search that translates the input nucleotide sequence in six frames and compares the translations with the reference database (230). By performing these similarity searches, the sequence annotation analysis unit (260) determines which known gene sequence in the reference database (230) the EST sequence resembles, and stores the result (gene) in the annotation database (222).
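The six-frame translation that BLASTx applies to a nucleotide query can be sketched as follows: three reading frames on the given strand and three on its reverse complement. This is a minimal illustration using the standard genetic code, not BLAST's own implementation; the function names are hypothetical, and codons containing characters other than A, C, G, T are rendered as `X` by assumption.

```python
from itertools import product
from typing import List

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons from position 0; '*' marks a stop codon."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")  # 'X' for ambiguous codons
        for i in range(0, len(seq) - 2, 3)
    )


def six_frame(seq: str) -> List[str]:
    """Frames 1-3 on the forward strand, then 4-6 on the reverse strand."""
    strands = (seq.upper(), reverse_complement(seq))
    return [translate(s[offset:]) for s in strands for offset in range(3)]


print(six_frame("ATGGCC"))  # ['MA', 'W', 'G', 'GH', 'A', 'P']
```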

The protein annotation analysis unit (270) receives the sequence information (Seq) stored in the EST sequence database (221) and, using the domain search database (232) included in the reference database (230), performs a protein domain regular expression search (PROSITE), a protein domain fingerprint pattern search (PRINTS), and a protein profile search (PFAM/SMART) on the EST sequence. It then stores the domain search result (domain) for the EST sequence in the annotation database (222).
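A regular expression search of the PROSITE kind can be illustrated by converting a PROSITE pattern into a Python regular expression and scanning a protein sequence with it. The sketch below covers only the common pattern elements (`x`, `[...]`, `{...}`, repetition counts, and the `<` and `>` anchors); the function names are hypothetical, and this is not the code used by the patent or by the PROSITE tools.

```python
import re
from typing import List, Tuple


def prosite_to_regex(pattern: str) -> str:
    """Convert a PROSITE pattern such as 'N-{P}-[ST]-{P}' to a regex."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.group(1), match.group(2), match.group(3)
        if core == "x":                  # any residue
            body = "."
        elif core.startswith("{") and core.endswith("}"):
            body = "[^" + core[1:-1] + "]"   # any residue except these
        else:
            body = core                  # single residue or [ABC] class
        if low:                          # repetition: x(2) or x(2,4)
            body += "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + body + suffix)
    return "".join(parts)


def find_motifs(pattern: str, protein: str) -> List[Tuple[int, str]]:
    """Return (0-based start, matched text) for non-overlapping matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(protein.upper())]


# The N-glycosylation site pattern (PROSITE entry PS00001).
print(prosite_to_regex("N-{P}-[ST]-{P}"))      # N[^P][ST][^P]
print(find_motifs("N-{P}-[ST]-{P}", "AANGSA"))  # [(2, 'NGSA')]
```

Because `re.finditer` reports non-overlapping matches only, overlapping motif occurrences would need a lookahead-based scan; that refinement is left out to keep the sketch short.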

When the database is built, the EST sequence and analysis result database (220) does not copy and store the data held in the reference database (230) itself. Instead, the search results are parsed and only the required information (for example, entry number, gene title, E-value (Expect value), and score) is extracted and stored. Data storage and retrieval are therefore efficient. The data model of the EST sequence and analysis result database (220) is described in detail with reference to FIGS. 2 and 3.
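The idea of keeping only a few fields from each hit can be sketched with BLAST's 12-column tab-separated output, in which column 2 is the subject identifier, column 11 the E-value, and column 12 the bit score. The patent does not say which BLAST output format is parsed, so the use of tabular output, the `Hit` record, and the default E-value cutoff of 1e-5 are all assumptions for this example. Tabular output carries no gene title, so that field is omitted here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    """The subset of a BLAST hit that is kept (illustrative only)."""
    query_id: str
    subject_id: str
    evalue: float
    bit_score: float


def parse_tabular_hits(text: str, max_evalue: float = 1e-5) -> List[Hit]:
    """Extract selected fields from 12-column tabular BLAST output.

    Lines starting with '#' (comments) and short lines are skipped.
    The cutoff `max_evalue` is an assumed, adjustable threshold.
    """
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 12:
            continue
        hit = Hit(cols[0], cols[1], float(cols[10]), float(cols[11]))
        if hit.evalue <= max_evalue:
            hits.append(hit)
    return hits


sample = (
    "est1\tgi|123\t98.5\t200\t3\t0\t1\t200\t5\t204\t2e-50\t380\n"
    "est1\tgi|456\t80.0\t50\t10\t0\t1\t50\t1\t50\t0.5\t30.1\n"
)
for hit in parse_tabular_hits(sample):
    print(hit.subject_id, hit.evalue, hit.bit_score)
```

Storing only these fields, rather than the reference entry itself, is what keeps the annotation database small; the full entry can still be looked up in the reference database by its identifier.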

In response to a search key received from the input/output manager (240), the analysis result search unit (280) retrieves the EST data matching the search key from the EST sequence and analysis result database (220) and passes the retrieved data to the client (100) through the input/output manager (240).

In this way, the EST analysis unit (210) of the EST analysis server (200) handles EST analysis and search queries, while the client (100) monitors the progress of the EST analysis performed on the EST analysis server (200) and receives both the overall and the detailed search results over the network (10). After entering a large number of EST sequences into the system, the user can carry out the required searches conveniently and simply through a GUI (Graphic User Interface), and can view the search results on graphical screens that make them easy to analyze.

As described above, the EST analysis system according to the present invention places the required search target database (that is, the EST sequence and analysis result database (220)) and analysis tools (that is, the reference database (230)) on the server (200), and the client (100) selectively searches the required databases through the GUI. Search results are stored on the server as a relational database (RDB) and used during analysis. All related data is kept local (localization), which strengthens the security of research results. Because the databases and search tools needed for EST analysis are installed only on the server (200), all analysis is performed within the server (200), and the load on the client (100) is reduced.

FIGS. 2 and 3 are block diagrams showing the information stored in the EST sequence and analysis result database (220) shown in FIG. 1 and the relationships among that information. FIG. 2 shows the detailed structure of the EST sequence database (221) and the annotation database (222), and FIG. 3 shows the detailed structure of the management database (223).

Referring first to FIG. 2, the EST sequence database (221), which stores the EST sequence information produced by experiment, includes an ABI sequence management information table (2211, hereinafter the ABI table) and a sequence clone management information table (2212, hereinafter the CLONE table). The ABI table (2211) stores ABI sequence-related information for managing the original EST sequences produced by sequencing experiments. The CLONE table (2212) stores, separately, the contigs obtained after the sequences are assembled, the reads that make up each contig, and the singlets. A contig is the sequence that results from assembling redundant ESTs and EST reads that can be joined. The information stored in the ABI table (2211) is shown in [Table 1] below.

[Table 1]
Column name    Type name    Length    Nulls
ABI_ID         VARCHAR      40        NO
SEQ_LEN        INT          10        YES
SEQ            TEXT                   YES
CONTIG_NO      VARCHAR      20        YES
START_POS      INT          10        YES
END_POS        INT          10        YES
TRIM           ENUM                   YES
TRIMPOS        VARCHAR      12        YES
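The column layout of [Table 1] can be sketched as a table definition in an in-memory SQLite database. The patent does not name the database engine or list the permitted ENUM values, so SQLite, the mapping of ENUM to plain TEXT, and the helper names `create_abi_table` and `insert_read` are assumptions made only for illustration.

```python
import sqlite3

# Column names, lengths, and nullability follow [Table 1].
# ENUM is mapped to TEXT because its allowed values are not given.
ABI_SCHEMA = """
CREATE TABLE ABI (
    ABI_ID     VARCHAR(40) NOT NULL,  -- ABI clone number
    SEQ_LEN    INT,                   -- sequence length
    SEQ        TEXT,                  -- sequence
    CONTIG_NO  VARCHAR(20),           -- contig number
    START_POS  INT,                   -- start of the read in the contig
    END_POS    INT,                   -- end of the read in the contig
    "TRIM"     TEXT,                  -- whether low-quality regions were cut
    TRIMPOS    VARCHAR(12)            -- trim position on the sequence
)
"""


def create_abi_table(conn: sqlite3.Connection) -> None:
    """Create the illustrative ABI table."""
    conn.execute(ABI_SCHEMA)


def insert_read(conn: sqlite3.Connection, abi_id: str, seq: str) -> None:
    """Store one read, deriving SEQ_LEN from the sequence itself."""
    conn.execute(
        "INSERT INTO ABI (ABI_ID, SEQ_LEN, SEQ) VALUES (?, ?, ?)",
        (abi_id, len(seq), seq),
    )


conn = sqlite3.connect(":memory:")
create_abi_table(conn)
insert_read(conn, "abi001", "ATGCGGTA")
print(conn.execute("SELECT ABI_ID, SEQ_LEN FROM ABI").fetchall())
```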

Referring to [Table 1], the ABI table 2211 contains the ABI clone number (ABI_ID), the sequence length (SEQ_LEN), the sequence (SEQ), the contig number (CONTIG_NO), the start position of the reads that make up the contig (START_POS), the end position of the reads that make up the contig (END_POS), whether the low-quality region of the sequence has been trimmed (TRIM), and the trim position on the sequence (TRIMPOS). [Table 2] below shows the information stored in the CLONE table 2212.

[Table 2]

| Column name | Type name | Length | Nulls |
|---|---|---|---|
| NAME | VARCHAR | 20 | YES |
| LIBRARY | VARCHAR | 50 | YES |
| DB_LINK | VARCHAR | 10 | YES |
| ORGANISM | VARCHAR | 50 | YES |
| SEQ_LEN | INT | 10 | YES |
| SEQ | TEXT | | YES |
| ID | VARCHAR | 40 | NO |
| KNOWN | SET | | YES |
| REF_ID | VARCHAR | 20 | YES |
| PATTERN | ENUM | | YES |
| PRINTS | ENUM | | YES |
| TRIMPOS | VARCHAR | 12 | YES |
| CONTIG | ENUM | | YES |
| TRANS1 | TEXT | | YES |
| TRANS2 | TEXT | | YES |
| TRANS3 | TEXT | | YES |
| TRANS4 | TEXT | | YES |
| TRANS5 | TEXT | | YES |
| TRANS6 | TEXT | | YES |
| ESTSCAN | TEXT | | YES |
| ABI_ID | VARCHAR | 40 | YES |
| RPSBLAST | ENUM | | YES |

Referring to [Table 2], the CLONE table 2212 contains the name of the EST sequence (NAME), the source cDNA library of the EST (LIBRARY), the list of externally linked databases (DB_LINK), the species of the EST sequence (ORGANISM), the sequence length (SEQ_LEN), the sequence (SEQ), the unique ID of the sequence (ID), the list of names of the search databases in which a BLAST search produced a hit (KNOWN), the unique number of the literature related to the sequence (REF_ID), whether a PROSITE search result exists (PATTERN), whether a PRINTS search result exists (PRINTS), the trim position (TRIMPOS), whether the entry is a contig (CONTIG), the amino acid sequences translated in frame 1 through frame 6 (TRANS1, TRANS2, TRANS3, TRANS4, TRANS5, and TRANS6, respectively), the amino acid sequence obtained by EST Scan (ESTSCAN), the ABI number of the EST clone, which links the entry to the ABI table 2211 (ABI_ID), and whether an RPS-BLAST (Reverse Position-Specific BLAST) search result exists (RPSBLAST). Here, RPS-BLAST is a type of BLAST search: a program that uses protein profile information to search for protein domain regions.
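The link between the two tables through ABI_ID can be illustrated with a minimal sketch. It assumes SQLite (via Python's standard `sqlite3` module) in place of whatever database engine the system actually uses, maps the ENUM column to TEXT, and shows only a subset of the CLONE columns; the sample rows and the helper names `build_demo_db` and `clone_with_abi` are hypothetical.

```python
import sqlite3

# Simplified SQLite rendering of [Table 1] and part of [Table 2].
# Assumption: ENUM columns become TEXT; the real engine and constraints may differ.
SCHEMA = """
CREATE TABLE ABI (
    ABI_ID     VARCHAR(40) NOT NULL PRIMARY KEY,
    SEQ_LEN    INTEGER,
    SEQ        TEXT,
    CONTIG_NO  VARCHAR(20),
    START_POS  INTEGER,
    END_POS    INTEGER,
    "TRIM"     TEXT,
    TRIMPOS    VARCHAR(12)
);
CREATE TABLE CLONE (
    ID         VARCHAR(40) NOT NULL PRIMARY KEY,
    NAME       VARCHAR(20),
    SEQ_LEN    INTEGER,
    SEQ        TEXT,
    CONTIG     TEXT,
    ABI_ID     VARCHAR(40) REFERENCES ABI(ABI_ID)
);
"""


def build_demo_db():
    """Create an in-memory database holding one made-up read and clone."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        'INSERT INTO ABI (ABI_ID, SEQ_LEN, SEQ, "TRIM") VALUES (?, ?, ?, ?)',
        ("abi_0001", 12, "ATGGCCAAGTAA", "Y"),
    )
    conn.execute(
        "INSERT INTO CLONE (ID, NAME, SEQ_LEN, SEQ, CONTIG, ABI_ID) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        ("clone_0001", "est_demo", 12, "ATGGCCAAGTAA", "N", "abi_0001"),
    )
    conn.commit()
    return conn


def clone_with_abi(conn, clone_id):
    """Follow CLONE.ABI_ID back to the original read in the ABI table."""
    return conn.execute(
        'SELECT c.ID, c.NAME, a.ABI_ID, a."TRIM" '
        "FROM CLONE c JOIN ABI a ON a.ABI_ID = c.ABI_ID WHERE c.ID = ?",
        (clone_id,),
    ).fetchone()
```

Calling `clone_with_abi(build_demo_db(), "clone_0001")` returns the clone together with the ABI record it was derived from, and an unknown ID returns `None`.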

The annotation database 222, which stores the sequence similarity search results and protein domain search results for the EST sequence data, includes a BLAST management information table 2221 (hereinafter, the HIT_BLAST table), a sequence alignment management information table 2222 (hereinafter, the ALIGN table), a PROSITE management information table 2223 (hereinafter, the PROSITE table), a PRINTS management information table 2226 (hereinafter, the PRINT table), and a reference management information table 2227 (hereinafter, the REFERENCE table).

The HIT_BLAST table 2221 stores the BLAST search results whose E-value is at or below the value specified by the user (that is, the EST sequence data for which similarity was found). The ALIGN table 2222 stores the individual HSP (high-scoring segment pair) information for each BLAST search result together with the sequence alignment result. Here, sequence alignment means the process of lining up two or more sequences against one another so that identity is maximized, in order to assess the likelihood of homology and the degree of similarity.
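The E-value cutoff described above amounts to a simple filter. The sketch below is illustrative only: the `BlastHit` class, the `filter_hits` function, and the sample values are hypothetical, and the field names merely echo the ACC_ID, DESCRIPTION, and EVALUE columns of the HIT_BLAST table.

```python
from dataclasses import dataclass


@dataclass
class BlastHit:
    acc_id: str        # corresponds to ACC_ID
    description: str   # corresponds to DESCRIPTION
    evalue: float      # corresponds to EVALUE


def filter_hits(hits, max_evalue):
    """Keep only hits whose E-value is at or below the user-specified cutoff."""
    return [hit for hit in hits if hit.evalue <= max_evalue]


hits = [
    BlastHit("acc_a", "made-up strong hit", 1e-40),
    BlastHit("acc_b", "made-up weak hit", 0.8),
]
kept = filter_hits(hits, max_evalue=1e-5)  # only "acc_a" survives
```

A smaller E-value means fewer matches of that quality are expected by chance, so lowering `max_evalue` makes the stored result set stricter.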

The PROSITE table 2223 comprises a pattern management information table 2224 (hereinafter, the PATTERN table) and a motif management information table 2225 (hereinafter, the MOTIF table), and stores the results of the protein domain regular-expression search. The PATTERN table 2224 stores the pattern information obtained from the regular-expression search, and the MOTIF table 2225 stores the motif site information obtained from that search. The PRINT table 2226 stores the protein domain fingerprint results. The REFERENCE table 2227 stores management information on the reference database that holds the data stored in the CLONE table 2212. The information stored in the HIT_BLAST table 2221 is shown in [Table 3] below.

[Table 3]

| Column name | Type name | Length | Nulls |
|---|---|---|---|
| ACC_ID | VARCHAR | 50 | NO |
| DESCRIPTION | TEXT | | YES |
| ALIGN_NUM | INT | 10 | YES |
| DBNAME | VARCHAR | 20 | YES |
| PROGRAM | VARCHAR | 10 | YES |
| CLONE_ID | VARCHAR | 40 | NO |
| GI | VARCHAR | 250 | YES |
| GB | VARCHAR | 250 | YES |
| EMB | VARCHAR | 250 | YES |
| DBJ | VARCHAR | 250 | YES |
| PIR | VARCHAR | 250 | YES |
| PRF | VARCHAR | 250 | YES |
| SP | VARCHAR | 250 | YES |
| PDB | VARCHAR | 250 | YES |
| PAT | VARCHAR | 250 | YES |
| BBS | VARCHAR | 250 | YES |
| GNL | VARCHAR | 250 | YES |
| REF | VARCHAR | 250 | YES |
| LCL | VARCHAR | 250 | YES |
| TISSUE | VARCHAR | 15 | YES |
| ORGANISM | VARCHAR | 50 | YES |
| EVALUE | DOUBLE | | YES |
| HITRPS | MEDIUMBLOB | | YES |

Referring to [Table 3], the HIT_BLAST table 2221 contains the unique number of the BLAST hit (ACC_ID), the title of the BLAST hit (DESCRIPTION), the number of HSPs (ALIGN_NUM), the name of the database used for the BLAST search (DBNAME), the type of BLAST search program (PROGRAM), the EST sequence ID (CLONE_ID), the GenBank sequence identifier associated with the BLAST hit (GI), the GenBank accession number (GB), the identifier in the European sequence database, EMBL (European Molecular Biology Laboratory) (EMB), the identifier in the Japanese sequence database, DDBJ (DNA Data Bank of Japan) (DBJ), the PIR (Protein Information Resource) identifier (PIR), the PRF (Protein Research Foundation) name (PRF), the SWISS-PROT identifier (SP), the PDB (Brookhaven Protein Data Bank) identifier (PDB), the patent number of the sequence (PAT), the GenInfo Backbone identifier (BBS), the general database identifier (GNL), the literature reference number (REF), the local sequence identifier (LCL), the tissue name (TISSUE), the species name (ORGANISM), the E-value of the BLAST hit (EVALUE), and the alignment information obtained from the RPS-BLAST (Reverse Position-Specific BLAST) search (HITRPS). [Table 4] shows the data stored in the ALIGN table 2222.
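The identifier columns GI through LCL carry the same tags that appear in pipe-delimited NCBI-style sequence identifiers such as `gi|...|sp|...|...`. The description does not say how these columns are filled, so the following is only a sketch of one plausible approach: the function `parse_defline_ids` is hypothetical, the identifiers in the usage lines are illustrative, and any value that happens to equal a tag name would be misread by this simplified parser.

```python
# Database tags matching the identifier columns of the HIT_BLAST table.
DB_TAGS = ("gi", "gb", "emb", "dbj", "pir", "prf", "sp",
           "pdb", "pat", "bbs", "gnl", "ref", "lcl")


def parse_defline_ids(seq_id):
    """Split a pipe-delimited identifier into {COLUMN: value} pairs.

    Every token after a known tag is assigned to that tag until the next
    tag appears; multi-field values are re-joined with "|".
    """
    columns = {}
    current = None
    for token in seq_id.split("|"):
        if token in DB_TAGS:
            current = token.upper()
            columns[current] = []
        elif current is not None and token:
            columns[current].append(token)
    return {tag: "|".join(values) for tag, values in columns.items()}


ids = parse_defline_ids("gi|129295|sp|P01013|OVAX_CHICK")
# ids maps "GI" to "129295" and "SP" to "P01013|OVAX_CHICK"
```

Columns absent from the returned dictionary would simply stay NULL, which is consistent with all of these columns being nullable in [Table 3].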

[Table 4]

| Column name | Type name | Length | Nulls |
|---|---|---|---|
| SUBJECT_LEN | INT | 10 | YES |
| EVALUE | DOUBLE | | YES |
| SCORE | FLOAT | | YES |
| QUERY_START | INT | 10 | YES |
| SUBJECT_START | INT | 10 | YES |
| FRAME | ENUM | | NO |
| IDENTITY | INT | 5 | YES |
| POSITIVE | INT | 5 | YES |
| NGAP | INT | 5 | YES |
| QUE | BLOB | | YES |
| MAT | BLOB | | YES |
| SBJ | BLOB | | YES |
| HSP | INT | | YES |
| ACC_ID | VARCHAR | 50 | YES |
| ALIGN_ID | VARCHAR | 60 | NO |
| STRAND | VARCHAR | 20 | YES |

Referring to [Table 4], the ALIGN table 2222 contains the length of the BLAST HSP (SUBJECT_LEN), the E-value of the HSP alignment (EVALUE), the score of the HSP alignment (SCORE), the start position on the query (QUERY_START), the start position of the HSP alignment (SUBJECT_START), the frame of the BLASTX search result (FRAME), the number of identities in the HSP (IDENTITY), the number of positives in the HSP (POSITIVE), the number of gaps in the HSP (NGAP), the query sequence of the aligned region of the HSP (QUE), the match line of the HSP alignment (MAT), the subject sequence of the aligned region of the HSP (SBJ), the HSP length (HSP), the unique number of the BLAST hit (ACC_ID), the unique number of the BLAST alignment (ALIGN_ID), and the alignment strand of the BLASTN result (STRAND). [Table 5] shows the data stored in the PATTERN table 2224, which holds the pattern information obtained from the protein domain regular-expression search.

[Table 5]

| Column name | Type name | Length | Nulls |
|---|---|---|---|
| PATTERN_ID | VARCHAR | 60 | NO |
| NAME | VARCHAR | 50 | YES |
| DESCRIPTION | TEXT | | YES |
| MATCH_NO | INT | 5 | YES |
| SEQ_LEN | INT | 10 | YES |
| CLONE_ID | VARCHAR | 40 | NO |
| FULL | MEDIUMTEXT | | YES |

Referring to [Table 5], the PATTERN table 2224 contains the unique number of the regular-expression pattern (PATTERN_ID), the PROSITE name (NAME), the detailed PROSITE description (DESCRIPTION), the number of motifs contained in the PROSITE pattern (MATCH_NO), the length of the searched protein sequence (SEQ_LEN), the unique number of the EST sequence (CLONE_ID), and the full description of all PROSITE patterns found in the searched protein (FULL). [Table 6] shows the data stored in the MOTIF table 2225, which holds the multiple motifs contained in a single regular-expression search result.

[Table 6]

| Column name | Type name | Length | Nulls |
|---|---|---|---|
| MOTIF_ID | VARCHAR | 70 | NO |
| START_MATCH | INT | 10 | YES |
| END_MATCH | INT | 10 | YES |
| MOTIF_LEN | INT | 10 | YES |
| MOTIF_CON | TINYTEXT | | YES |
| PATTERN_ID | VARCHAR | 60 | NO |

Referring to [Table 6], the MOTIF table 2225 stores the unique number of the motif (MOTIF_ID), the motif start position (START_MATCH), the motif end position (END_MATCH), the motif length (MOTIF_LEN), the motif content (MOTIF_CON), and the unique number of the regular-expression pattern (PATTERN_ID). [Table 7] shows the data stored in the PRINTS table 2226, which holds the protein domain fingerprint results.

[Table 7]

| Column name | Type name | Length | Nulls |
|---|---|---|---|
| PRINT_ID | VARCHAR | 60 | NO |
| PRINT_ACC | VARCHAR | 10 | YES |
| PRINT_NAME | VARCHAR | 50 | YES |
| PRINT_DESC | VARCHAR | 250 | YES |
| FULL | MEDIUMTEXT | | YES |
| CLONE_ID | VARCHAR | 40 | NO |

Referring to [Table 7], the PRINTS table 2226 stores the unique number of the protein domain fingerprint (PRINT_ID), the fingerprint accession number (PRINT_ACC), the fingerprint name (PRINT_NAME), the detailed description for each fingerprint class (PRINT_DESC), the full description of the fingerprint (FULL), and the unique number of the EST sequence (CLONE_ID).

In addition, the REFERENCE table 2227 stores data related to the reference database 230, which is linked to the EST sequence database 221, the annotation database 222, and the management database 223 (for example, the unique number of the literature related to a sequence (REF_ID) and the gene title (TITLE)).

Referring to FIG. 3, the management database 223 includes a project management information table 2231 (hereinafter, the PROJECT table), a user management information table 2232 (hereinafter, the USER table), and a history management information table 2233 (hereinafter, the HISTORY table). The PROJECT table 2231 stores project management information, the USER table 2232 stores the project name (PROJECTNAME), and the user information and history information for each project are stored in the USER table 2232 and the HISTORY table 2233, respectively. The USER table 2232 stores the user ID (USERID), each user's password (PASSWD), the project name (PROJECTNAME), and the database access permission (PERMISSION). The HISTORY table 2233 stores the name of the database created by the user (DBNAME), the project name (PROJECTNAME), whether data has been input (DBINPUT), whether a BLAST search has been run (BLAST), whether translation has been performed (TRANSLATION), whether a PROSITE search has been run (PROSITE), whether a PRINTS search has been run (PRINTS), whether an RPS-BLAST search has been run (RPSBLAST), whether a functional category file has been generated (CATEGORY), and whether a Remarkable Hit file has been generated (REMARK). In this way, the EST analysis server 200 according to the present invention manages projects, and the researchers within each project, through a separately configured database.
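One way to picture the HISTORY table is as a row of yes/no flags per database, from which the outstanding analyses can be read off. The sketch below assumes the flags are plain booleans (the actual column types are not given in the description); `HistoryRow` and `pending_steps` are hypothetical names.

```python
from dataclasses import dataclass, fields


@dataclass
class HistoryRow:
    DBNAME: str
    PROJECTNAME: str
    DBINPUT: bool = False
    BLAST: bool = False
    TRANSLATION: bool = False
    PROSITE: bool = False
    PRINTS: bool = False
    RPSBLAST: bool = False
    CATEGORY: bool = False
    REMARK: bool = False


def pending_steps(row):
    """Return the names of the flag columns that are still unset."""
    return [
        f.name
        for f in fields(row)
        if isinstance(getattr(row, f.name), bool) and not getattr(row, f.name)
    ]


row = HistoryRow("est_db_demo", "demo_project", DBINPUT=True, BLAST=True)
todo = pending_steps(row)  # everything after the BLAST step is still open
```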

FIG. 4 shows the menus, and their submenus, on the client interface that runs the EST sequence analysis program executed on the EST sequence analysis server 200 of FIG. 1. Referring to FIG. 4, the functions performed by the EST sequence analysis server 200 according to the present invention fall broadly into a sequence/protein annotation analysis function, a search function, an overall result viewing function, and a data management function.

To perform the annotation analysis function, the EST sequence database 221 that will hold the EST sequence data is first created, and the sequence data obtained by sequencing is then entered. The annotation analysis function is carried out as follows.

FIG. 5 shows the database creation screen displayed when the database creation menu of FIG. 4 is selected, and FIG. 6 shows the data input screen displayed when the data input menu of FIG. 4 is selected. Referring to FIG. 5, when the user enters a desired name on the database creation screen, an EST sequence database 221 with that name is created. Once the EST sequence database 221 has been created, the user selects, as shown in FIG. 6, the database into which data is to be entered from among the EST sequence databases already created, and inputs an ABI file or a FASTA file into the selected database. When an ABI file or FASTA file is received from the user, the sequence input unit 250 of the EST analyzer 210 converts the file into a predetermined format and stores it in the EST sequence database 221.
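For the FASTA case, the conversion performed on input can be sketched as follows. The "predetermined format" is not specified in the description, so this sketch assumes each record is reduced to the ID, SEQ, and SEQ_LEN columns of the CLONE table; `parse_fasta` is a hypothetical helper, and ABI trace files, which are binary, are not handled.

```python
def parse_fasta(text):
    """Turn FASTA text into a list of {"ID", "SEQ", "SEQ_LEN"} records."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records.append((name, "".join(chunks)))
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line.upper())
    if name is not None:
        records.append((name, "".join(chunks)))
    return [{"ID": n, "SEQ": s, "SEQ_LEN": len(s)} for n, s in records]


rows = parse_fasta(">est1 demo read\nATGC\nggta\n>est2\nTTAA\n")
```

Only the first word of each header line is used as the ID here; keeping the rest of the header (for example as NAME) would be an equally reasonable choice.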

Referring again to FIG. 4, the EST sequence analysis server 200 according to the present invention retrieves the sequences stored in the EST sequence database 221 one at a time, performs the sequence similarity (BLAST) search and translation, and then performs the protein domain regular-expression search (PROSITE), the protein domain fingerprint search (PRINTS), and the protein domain profile search (RPS-BLAST). The EST sequence analysis and protein domain analysis are organized as separate analysis modules so that the user can run only the analyses required. The details are as follows.

FIG. 7 shows the BLAST search screen displayed when the BLAST search menu of FIG. 4 is selected, and FIGS. 8 and 9 show, respectively, the result of the BLAST search launched from FIG. 7 and its alignment. Referring to FIGS. 7 to 9, when the user selects on the screen of FIG. 7 the EST sequence database 221 to be searched, the type of BLAST search database, and the E-value, the EST sequence analysis server 200 performs a BLAST search against the BLAST database 231 and obtains BLAST search results such as those in FIGS. 8 and 9. FIG. 8 lists the search results sorted by E-value, highest first, and marks the matched regions graphically. When the user clicks an entry, the detailed information for that entry is retrieved and displayed as in FIG. 9.

FIG. 10 shows the translation result produced when the translation (TRANSLATION) menu of FIG. 4 is selected. The sequence data entered by the user is translated into a predetermined format as in FIG. 10, and each translated sequence is then characterized by the protein domain searches described below.
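The translation step fills the TRANS1 through TRANS6 columns of the CLONE table, which suggests a six-frame translation. The sketch below uses the standard genetic code and assumes that frames 1 to 3 read the forward strand and frames 4 to 6 the reverse complement; the description does not state this numbering, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Return {"TRANS1": ..., ..., "TRANS6": ...} for one nucleotide sequence."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"TRANS{offset + 1}"] = translate(dna[offset:])      # forward strand
        frames[f"TRANS{offset + 4}"] = translate(reverse[offset:])  # reverse strand
    return frames


frames = six_frame_translation("ATGGCCAAGTAA")
```

Stop codons are kept as `*` so that downstream domain searches can decide for themselves how to treat them.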

FIGS. 11 and 12 show the result of the PROSITE search performed when the PROSITE menu of FIG. 4 is selected and its detailed information, and FIG. 13 shows the result of the PRINTS search performed when the PRINTS menu of FIG. 4 is selected. Referring to FIG. 11, when the PROSITE menu is selected in FIG. 4, a protein domain regular-expression search is performed on the data found to be similar genes, producing a result such as that in FIG. 11. The PROSITE search result can be displayed for each translation frame of the EST sequence, and the domain match sites can be shown graphically. The PROSITE search thus makes it possible to predict the function of a new protein sequence. When the PRINTS menu is then selected in FIG. 4, a protein domain fingerprint search is performed on the data found to be similar genes, and the PRINTS search results for the translation frames are listed by class, as in FIG. 13. When the user clicks any entry in the search results, the detailed information for that entry is retrieved and displayed.
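A protein domain regular-expression search can be sketched by converting a PROSITE pattern into a Python regular expression and recording each match in the shape of a MOTIF table row. This is a simplified illustration, not the search program the system uses: it covers only the common pattern elements (`x`, `[...]`, `{...}`, repeat counts, and the `<`/`>` anchors outside brackets), reports non-overlapping matches, and assumes 1-based inclusive positions. The function names and the test protein are hypothetical.

```python
import re


def prosite_to_regex(pattern):
    """Convert a PROSITE pattern such as 'N-{P}-[ST]-{P}.' to a regex string."""
    regex = pattern.rstrip(".").replace("-", "")
    regex = regex.replace("{", "[^").replace("}", "]")   # {P}  -> [^P]
    regex = regex.replace("(", "{").replace(")", "}")    # x(2) -> x{2}
    regex = regex.replace("x", ".")                      # x    -> any residue
    return regex.replace("<", "^").replace(">", "$")     # sequence anchors


def find_motifs(pattern, protein):
    """Return one MOTIF-like record per non-overlapping match."""
    compiled = re.compile(prosite_to_regex(pattern))
    return [
        {
            "START_MATCH": m.start() + 1,  # assumption: 1-based positions
            "END_MATCH": m.end(),
            "MOTIF_LEN": m.end() - m.start(),
            "MOTIF_CON": m.group(),
        }
        for m in compiled.finditer(protein)
    ]


# N-glycosylation site pattern applied to a made-up peptide.
motifs = find_motifs("N-{P}-[ST]-{P}.", "MKNGSAANPTQ")
```

In this example the asparagine at position 8 is followed by proline, so only the site starting at position 3 is reported.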

FIG. 14 shows the RPS-BLAST search screen displayed when the RPS-BLAST menu of FIG. 4 is selected, and FIGS. 15 and 16 show, respectively, the result of the RPS-BLAST search launched from FIG. 14 and its alignment. The RPS-BLAST search result is shown for each translation frame of the EST sequence, and this search is what allows protein domain profiles to be found.

Once both the sequence similarity search and the protein domain search have been performed on the EST sequences as described above, the EST analysis server 200 determines, on the basis of the sequence similarity search result and the protein domain search result for a given EST sequence, which sequence in the reference database 230 the EST sequence has been identified with. If the EST sequence has been identified, the server extracts the required information from the corresponding gene entry in the reference database 230 and stores it in the annotation database 222 of the EST sequence and analysis result database 220. The annotation analysis function performed by the EST analysis server 200, shown in FIG. 4, can be summarized as follows.

FIG. 17 is a flowchart showing a method of analyzing EST sequences and building the annotation database 222 according to a preferred embodiment of the present invention. Referring to FIG. 17, EST sequence analysis and construction of the annotation database 222 according to the present invention consist broadly of a sequence similarity search and a protein domain search.

To build the annotation database 222 through the sequence similarity search, the EST sequence database 221 is first created and then populated with the sequence data obtained by sequencing (step 2500). The sequences stored in the EST sequence database 221 are then retrieved one at a time, and a sequence similarity search (BLAST) is performed against the reference database 230 (step 2600). On the basis of the sequence similarity search result for a given sequence, it is determined whether the EST sequence has been identified with a gene sequence stored in the reference database 230 (step 2900). If so, the sequence similarity search result identified with the EST sequence is analyzed (step 3100), and the analysis result is stored in the annotation database 222.

To build the annotation database 222 through the protein domain search, the sequences stored in the EST sequence database 221 are first translated, and the translation results are stored in the EST sequence database 221 (step 2700). After the translation of step 2700, the protein domain search is performed (step 2710). It consists broadly of the protein domain regular-expression search (PROSITE, step 2720), the protein domain fingerprint search (PRINTS, step 2730), and the protein domain profile search (RPS-BLAST, step 2740).

After the protein domain search of step 2710, it is determined on the basis of the protein domain search result whether the EST sequence has been identified with a protein domain sequence stored in the reference database 230 (step 3000). If the EST sequence has been identified, the protein domain search result identified with it is analyzed (step 3100), and the result is stored in the annotation database 222 (step 3300).
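The two branches of the FIG. 17 flow can be summarized in a short control-flow sketch. Every search is passed in as a callable, because the actual BLAST, PROSITE, PRINTS, and RPS-BLAST invocations are outside the scope of this illustration; `annotate_est` and all of its parameters are hypothetical names, and the step numbers in the comments follow the description above.

```python
def annotate_est(est, similarity_search, translate, domain_searches,
                 is_identified, summarize):
    """Run the FIG. 17 flow for one EST and return the annotations to store."""
    annotations = []

    hits = similarity_search(est)                          # step 2600
    if is_identified(hits):                                # step 2900
        annotations.append(summarize("BLAST", hits))       # step 3100

    proteins = translate(est)                              # step 2700
    for name, search in domain_searches.items():           # steps 2710 to 2740
        matches = search(proteins)
        if is_identified(matches):                         # step 3000
            annotations.append(summarize(name, matches))   # step 3100

    return annotations  # the caller writes these to the annotation database (step 3300)


# Usage with stand-in callables that return canned results.
result = annotate_est(
    "ATGGCC",
    similarity_search=lambda seq: ["hit1"],
    translate=lambda seq: ["MA"],
    domain_searches={"PROSITE": lambda p: [], "PRINTS": lambda p: ["fp1"]},
    is_identified=lambda found: len(found) > 0,
    summarize=lambda source, found: (source, len(found)),
)
```

Keeping each search behind its own callable mirrors the module-by-module organization described earlier, in which the user runs only the analyses required.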

The EST sequence database 221 and annotation database 222 built in this way supply the information displayed to the user when EST sequence data is searched, and provide web links to further detail. Searches using these databases 221 and 222 proceed as follows.

Referring again to FIG. 4, the EST analysis client 100 according to the present invention provides an ID search menu, a keyword search menu, a functional category keyword search menu, and a Remarkable Hit search menu for searching, from several angles, the EST sequence and analysis result database 220 provided by the EST analysis server 200.

FIGS. 18 to 23 show the search screens and search results for the EST sequence database 221 and annotation database 222 for each of the search menus of FIG. 4. FIGS. 18 and 19 show, respectively, the annotation data search screen displayed when the annotation data search menu, a submenu of the ID search menu of FIG. 4, is selected, and its search result. When the user selects the database to be searched in FIG. 18 and enters the ID of the EST sequence of interest, more detailed sequence information, including the sequence similarity search result, is retrieved and displayed as in FIG. 19.

FIGS. 20 and 21 show the keyword search screen displayed when the keyword search menu of FIG. 4 is selected, and its search result. When the user selects the database to be searched in FIG. 20 and enters a keyword, such as a gene title, together with an E-value, the unique IDs, gene titles, and other details of the similar genes are retrieved and displayed, organized by EST sequence ID, as in FIG. 21. Clicking the unique ID of a gene displays the sequence information for that gene.
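A keyword search of this kind maps naturally onto a query over the HIT_BLAST table. The sketch below assumes SQLite through Python's `sqlite3` module, a reduced four-column version of the table, and made-up rows; `keyword_search` and `demo_connection` are hypothetical, and the query the system actually issues is not given in the description.

```python
import sqlite3


def keyword_search(conn, keyword, max_evalue):
    """Find hits whose title contains the keyword, within the E-value cutoff."""
    sql = """
        SELECT CLONE_ID, ACC_ID, DESCRIPTION, EVALUE
        FROM HIT_BLAST
        WHERE DESCRIPTION LIKE ? AND EVALUE <= ?
        ORDER BY CLONE_ID, EVALUE
    """
    return conn.execute(sql, (f"%{keyword}%", max_evalue)).fetchall()


def demo_connection():
    """In-memory table with a few invented hits for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE HIT_BLAST ("
        "ACC_ID TEXT NOT NULL, DESCRIPTION TEXT, "
        "CLONE_ID TEXT NOT NULL, EVALUE REAL)"
    )
    conn.executemany(
        "INSERT INTO HIT_BLAST VALUES (?, ?, ?, ?)",
        [
            ("acc1", "hypothetical kinase A", "est1", 1e-30),
            ("acc2", "hypothetical kinase B", "est1", 0.5),
            ("acc3", "ribosomal protein", "est2", 1e-10),
        ],
    )
    return conn


found = keyword_search(demo_connection(), "kinase", 1e-5)
```

Ordering by CLONE_ID first groups the hits under the EST they belong to, which matches the EST-centred listing described for FIG. 21.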

FIGS. 22 and 23 are diagrams showing the function category keyword search screen that is executed when the function category keyword search menu shown in FIG. 4 is selected, and its search result. When the user selects the database to be searched in FIG. 22 and runs the search, genes are retrieved and displayed for each functional category, as shown in FIG. 23. The function category keyword search is one of the advanced search functions: it searches, by function, the categories to which the genes belong and extracts gene information for each function.

FIG. 24 is a diagram showing the Remarkable Hit search screen that is executed when the Remarkable Hit search menu shown in FIG. 4 is selected. Referring to FIG. 24, the Remarkable Hit search function retrieves and displays only the top-ranked result among the search results and, together with the function category keyword search shown in FIG. 4, belongs to the advanced search functions.
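A minimal sketch of the Remarkable Hit idea follows. The specification says only that the top-ranked result is shown; ranking hits by lowest E-value, the tuple layout, and the name `remarkable_hits` are assumptions made for illustration.

```python
def remarkable_hits(hits):
    """Keep only the best-ranked hit for each EST.

    `hits` is an iterable of (est_id, gene_id, e_value) tuples. "Best" is
    assumed here to mean the lowest E-value; ties keep the first hit seen.
    """
    best = {}
    for est_id, gene_id, e_value in hits:
        current = best.get(est_id)
        if current is None or e_value < current[1]:
            best[est_id] = (gene_id, e_value)
    return best


# Made-up sample data, for illustration only.
all_hits = [
    ("EST0001", "GENE-A", 1e-12),
    ("EST0001", "GENE-B", 1e-40),
    ("EST0002", "GENE-C", 1e-5),
]
top = remarkable_hits(all_hits)
```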

As described above, the EST analysis server 200 according to the present invention searches the analysis data in various ways according to the search method selected by the user, and the EST analysis client 100 displays the results of those searches on the screen. To this end, the EST analysis client 100 provides several overall result view menus, such as a hit list view and a history map view.

FIG. 25 is a diagram showing the hit list screen that is executed when the hit list search menu shown in FIG. 4 is selected. Referring to FIG. 25, the search results for the ESTs are presented as a hit list in table form. The horizontal axis of the hit list represents the type of reference database, and the vertical axis represents the EST sequence ID. Each search result is shown as a check mark according to whether a hit was found in the database that was searched.
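The hit list layout can be sketched as a boolean matrix with one row per EST sequence ID and one column per reference database. The function names, the `(est_id, database)` pair representation, and the plain-text rendering with `v` as the check mark are assumptions for illustration; the database names are the ones this specification mentions as examples.

```python
def build_hit_list(est_ids, databases, hits):
    """Return rows of booleans: rows follow est_ids, columns follow databases.

    `hits` is an iterable of (est_id, database) pairs that produced a hit.
    """
    found = set(hits)
    return [[(est_id, db) in found for db in databases] for est_id in est_ids]


def render_hit_list(est_ids, databases, rows, mark="v"):
    """Render the matrix as fixed-width text with a mark wherever a hit exists."""
    header = ["EST ID"] + list(databases)
    width = max(len(cell) for cell in header + list(est_ids)) + 2
    lines = ["".join(cell.ljust(width) for cell in header).rstrip()]
    for est_id, row in zip(est_ids, rows):
        cells = [est_id] + [mark if hit else "" for hit in row]
        lines.append("".join(cell.ljust(width) for cell in cells).rstrip())
    return "\n".join(lines)


# Made-up sample data, for illustration only.
est_ids = ["EST0001", "EST0002"]
databases = ["UniGene", "StackDB", "RefSeq", "TIGR"]
hits = [("EST0001", "UniGene"), ("EST0001", "RefSeq")]
rows = build_hit_list(est_ids, databases, hits)
text = render_hit_list(est_ids, databases, rows)
```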

FIG. 26 is a diagram showing the history map screen that is executed when the history map menu shown in FIG. 4 is selected. Referring to FIG. 26, the history map function represents all analysis processes as a single map and visually distinguishes the processes that have been completed. The analysis progress of all EST sequences is displayed for each sequence database entered by the user, and completed searches are shown in a different color (for example, blue). The map is generated from the execution information stored in the history table of the management database 223 after each analysis task completes. Through this function the user can check the status of the EST analyses already performed and decide whether to continue the analysis. The search functions and result viewing functions of the EST analysis server 200 described above can be summarized as follows.

FIG. 27 is a flowchart showing a sequence search method according to a preferred embodiment of the present invention. Referring to FIG. 27, the sequence search method according to the present invention first selects a search method (step 2810). The search methods distinguished in step 2810 are broadly divided into general search and advanced search. General search is further divided into ID search and keyword search, and advanced search is divided into Remarkable Hit search and function category keyword search.

For example, when ID or keyword search is selected as the search method in step 2810, the user first enters the gene ID or keyword to be used for the search (step 2820). In response to the search term entered in step 2820, EST sequence and annotation data are retrieved from the EST sequence database 221 and the annotation database 222 (step 2830), and detailed information on the genes corresponding to the retrieved EST sequence data, together with protein domain information, is extracted (step 2840). The information extracted in this way is displayed on the client 100 as the search result (step 2880).

When advanced search is selected as the search method in step 2810, the advanced search is further divided into Remarkable Hit search and function category keyword search (step 2850). If the Remarkable Hit search function is selected in step 2850, the top-ranked search results are extracted from the annotation database 222 to generate a Remarkable Hit file (step 2860). The Remarkable Hit file generated in step 2860 is displayed on the client 100 as the advanced search result (step 2880). If the function category keyword search is selected in step 2850, a search is performed on the annotation database 222 for each functional category (step 2870). The data retrieved in step 2870 is displayed on the client 100 as the advanced search result (step 2880).
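The branching of FIG. 27 can be sketched as a single dispatch function, with the step numbers from the description noted in comments. The dictionary-based rows, the field names, the method strings, and the use of the lowest E-value to pick the top result are all assumptions for illustration, not the patented implementation.

```python
def run_search(method, annotation_rows, query=None):
    """Dispatch one search along the branches described for FIG. 27.

    `annotation_rows` is a list of dicts with the hypothetical keys
    est_id, gene_id, gene_title, category and e_value.
    """
    if method in ("id", "keyword"):              # step 2810: general search
        if query is None:                        # step 2820: search term required
            raise ValueError("general search needs an ID or keyword")
        if method == "id":                       # steps 2830 and 2840
            rows = [r for r in annotation_rows if r["est_id"] == query]
        else:
            needle = query.lower()
            rows = [r for r in annotation_rows if needle in r["gene_title"].lower()]
    elif method == "remarkable_hit":             # steps 2850 and 2860
        best = {}
        for r in annotation_rows:
            kept = best.get(r["est_id"])
            if kept is None or r["e_value"] < kept["e_value"]:
                best[r["est_id"]] = r
        rows = list(best.values())
    elif method == "function_category":          # steps 2850 and 2870
        rows = sorted(annotation_rows, key=lambda r: r["category"])
    else:
        raise ValueError(f"unknown search method: {method}")
    return rows                                  # step 2880: shown on the client


# Made-up sample data, for illustration only.
annotation_db = [
    {"est_id": "EST0001", "gene_id": "G1", "gene_title": "protein kinase C",
     "category": "signaling", "e_value": 1e-30},
    {"est_id": "EST0001", "gene_id": "G2", "gene_title": "ribosomal protein L3",
     "category": "translation", "e_value": 1e-8},
    {"est_id": "EST0002", "gene_id": "G3", "gene_title": "heat shock protein 70",
     "category": "stress response", "e_value": 1e-50},
]
by_id = run_search("id", annotation_db, "EST0001")
top_only = run_search("remarkable_hit", annotation_db)
by_category = run_search("function_category", annotation_db)
```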

As described above, the EST sequence analysis system 200 according to the present invention builds an EST sequence and analysis result database through EST sequence similarity searches against reference gene databases organized by species and by tissue, protein domain regular expression searches, domain fingerprint searches, and domain profile searches. Multi-faceted searches of the resulting database then make it possible to extract detailed information on similar genes and to predict domain-related gene functions.
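As a rough illustration of what a protein domain regular expression search involves, the sketch below scans a protein sequence for the well-known N-glycosylation site motif, written N-{P}-[ST]-{P} in PROSITE-style notation. The choice of motif, the function name `find_motif`, and the assumption that the EST has already been translated into a protein sequence are illustrative only; the specification does not state which patterns are used at this point.

```python
import re

# N-{P}-[ST]-{P}: Asn, any residue except Pro, Ser or Thr, any residue except Pro.
# The lookahead lets overlapping occurrences be reported as well.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")


def find_motif(protein_sequence, pattern=N_GLYCOSYLATION):
    """Return (1-based start position, matched residues) for every occurrence."""
    return [(m.start() + 1, m.group(1)) for m in pattern.finditer(protein_sequence)]


# Made-up protein fragment, for illustration only.
matches = find_motif("MKNGSAANPTQNKTW")
```

In this fragment the asparagine at position 8 is skipped because it is followed by proline, which the motif excludes.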

In the above, a client/server-based EST sequence analysis system capable of EST sequence analysis and search using the UniGene, StackDB, RefSeq, and TIGR databases, which hold gene sequence information and information on gene function, has been specifically illustrated as an embodiment of the present invention. Various other reference gene and protein domain databases can also be applied to the present invention, and the invention can be applied not only on the web but also in other online and offline settings.

The present invention can also be embodied as computer-readable code on a computer-readable recording medium. The computer-readable recording medium includes all kinds of recording devices in which data readable by a computer system is stored. Examples of the computer-readable recording medium include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices, and also include media implemented in the form of a carrier wave (for example, transmission over the Internet). The computer-readable recording medium can also be distributed over computer systems connected by a network, so that the computer-readable code is stored and executed in a distributed manner.

As described above, the client/server-based EST analysis system according to the present invention efficiently integrates the building of a database from EST sequence results produced in large quantities, the comparison of that data with information in existing databases, and the storage and retrieval of search results. Accordingly, building a database of experimental EST data and analyzing EST sequence search results can be carried out easily with integrated analysis solutions, and the integrated results can be comprehensively extracted and compared through a single interface.

Claims

An input/output manager that receives, from a user, EST (Expressed Sequence Tag) sequence data produced by an experiment;

a sequence input unit that converts the EST sequence data into a predetermined format and stores it in a first database;

an annotation analysis unit that performs a similarity search and a protein domain search between the EST sequence data stored in the first database and the sequence data stored in a second database, in which a large amount of verified gene and protein sequence data is stored, and stores the search results in the first database; and

an analysis result search unit that searches the data stored in the first database in response to a search clue input from the user.

The EST sequence analysis system of claim 1,

wherein the input/output manager receives project and user management information, the EST sequence information, and the search clue from at least one client connected through a network, and provides the search result produced by the analysis result search unit to the client through the network.

The EST sequence analysis system of claim 2, wherein the first database comprises:

an EST sequence database for storing the EST sequence data input from the user;

an annotation database for storing the results of the similarity search and the domain search performed by the annotation analysis unit; and

a management database for storing the project management information and the user management information related to the EST analysis.

The EST sequence analysis system of claim 1, wherein the second database comprises:

a BLAST database for performing the similarity search; and

a domain search database for performing the protein domain search.

The EST sequence analysis system of claim 4, wherein the domain search database comprises:

a first domain search database for performing a protein domain regular expression search for the EST sequence;

a second domain search database for performing a protein domain fingerprint pattern search for the EST sequence; and

a third domain search database for performing a protein profile search for the EST sequence.

The EST sequence analysis system of claim 1,

wherein the EST sequence data input through the input/output manager comprises an ABI format file and a FASTA format file.

The EST sequence analysis system of claim 3 or 6,

wherein the sequence input unit reads the ABI file and converts it into a sequence, converts the converted sequence or the FASTA sequence into a predetermined data format, stores the converted sequence information in the EST sequence database, and stores the project management information and the user management information in the management database.

The EST sequence analysis system of claim 4, wherein the annotation analysis unit comprises:

a first annotation analyzer configured to perform a BLAST search on the EST sequence data using the BLAST database to identify which gene sequence stored in the BLAST database the EST sequence is similar to, and to store the identified result in the annotation database; and

a second annotation analyzer configured to perform a protein domain regular expression search, a protein domain fingerprint pattern search, and a protein profile search on the EST sequence data using the domain search database, and to store the search results in the annotation database.

The EST sequence analysis system of claim 1,

wherein, in response to the search clue input from the user, the analysis result search unit performs one of a general search, which includes an ID search and a keyword search, and an advanced search, which includes a function category keyword search and a Remarkable Hit search, and displays the search result in either hit list form or history map form.

The EST sequence analysis system of claim 9,

wherein the Remarkable Hit search extracts the top result from the results of the general search.

The EST sequence analysis system of claim 9,

wherein the history map indicates the analysis progress for all of the EST sequences.

(a) receiving, from a user, EST (Expressed Sequence Tag) sequence data produced by an experiment;

(b) converting the EST sequence data into a predetermined format and storing it in a first database;

(c) performing a similarity search and a protein domain search between the EST sequence data stored in the first database and the data stored in a second database, in which a large amount of verified gene and protein sequence data is stored, and storing the search results in the first database; and

(d) searching the data stored in the first database in response to a search clue input from the user.

The EST sequence analysis method of claim 12,

wherein, in step (a), project and user management information, the EST sequence information, and the search clue are received from at least one client connected through a network.

The EST sequence analysis method of claim 13, wherein the first database comprises:

an EST sequence database for storing the EST sequence data input from the user;

The EST sequence analysis method of claim 12, wherein the second database comprises:

a BLAST database for performing the similarity search; and

a domain search database for performing the protein domain search.

The EST sequence analysis method of claim 15, wherein the domain search database comprises:

a third domain search database for performing a protein profile search for the EST sequence.

The EST sequence analysis method of claim 12,

wherein the EST sequence data input in step (a) comprises an ABI format file and a FASTA format file.

The EST sequence analysis method of claim 14 or 17,

wherein, in step (a), the ABI file is read and converted into a sequence, and the converted sequence or the FASTA sequence is converted into a predetermined data format and stored in the EST sequence database.

The EST sequence analysis method of claim 14,

wherein step (a) further comprises storing the project management information and the user management information in the management database.

The EST sequence analysis method of claim 15, wherein step (c) comprises:

(c-1) performing a BLAST search on the EST sequence data using the BLAST database to determine which gene sequence stored in the BLAST database the EST sequence is similar to, and storing the result in the annotation database; and

(c-2) performing a protein domain regular expression search, a protein domain fingerprint pattern search, and a protein profile search on the EST sequence data using the domain search database, and storing the search results in the annotation database.

The EST sequence analysis method of claim 12, wherein step (d) comprises:

(d-1) selecting whether to perform a general search or an advanced search on the data stored in the first database;

(d-2) when the general search is selected in step (d-1), performing one of an ID search and a keyword search in response to the search clue input from the user;

(d-3) when the advanced search is selected in step (d-1), performing one of a function category keyword search and a Remarkable Hit search in response to the search clue input from the user; and

(d-4) displaying the search results of step (d-2) or (d-3) in hit list form and history map form.

The EST sequence analysis method of claim 21,

wherein the Remarkable Hit search extracts the top result from the search results of step (d-2).

The EST sequence analysis method of claim 21,

wherein the history map indicates the analysis progress for all of the EST sequences.

(a) receiving, from a user, EST sequence data produced by an experiment and constructing a first database;

(b) performing a similarity search between the EST sequence data stored in the first database and the data stored in a second database, in which a large amount of verified gene and protein sequence data is stored, and performing a protein domain search for the EST sequence;

(c) determining, based on the sequence similarity search result and the protein domain search result of step (b), whether the EST sequence is identified with any gene sequence stored in the second database;

(d) when the EST sequence is identified, analyzing the necessary information in the gene entry of the second database corresponding to the EST sequence, and storing the analysis result in the first database; and

(e) recording, in a history table of the first database, whether steps (a) and (b) have been performed.

The EST sequence analysis and database construction method of claim 24, wherein step (b) comprises:

(b-1) performing a protein domain regular expression search on the EST sequence;

(b-2) performing a protein domain fingerprint search on the EST sequence; and

(b-3) performing a protein domain profile search on the EST sequence.

(a) performing, in response to a search clue input from a user, one of an ID search and a keyword search on a first database in which EST sequence analysis results are stored, and extracting the gene information and protein domain information corresponding to the retrieved EST sequence data;

(b) selecting one of a Remarkable Hit function and a function category keyword search function as an advanced search method;

(c) when the Remarkable Hit function is selected as the advanced search method in step (b), extracting and displaying the top result from the results extracted in step (a); and

(d) when the function category keyword search function is selected as the advanced search method in step (b), performing a function-specific search on the categories to which the results extracted in step (a) belong, and displaying the search result.

A computer-readable recording medium having recorded thereon a program for executing the method of any one of claims 12 to 26 on a computer.