KR102110017B1

KR102110017B1 - miRNA ANALYSIS SYSTEM BASED ON DISTRIBUTED PROCESSING

Info

Publication number: KR102110017B1
Application number: KR1020170170284A
Authority: KR
Inventors: 류성호; 배윤위; 최철원
Original assignee: 순천향대학교 산학협력단
Priority date: 2017-12-12
Filing date: 2017-12-12
Publication date: 2020-05-12
Also published as: WO2019117353A1; KR20190069922A

Abstract

A miRNA analysis system based on distributed processing includes: a client device that preprocesses raw data produced by next generation sequencing (NGS) and transmits target sequence information; a miRNA database containing miRNA identifiers and the sequences of the miRNAs they identify; a target gene database containing miRNA identifiers and the target gene information matched to each identifier; and an analysis server that extracts a matching target miRNA from the miRNA database based on the received target sequence information, and then extracts matching target gene information from the target gene database based on that target miRNA. The target sequence information is selected from the NGS analysis data based on how many times each sequence is duplicated.

Description

miRNA analysis system based on distributed processing {miRNA ANALYSIS SYSTEM BASED ON DISTRIBUTED PROCESSING}

The technique described below relates to a system that performs miRNA analysis based on NGS analysis data.

Next generation sequencing (NGS) has dramatically reduced the time and cost of sequencing. The data produced by NGS is on the order of terabytes (10¹² bytes) in size, and advances in computer technology have made it possible to store such data at low cost and high efficiency. NGS data analysis has established itself as a new research field in bioinformatics.

MicroRNA (miRNA) is an RNA molecule of 21-25 nucleotides (nt) that inhibits mRNA translation and thereby directly controls gene expression in eukaryotes. miRNAs are well conserved in both plants and animals, and have been found to regulate many mRNA mechanisms. To understand these functions, it is very important to find the target genes (target mRNAs) with which a miRNA interacts.

US Patent Publication US 2017-0177792

The data generated by NGS is large. It is difficult for a single server to process multiple large NGS data sets in parallel. The technique described below aims to provide an analysis system that processes large NGS data in a distributed manner.

A miRNA analysis system based on distributed processing includes: a client device that preprocesses raw data produced by next generation sequencing (NGS) and transmits target sequence information; a miRNA database containing miRNA identifiers and the sequences of the miRNAs they identify; a target gene database containing miRNA identifiers and the target gene information matched to each identifier; and an analysis server that extracts a matching target miRNA from the miRNA database based on the received target sequence information, and then extracts matching target gene information from the target gene database based on that target miRNA.

The technique described below improves the speed of miRNA analysis by preprocessing NGS analysis data on the client device. It also improves the accuracy of miRNA analysis by normalizing the data during distributed processing.

FIG. 1 is an example of a miRNA analysis system.
FIG. 2 is an example of a flowchart of the preprocessing of gene sequences.
FIG. 3 is an example of the process by which a client device generates the genetic information to be delivered to the analysis server.
FIG. 4 is an example showing the configuration of a client device.
FIG. 5 is an example of the operation in which the analysis server extracts genetic information using DBs.
FIG. 6 is a procedure flowchart of the process in which the analysis server extracts genetic information using DBs.
FIG. 7 is an example of the process by which the analysis server identifies a target gene.
FIG. 8 is an example of the genetic information extracted by the analysis server.
FIG. 9 is another example of the genetic information extracted by the analysis server.
FIG. 10 is an example showing the structure of the analysis server.

First, the data generated by NGS analysis is briefly described, using FASTQ, a representative format, as the reference.

NGS decodes a base sequence by generating reads, short sequence fragments usually about 100 bases long. NGS generally stores the decoded sequences in a FASTQ file. This is usually called raw data.

NGS reads are about 100 bp long, shorter than the 500-1,000 bp of conventional Sanger sequencing; their sequencing error is relatively high, and platform-dependent errors may also be included. The FASTQ files that NGS platforms produce extend the FASTA format, the standard text-based format for representing DNA sequences, with the accuracy of each decoded base (a quality score or error rate).

In a FASTQ file, each read occupies four lines. The first line begins with @ and carries information such as the platform used and the sequence length. The second line is the decoded sequence. The third line begins with a + sign and carries other descriptive text. The last line gives the accuracy (quality score) of the sequence in the second line. The second and fourth lines therefore contain the same number of entries.
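The four-line layout described above can be read with a short parser. The following is a minimal sketch and not part of the patent: the names `FastqRecord`, `read_fastq`, and `phred_scores` are hypothetical, and the Phred+33 quality encoding is an assumption, since the source does not state which encoding the quality line uses.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    header: str    # line 1, without the leading '@'
    sequence: str  # line 2, the decoded bases
    comment: str   # line 3, without the leading '+'
    quality: str   # line 4, one character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per four lines, reading lazily from the handle."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank line) ends the stream
            return
        sequence = handle.readline().rstrip("\n")
        comment = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not comment.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lines differ in length")
        yield FastqRecord(header[1:], sequence, comment[1:], quality)


def phred_scores(quality: str, offset: int = 33) -> list[int]:
    """Decode a quality string, assuming a Phred+33 ASCII encoding."""
    return [ord(ch) - offset for ch in quality]


# A single made-up record, used only to exercise the parser.
example = "@read1 length=9\nATTGGGAAA\n+\nIIIIIIIII\n"
records = list(read_fastq(io.StringIO(example)))
```

Because the parser is a generator, a large file can be streamed record by record rather than loaded whole.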

The accuracy of the sequences in a FASTQ file is very important because it carries through to every subsequent analysis step for SNP/indel calling. Moreover, in humans SNPs occur in only about 0.1% of the whole genome (about 1 in 1,000 bp), so the technique used to detect them must be highly accurate, and sequencing errors can lead to incorrect SNP/indel calls.

The technique described below relates to a system that analyzes miRNA data. The analysis system has two characteristic components. One is the set of distributed individual client devices that preprocess NGS analysis data in advance. The other is the analysis server, which extracts matching miRNAs, target genes, and related information based on the preprocessed data.

FIG. 1 is an example of a miRNA analysis system 100. The miRNA analysis system 100 uses raw data generated by NGS to provide analysis results for the corresponding sample. The miRNA analysis system 100 includes client devices 110, an analysis server 130, and a genetic information DB 150.

FIG. 1 shows three client devices (110A, 110B, and 110C) as an example. Each client device basically preprocesses one set of raw data. The client device 110C processes raw data generated by the NGS analysis device 50. Although not illustrated, the other client devices 110A and 110B likewise obtain and process raw data generated by an NGS analysis device 50. The raw data received by a client device 110 is the genetic analysis result for a specific sample (tissue). It is assumed that the client device 110 obtains raw data for miRNA sequences. The preprocessing is described in detail later.

The client device 110 transmits the preprocessed data, that is, miRNA sequence information, to the analysis server 130. The data that the client device 110 preprocesses and transmits is called target sequence information. The target sequence information includes at least one of the sequences present in the raw data. It may include both a sequence and the number of times that sequence is repeated.

The analysis server 130 compares the received target sequence information with genetic information prepared in advance and identifies the matching miRNA. It then uses the identified miRNA to look up and extract the related genetic information. The analysis server 130 may provide the extracted genetic information to the client device 110.

The genetic information DB 150 may hold sequence information for specific miRNAs, target gene information related to specific miRNA sequences, disease information related to specific miRNA sequences or target genes, pathway information for target genes, and so on. As described later, the genetic information DB 150 may consist of a plurality of databases (DBs).

Client device for preprocessing NGS data

The following describes a client device that preprocesses a large FASTQ file. The technique is of course not limited to a specific NGS analysis platform or file format. The description below assumes that the NGS analysis data is miRNA data.

The following describes how the client device 110 preprocesses the raw data. FIG. 2 is an example of a flowchart of the gene sequence preprocessing process 200. The client device receives the raw file that the NGS analysis device generated by analyzing a sample (210). The raw file may contain only data for the miRNAs of the sample.

The client device detects duplicate sequences among the sequences in the raw data, that is, it determines the sequence frequency (220). For each sequence in the raw data, the client device searches the raw data for identical sequences and counts the number of duplicates. The client device obtains this duplicate count for every sequence in the raw data.
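The duplicate-counting step (220) can be sketched with a hash-based counter. This is an illustrative sketch only: the function name `count_duplicates` and the sample reads are hypothetical, and the patent does not state which data structure the client device uses.

```python
from collections import Counter
from typing import Iterable


def count_duplicates(sequences: Iterable[str]) -> Counter:
    """Return how many times each distinct sequence occurs.

    Counter consumes the iterable one item at a time, so only the
    distinct sequences and their counts are held in memory.
    """
    return Counter(sequences)


# Made-up reads for illustration.
reads = [
    "ATTGGGAAA", "CCGTAGGTA", "ATTGGGAAA", "ATTGGGAAA",
    "CCGTAGGTA", "ATTGGGAAA", "GGGTTTAAC",
]
counts = count_duplicates(reads)
print(counts.most_common(1))  # [('ATTGGGAAA', 4)]
```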

The client device identifies, among all the sequences in the raw data, those whose duplicate count is at or above a reference value (230). For example, the client device may select the sequences whose duplicate count is greater than 10. Sequences whose duplicate count is at or above the reference value are called candidate target sequences.
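A sketch of the candidate selection step (230) follows. The text describes the criterion both as "at or above a reference value" and, in the example, as "greater than 10", so the hypothetical `inclusive` flag below exposes both readings; the function name and sample counts are likewise assumptions.

```python
def select_candidates(
    counts: dict[str, int], threshold: int = 10, inclusive: bool = False
) -> dict[str, int]:
    """Keep sequences whose duplicate count passes the threshold.

    inclusive=False keeps counts strictly greater than the threshold
    (the "greater than 10" example); inclusive=True keeps counts >= it.
    """
    if inclusive:
        return {seq: n for seq, n in counts.items() if n >= threshold}
    return {seq: n for seq, n in counts.items() if n > threshold}


# Made-up duplicate counts for illustration.
counts = {"ATTGGGAAA": 25, "CCGTAGGTA": 10, "GGGTTTAAC": 3}
candidates = select_candidates(counts)
```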

The client device sorts the candidate target sequences (240), using the duplicate count of each sequence as the key. The client device may sort the candidate target sequences in ascending order of duplicate count. As noted above, FASTQ files are very large, so it is difficult to read one into memory and process it all at once. The client device therefore sorts only the selected candidate target sequences. The reference value that defines a candidate target sequence may be set in consideration of the characteristics of the sample, preliminary analysis of the raw data, system specifications, and the like.

Based on the sorted candidate target sequences, the client device may normalize the raw data or the information on specific sequences in the raw data (250). The client device may normalize the duplicate counts of the candidate target sequences.

Furthermore, although not shown in FIG. 2, the client device 110 determines at least one of the various sequences in the raw data as a target sequence. There may be several target sequences or only one.

The client device may select, from the normalized data, the candidate target sequence with the highest duplicate count (or the highest value indicating the degree of duplication) as the target sequence. Alternatively, the client device 110 may select the top few sequences by duplicate count in the normalized data as target sequences.
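Choosing either the single most duplicated sequence or the top few can be sketched as a top-k selection. The function name `top_targets`, the parameter `k`, and the sample values are hypothetical; the patent does not specify how many sequences are kept.

```python
import heapq


def top_targets(normalized: dict[str, float], k: int = 1) -> list[tuple[str, float]]:
    """Return the k sequences with the largest normalized counts, largest first."""
    return heapq.nlargest(k, normalized.items(), key=lambda item: item[1])


# Made-up normalized counts for illustration.
normalized = {"ATTGGGAAA": 1.6, "CCGTAGGTA": 1.0, "GGGTTTAAC": 0.4}
best = top_targets(normalized)         # single target sequence
top_two = top_targets(normalized, k=2)  # top few sequences
```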

FIG. 3 is an example of the process by which a client device generates the genetic information to be delivered to the analysis server. The client device 110 obtains raw data from the NGS analysis device 50. The client device 110 may receive the raw data from the NGS analysis device 50 over a network. Alternatively, the client device 110 may receive the raw data through a medium (USB drive, SD card, etc.) on which the raw data produced by the NGS analysis device 50 is stored.

The client device 110 first checks, for each sequence in the raw data, whether it is duplicated and how many times. FIG. 3 shows an example in which the duplicate count (4) is determined for the sequence "ATTGGGAAA". The client device 110 determines the duplicate count for each sequence in this way. FIG. 3 also shows an example of the resulting table of duplicate counts.

The table in FIG. 3 has three columns: the duplicate count, the number of rows, and the percentage (%) of all sequences accounted for by sequences with that duplicate count. The number of rows corresponds to the number of rows representing each sequence in the FASTQ file. The table in FIG. 3 is the result of analyzing actual sample data. Using 10 as the reference, the client device 110 selects the sequences whose duplicate count is greater than 10 as candidate target sequences. In FIG. 3, sequences with a duplicate count greater than 10 make up about 1% of the total.

The client device 110 sorts the candidate target sequences. Various sorting methods may be used. However, because the duplicate counts are generally smaller data than the sequences themselves, the client device 110 may use counting sort. The client device 110 sorts the sequences by duplicate count (for example, in ascending order).
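Sorting sequences in ascending order of duplicate count with a counting sort can be sketched as below. This is a bucket-style variant of counting sort, written under the assumption that counts are non-negative integers of moderate size; the function name and sample data are hypothetical.

```python
def counting_sort_by_count(counts: dict[str, int]) -> list[tuple[str, int]]:
    """Sort (sequence, count) pairs in ascending order of count.

    One bucket is allocated per possible count value, so memory grows
    with the largest count rather than with sequence length.
    """
    if not counts:
        return []
    max_count = max(counts.values())
    buckets: list[list[str]] = [[] for _ in range(max_count + 1)]
    for seq, n in counts.items():
        buckets[n].append(seq)  # sequences with equal counts keep input order
    return [(seq, n) for n, bucket in enumerate(buckets) for seq in bucket]


# Made-up candidate counts for illustration.
candidates = {"ATTGGGAAA": 12, "CCGTAGGTA": 30, "GGGTTTAAC": 12, "TTTACCGGA": 11}
ordered = counting_sort_by_count(candidates)
```

The sort key is the small integer count, so no sequence strings are ever compared with each other.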

The client device 110 normalizes the data using the result of the counting sort. The data is normalized because gene expression levels can differ from sample to sample, and because the output of the NGS analysis device may contain some errors. The client device 110 can normalize the data in various ways.

FPKM (fragments per kilobase of exon per million mapped reads) and RPKM (reads per kilobase of exon per million mapped reads) are normalization methods widely used when estimating transcript abundance from the number of RNA reads. However, if a particular sample has high gene expression, that sample yields more reads, which can lead to wrong conclusions in quantitative studies based on gene sequences. To suppress this problem, normalization of the sample is desirable. For example, upper quartile normalization (hereinafter UQ normalization) may be applied to the sample data. The following description focuses on UQ normalization, which can itself be performed in several ways. UQ normalization requires the data to be sorted by some criterion, which is why the client device 110 sorts the sequences by duplicate count.

The description below assumes a matrix-shaped data structure. Each row represents a gene or transcript, and each column represents a different sample. Because one client device normalizes the data for one sample, a structure with a single column can be used. Each cell holds information such as the repeat count of a gene sequence.

(1) (i) Sort the values in ascending order to locate the value at the upper 75% position (the upper quartile). (ii) Divide the expression level of each row (the repeat count of the sequence) by the upper quartile value. This quotient can be used as the final normalization result. (iii) Optionally, multiply the quotient by a single constant (for example, the average duplicate count over all sequences) to obtain the final result.

(2) (i) Sort the values in ascending order to locate the value at the upper 75% position (the upper quartile). (ii) Determine the upper quartile value of the expression levels (sequence repeat counts) of the rows. (iii) Divide the repeat count of each sequence by the sum of the repeat counts of the sequences belonging to the upper quartile. This quotient can be used as the final normalization result. (iv) Optionally, multiply the quotient by a single constant (for example, the average duplicate count over all sequences) to obtain the final result.

(3) (i) Sort the values in ascending order to locate the value at the upper 75% position (the upper quartile). (ii) Determine the upper quartile value of the expression levels (sequence repeat counts) of the rows. (iii) Divide the repeat count of each sequence by the average of the repeat counts of the sequences belonging to the upper quartile. This quotient can be used as the final normalization result. (iv) Optionally, multiply the quotient by a single constant (for example, the average duplicate count over all sequences) to obtain the final result.
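The three UQ normalization variants above can be sketched in one function. This is one possible reading, not the patent's implementation: the nearest-rank definition of the 75th percentile, the interpretation of "sequences belonging to the upper quartile" as those with counts at or above that value, and all names (`upper_quartile`, `uq_normalize`, `method`, `rescale`) are assumptions.

```python
import math
from statistics import mean


def upper_quartile(values: list[float]) -> float:
    """75th percentile by the nearest-rank method (an assumed definition)."""
    if not values:
        raise ValueError("no values to normalize")
    ordered = sorted(values)  # ascending order, as in step (i)
    rank = math.ceil(0.75 * len(ordered))
    return ordered[rank - 1]


def uq_normalize(
    counts: dict[str, float], method: str = "value", rescale: bool = False
) -> dict[str, float]:
    """Normalize repeat counts by an upper-quartile-based divisor.

    method="value": divide by the upper quartile value       (variant 1)
    method="sum":   divide by the sum of upper-quartile rows  (variant 2)
    method="mean":  divide by the mean of upper-quartile rows (variant 3)
    rescale=True multiplies by the average count over all sequences.
    """
    values = list(counts.values())
    q3 = upper_quartile(values)
    top = [v for v in values if v >= q3]  # rows in the upper quartile
    if method == "value":
        divisor = q3
    elif method == "sum":
        divisor = sum(top)
    elif method == "mean":
        divisor = mean(top)
    else:
        raise ValueError(f"unknown method: {method!r}")
    factor = mean(values) if rescale else 1.0
    return {seq: n / divisor * factor for seq, n in counts.items()}


# Made-up repeat counts for a single sample (one column of the matrix).
counts = {"SEQ_A": 10, "SEQ_B": 20, "SEQ_C": 30, "SEQ_D": 40}
by_value = uq_normalize(counts, "value")
by_sum = uq_normalize(counts, "sum")
by_mean = uq_normalize(counts, "mean")
```

With these four made-up counts the upper quartile value is 30, so `SEQ_C` maps to 1.0 under the first variant.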

FIG. 3 shows an example of the normalized result. The client device 110 may produce at least one of the sequences that were in the raw data together with the duplicate count for that sequence. The duplicate count here is the normalized value.

The client device 110 may select the sequences whose duplicate count is at or above a specific value as the final sequence information. The client device 110 may transmit the final sequence information to the analysis server 130, and may transmit it together with the duplicate count for each sequence. In some cases, the client device 110 may transmit all the sequences and their duplicate counts to the analysis server 130.
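The message a client might send can be sketched as a small JSON payload. The patent does not define a wire format, so the JSON encoding, the field names (`sample_id`, `targets`, `sequence`, `count`), and the function name `build_payload` are all hypothetical.

```python
import json


def build_payload(sample_id: str, normalized: dict[str, float], min_value: float) -> str:
    """Serialize sequences whose normalized count is >= min_value, largest first."""
    selected = [
        {"sequence": seq, "count": value}
        for seq, value in sorted(normalized.items(), key=lambda kv: kv[1], reverse=True)
        if value >= min_value
    ]
    return json.dumps({"sample_id": sample_id, "targets": selected})


# Made-up normalized counts for illustration.
normalized = {"GGGTTTAAC": 0.4, "ATTGGGAAA": 1.6, "CCGTAGGTA": 1.0}
payload = build_payload("sample-1", normalized, min_value=1.0)
```

Passing a `min_value` of 0 would send every sequence with its count, which corresponds to the last case described above.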

FIG. 4 is an example showing the configuration of the client device 110. The client device 110 is a device such as a PC, laptop, smart device, or server. The client device 110 includes an input device 111, a computing device 112, a storage device 113, a communication device 114, and an output device 115.

The input device 111 receives the miRNA data produced by the NGS analysis device. Gene data here means data or gene sequences related to the expression of a target gene. The input device 111 is a data input device that brings the miRNA data into the client device 110 through communication or a separate storage medium.

The storage device 113 stores the program for the data preprocessing described above. The storage device 113 may be a hard disk, flash memory, or the like connected to the client device 110. The storage device 113 may store the miRNA data received from the input device 111, and may store the final data obtained by preprocessing the raw data.

The computing device 112 executes the program stored in the storage device 113 to preprocess the raw data. As described above, the computing device 112 normalizes the data based on the duplicate counts of the sequences in the raw data.

The client device 110 can run the program using the Java runtime already installed on a conventional PC, without an installer such as ActiveX. For example, when a user connects to the website that provides the analysis service and logs in, a Java client can be launched automatically to run the preprocessing program.

The communication device 114 may receive the raw data transmitted by the NGS analysis device 50. The communication device 114 may transmit the data preprocessed by the computing device 112 to the analysis server 130.

The output device 115 outputs the preprocessed result data and/or the analysis result (genetic information) received from the analysis server 130.

Analysis server performing miRNA analysis

The analysis server 130 receives the data normalized by the client device 110 (the target sequence information). The following describes how the analysis server 130 uses the target sequence information to extract genetic information.

FIG. 5 is an example of the operation in which the analysis server extracts genetic information using DBs. The analysis server 130 first identifies a miRNA from the target sequence, and can then use various DBs to extract related information.

The analysis server 130 receives the target sequence information preprocessed by the client device 110. The target sequence information may include target sequences that are repeated at least a reference number of times. It may include each target sequence together with its repeat count. It may also include information on all the sequences normalized from the raw data (each sequence and its normalized repeat count). The analysis server 130 identifies the matching miRNA in the miRNA DB 151 based on the target sequence information. The miRNA that matches the target sequence information is called the matching miRNA.

The miRNA DB 151 may be a commercial DB operated by a research institute or company. It may hold information derived by processing the contents of a commercial DB, or by processing the contents of several commercial DBs into a uniform format. The miRNA DB 151 may include miRNA sequences and the symbol (name) that denotes each sequence. A miRNA sequence can be larger than the other fields. The miRNA DB 151 may therefore consist of only the miRNA identifier (ID) and the symbol, in which case the analysis server 130 can identify the matching miRNA somewhat faster.
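Identifying the matching miRNA can be sketched as a lookup against an index keyed by sequence. Everything here is a placeholder: the identifiers, symbols, and sequences are made up, exact string equality is an assumed matching rule, and an in-memory dictionary stands in for the miRNA DB 151.

```python
from typing import Optional

# Hypothetical miRNA DB: identifier -> (symbol, sequence). All entries are made up.
MIRNA_DB = {
    "MIR0001": ("demo-miR-1", "ATTGGGAAACCGTAGGTACTGA"),
    "MIR0002": ("demo-miR-2", "GGGTTTAACCGGATTCAGGTCA"),
}

# Reverse index built once so each lookup is a single dictionary access.
SEQUENCE_INDEX = {seq: mirna_id for mirna_id, (_symbol, seq) in MIRNA_DB.items()}


def match_mirna(target_sequence: str) -> Optional[str]:
    """Return the identifier of the miRNA whose sequence equals the target, if any."""
    return SEQUENCE_INDEX.get(target_sequence)


matched = match_mirna("ATTGGGAAACCGTAGGTACTGA")
unmatched = match_mirna("CCCCCCCCCC")
```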

The analysis server 130 can then identify the target genes (mRNA, etc.) on which the matching miRNA acts. To do so, the analysis server 130 looks up the matching miRNA in the target gene DB 152.

The matching miRNA may act on a variety of target genes, and the target genes may be RNAs of different types. In this case the analysis server 130 may extract information on the multiple target genes on which the matching miRNA acts. For example, the analysis server 130 may extract information on at least one target gene among mature RNA, tRNA, rRNA, and piRNA. The target gene DB 152 may then consist of a plurality of DBs for the different types of RNA. The target gene DB 152 may of course also be a single DB that contains the information on all the different types of RNA. If target genes are of different types, a separate identifier is needed to distinguish the type.

타겟 유전자 DB(152)는 연구기관이나 기업에서 서비스하는 상용 DB일 수 있다. 타겟 유전자 DB(152)는 상용 DB에 저장된 정보를 일정하게 가공한 정보를 보유할 수도 있다. 타겟 유전자 DB(152)는 복수의 상용 DB에 저장된 정보를 일정한 포맷으로 가공한 정보를 보유할 수도 있다. 타겟 유전자 DB(152)는 miRNA 식별자 및 해당 miRNA가 작용하는 타겟 유전자의 식별자를 포함한다. 타겟 유전자 DB(152)는 타겟 유전자의 심볼을 더 포함할 수 있다. 예컨대, 분석 서버(130)는 매칭 miRNA의 식별자를 기준으로 타겟 유전자를 추출할 수 있다.The target gene DB 152 may be a commercial DB serviced by a research institute or company. The target gene DB 152 may retain information that is regularly processed from information stored in a commercial DB. The target gene DB 152 may retain information processed in a certain format of information stored in a plurality of commercial databases. The target gene DB 152 includes an miRNA identifier and an identifier of a target gene on which the miRNA operates. The target gene DB 152 may further include a symbol of the target gene. For example, the analysis server 130 may extract the target gene based on the identifier of the matching miRNA.
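The record layout described above can be pictured with a short Python sketch. This is only an illustration under stated assumptions: the class name `TargetGeneRecord`, its field names, the function `find_targets`, and all sample identifiers are hypothetical and do not come from the specification. The `rna_type` field stands in for the separate type identifier mentioned for target genes of different types.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TargetGeneRecord:
    """One hypothetical row of a target gene DB."""
    mirna_id: str                      # miRNA identifier (made-up format)
    gene_id: str                       # identifier of the target gene
    gene_symbol: Optional[str] = None  # optional symbol of the target gene
    rna_type: str = "mRNA"             # type identifier for the target RNA


# Made-up sample data, used only to exercise the lookup below.
SAMPLE_TARGET_GENE_DB = [
    TargetGeneRecord("MIR-0001", "G-100", "GENE_A", "mRNA"),
    TargetGeneRecord("MIR-0001", "G-200", "GENE_B", "tRNA"),
    TargetGeneRecord("MIR-0002", "G-300", None, "mRNA"),
]


def find_targets(mirna_id: str,
                 rna_type: Optional[str] = None) -> List[TargetGeneRecord]:
    """Return the records for a miRNA identifier, optionally filtered by RNA type."""
    return [
        rec for rec in SAMPLE_TARGET_GENE_DB
        if rec.mirna_id == mirna_id
        and (rna_type is None or rec.rna_type == rna_type)
    ]
```

Calling `find_targets("MIR-0001")` returns both sample records for that miRNA, while `find_targets("MIR-0001", "tRNA")` narrows the result to the single tRNA record. A single table with a type column corresponds to the "one DB for all RNA types" option; separate tables per type would correspond to the other option.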

Thereafter, the analysis server 130 may extract additional genetic information based on the target gene.

The analysis server 130 may extract related ontology information from the gene ontology DB 153 based on the identifier (or symbol) of the target gene. A gene ontology can be regarded as a category used to classify genes, and this step is also referred to as GO term analysis. The gene ontology includes information on the biological function of the identified target gene. The gene ontology may also include information indicating that the target gene is associated with a specific disease.

The gene ontology DB 153 may be a commercial DB operated by a research institute or a company. Alternatively, the gene ontology DB 153 may hold information obtained by processing the contents of a commercial DB, or by converting the contents of a plurality of commercial DBs into a common format. The gene ontology DB 153 includes a miRNA identifier and the identifier of the ontology for that miRNA. The gene ontology DB 153 may further include information on other genes classified under the same ontology and on diseases in which the ontology is involved.

The analysis server 130 may extract related pathway information from the pathway DB 154 based on the identifier (or symbol) of the target gene. Pathway information is information about the biological pathway in which the target gene is involved. A target gene takes part in a specific biological mechanism, and the pathway information indicates the specific path (specific point) within the overall mechanism at which the target gene is involved.

The pathway DB 154 may be a commercial DB operated by a research institute or a company. Alternatively, the pathway DB 154 may hold information obtained by processing the contents of a commercial DB, or by converting the contents of a plurality of commercial DBs into a common format. The pathway DB 154 includes a miRNA identifier and the identifier of the pathway in which that miRNA is involved. The pathway DB 154 may further include information on the pathway of the entire mechanism that contains the specific pathway in which the miRNA is involved.

FIG. 6 is a flowchart of a process 300 in which the analysis server extracts genetic information using the DBs.

The analysis server 130 receives the target sequence information preprocessed by the client device 110 (301).

The analysis server 130 identifies the matching miRNA in the miRNA DB 151 based on the target sequence information. The process of extracting the matching miRNA is described below.

(1) The analysis server 130 queries the miRNA DB 151 with the target sequence included in the target sequence information (311). The miRNA DB 151 returns the miRNA sequence information that matches the received query to the analysis server 130. The miRNA sequence information includes a miRNA identifier or a miRNA symbol, and may also include the sequence of the identified miRNA. The miRNA DB 151 may be a commercial database such as miRBase. miRBase holds miRNA names and sequence information in table form.

When querying the target sequence, the analysis server 130 may submit the entire target sequence as it is. In this case, the search looks for a miRNA that matches the entire target sequence. The miRNA DB 151 may return, as response data, the single miRNA with the highest matching rate.
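The specification does not define how the matching rate is computed. As a minimal sketch, assuming the matching rate is the fraction of positions at which two sequences carry the same base (measured against the longer sequence), selecting the best match could look like the following. The function names `matching_rate` and `best_match` and the sample sequences are hypothetical; a production system would more likely rely on a proper alignment method.

```python
from typing import Dict, Optional, Tuple


def matching_rate(target: str, candidate: str) -> float:
    """Fraction of identical positions, relative to the longer sequence.

    This position-wise identity is an assumed stand-in for the
    'matching rate'; it ignores insertions and deletions.
    """
    if not target or not candidate:
        return 0.0
    same = sum(a == b for a, b in zip(target, candidate))
    return same / max(len(target), len(candidate))


def best_match(target: str,
               mirna_db: Dict[str, str]) -> Optional[Tuple[str, float]]:
    """Return (miRNA identifier, rate) for the highest matching rate, or None."""
    if not mirna_db:
        return None
    scored = ((mirna_id, matching_rate(target, seq))
              for mirna_id, seq in mirna_db.items())
    return max(scored, key=lambda pair: pair[1])


# Made-up sequences for illustration only.
SAMPLE_MIRNA_DB = {
    "miRNA 1": "ATTGGGAAAC",
    "miRNA 2": "ATTGCGATAC",
    "miRNA 3": "GGGGGGGGGG",
}
```

With this sample data, `best_match("ATTGGGAAAC", SAMPLE_MIRNA_DB)` selects miRNA 1, which is identical to the target at every position.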

However, the matching rate may drop when the entire target sequence is used as the criterion, because sequencing by the NGS technique may introduce a certain amount of error. The analysis server 130 may therefore query the target sequence with part of it left out, and the miRNA DB 151 then searches for matching miRNAs based only on the queried portion. For example, the analysis server 130 may query sequence information using an SQL statement such as a MySQL SELECT statement, in which a wildcard (*) can be used. The query may take a form such as "SELECT(*ATTGGGAAA*)". In this case, the bases at the wildcard positions are regarded as matching any sequence. In other words, the analysis server 130 queries the part of the target sequence that remains after excluding the bases covered by the wildcard.
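The wildcard query can be tried out with a small, self-contained sketch. It makes several assumptions that are not in the specification: it uses Python's built-in `sqlite3` module instead of MySQL so that it runs without a server, the table name `mirna` and its columns are hypothetical, and the sequences are made up. Note also that in a standard SQL `LIKE` pattern the wildcards are `%` (any run of characters) and `_` (exactly one character) rather than `*`.

```python
import sqlite3
from typing import List


def find_mirnas_like(conn: sqlite3.Connection, pattern: str) -> List[str]:
    """Return identifiers of miRNAs whose sequence matches a LIKE pattern."""
    rows = conn.execute(
        "SELECT mirna_id FROM mirna WHERE sequence LIKE ? ORDER BY mirna_id",
        (pattern,),
    )
    return [row[0] for row in rows]


# In-memory database with a hypothetical schema and made-up sequences.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mirna (mirna_id TEXT PRIMARY KEY, sequence TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO mirna VALUES (?, ?)",
    [
        ("miRNA 1", "CCATTGGGAAAGT"),
        ("miRNA 2", "ATTGGGAAA"),
        ("miRNA 3", "CCATTGCGAAAGT"),
    ],
)

# Any flanking bases are accepted around the fragment.
flanked = find_mirnas_like(conn, "%ATTGGGAAA%")

# One uncertain base inside the fragment is excluded from the comparison.
one_base_masked = find_mirnas_like(conn, "%ATTG_GAAA%")
```

The first pattern tolerates unknown bases before and after the fragment; the second additionally masks a single position, which is one way to exclude a base suspected of being a sequencing error.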

(2) As described above, the target sequence information may include a plurality of target sequences. In this case, the analysis server 130 may query each of the target sequences and then receive a query result for each of them.

The analysis server 130 receives the result of querying the target sequence from the miRNA DB 151 (312). When the identifier of a single miRNA matching the target sequence is received, the analysis server 130 determines that the target sequence matches that miRNA (321). As described above, the target sequence information may include a plurality of target sequences, in which case the miRNA DB 151 may return a matching miRNA identifier for each target sequence (312). When the analysis server 130 receives a plurality of miRNA identifiers, it may identify the miRNA that occurs most often among them as the final miRNA (321).
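Choosing the most frequently returned miRNA amounts to a simple majority count. The sketch below shows one way to do it with `collections.Counter`; the function name `most_frequent_mirna` is hypothetical, and the tie-breaking behavior (the identifier seen first wins) is an assumption, since the specification does not say how ties are resolved.

```python
from collections import Counter
from typing import List, Optional


def most_frequent_mirna(mirna_ids: List[str]) -> Optional[str]:
    """Return the identifier that occurs most often, or None for an empty list.

    On a tie, Counter.most_common keeps first-seen order, so the
    identifier encountered first is returned (an assumed policy).
    """
    if not mirna_ids:
        return None
    counts = Counter(mirna_ids)
    return counts.most_common(1)[0][0]
```

For example, if three target sequences return "miRNA 1", "miRNA 2" and "miRNA 1", the function picks "miRNA 1".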

The analysis server 130 queries the target gene DB 152 with the miRNA identifier (331). The target gene DB 152 searches for the target genes on which the miRNA acts based on the miRNA identifier, and returns the target gene information it finds (332). The analysis server 130 identifies the target gene of the matching miRNA based on the received target gene information (341). As described above, the target gene corresponds to at least one mRNA or similar molecule in whose mechanism the miRNA is involved.

The analysis server 130 queries the gene ontology DB 153 with the target gene identifier (351). The gene ontology DB 153 searches for gene ontology information based on the target gene identifier and returns the gene ontology information for that target gene (352). The analysis server 130 identifies the ontology information for the target gene based on the received ontology information (361). As described above, the ontology information may include the functional classification of the target gene, other related genes, related diseases, and the like.

The analysis server 130 queries the pathway DB 154 with the target gene identifier (371). The pathway DB 154 searches for pathway information based on the target gene identifier and returns the pathway information for that target gene (372). The analysis server 130 identifies the pathway information for the target gene based on the received pathway information (381).
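The chained lookups of steps 331 to 381 can be summarized in a compact sketch. Plain dictionaries stand in for the target gene DB, the gene ontology DB and the pathway DB; the names `GENES_BY_MIRNA`, `ONTOLOGY_BY_GENE`, `PATHWAYS_BY_GENE` and `annotate_mirna`, as well as every entry in the sample data, are hypothetical. A real deployment would issue queries to separate databases rather than read in-memory tables.

```python
from typing import Dict, List

# Made-up stand-ins for the three databases, keyed as described in the text.
GENES_BY_MIRNA: Dict[str, List[str]] = {
    "miRNA 1": ["Gene 1", "Gene 2"],
}
ONTOLOGY_BY_GENE: Dict[str, List[str]] = {
    "Gene 1": ["apoptotic process"],
}
PATHWAYS_BY_GENE: Dict[str, List[str]] = {
    "Gene 1": ["Pathway X"],
    "Gene 2": ["Pathway Y"],
}


def annotate_mirna(mirna_id: str) -> Dict[str, Dict[str, List[str]]]:
    """Map each target gene of a miRNA to its ontology and pathway entries."""
    annotations: Dict[str, Dict[str, List[str]]] = {}
    # Steps 331-341: miRNA identifier -> target genes.
    for gene in GENES_BY_MIRNA.get(mirna_id, []):
        annotations[gene] = {
            # Steps 351-361: target gene -> gene ontology information.
            "ontology": ONTOLOGY_BY_GENE.get(gene, []),
            # Steps 371-381: target gene -> pathway information.
            "pathways": PATHWAYS_BY_GENE.get(gene, []),
        }
    return annotations
```

Missing entries simply yield empty lists, so a gene without ontology data still appears in the result with its pathway information.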

FIG. 7 shows examples of the process by which the analysis server identifies the target gene. FIG. 7(A) is the case where a single miRNA DB is used. The miRNA DB compares the received target sequence with the miRNA sequences it holds and searches for matching miRNAs. FIG. 7(A) corresponds to the search result and shows the matching rates for miRNA 1, miRNA 2 and miRNA 3. The miRNA DB may output miRNA 1, which has the highest matching rate for the target sequence, as the final result.

FIG. 7(B) is the case where two miRNA DBs are used. miRNA DB 1 and miRNA DB 2 each compare the received target sequence with the miRNA sequences they hold and search for matching miRNAs. FIG. 7(B) corresponds to the search results and shows the matching rates for miRNA 1, miRNA 2 and miRNA 3. miRNA DB 1 may output miRNA 1, which has the highest matching rate for the target sequence, as its final result, and miRNA DB 2 may likewise output miRNA 1. The analysis server 130 may combine the results received from the plurality of miRNA DBs to make the final determination of the miRNA that matches the target sequence.

If the results received from the plurality of miRNA DBs differ from one another, (1) the analysis server 130 may identify the miRNA with the relatively higher matching rate among the results as the final miRNA and proceed with the subsequent analysis, or (2) the analysis server 130 may send an analysis failure message to the client device 110.
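The two options for handling disagreement between miRNA DBs can be expressed as one small function. In this sketch, the name `reconcile_results` and the `fail_on_conflict` flag are hypothetical; returning `None` is used here to signal that the caller should send the analysis failure message.

```python
from typing import List, Optional, Tuple


def reconcile_results(results: List[Tuple[str, float]],
                      fail_on_conflict: bool = False) -> Optional[str]:
    """Combine (miRNA identifier, matching rate) results from several DBs.

    Returns the agreed identifier when all DBs agree. On disagreement,
    returns the identifier with the highest matching rate (option 1), or
    None when fail_on_conflict is set (option 2).
    """
    if not results:
        return None
    distinct_ids = {mirna_id for mirna_id, _ in results}
    if len(distinct_ids) == 1:
        return results[0][0]
    if fail_on_conflict:
        return None
    return max(results, key=lambda item: item[1])[0]
```

With results `[("miRNA 1", 0.8), ("miRNA 2", 0.9)]`, option (1) yields "miRNA 2", whereas passing `fail_on_conflict=True` yields `None`.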

The analysis server 130 may provide the client device 110 with additional information such as the matching rate between the target sequence and the miRNA.

FIG. 8 shows examples of the genetic information extracted by the analysis server. FIG. 8(A) is an example of information on the target genes identified from the miRNA. FIG. 8(A) shows the source DB of the miRNA used to identify the target gene (Source DB), the symbol of the target gene (Symbol), and the sequence of the target gene (Sequence).

The analysis server 130 may transmit information related to the target gene to the client device 110. The related information includes the target gene identifier, the target gene symbol, the target gene sequence, and the like. The client device 110 may output information such as the type of the target gene (mature RNA, tRNA, rRNA, piRNA, etc.) and the sequence of the target gene.

Furthermore, the analysis server 130 may provide the information classified, according to the matching result, into validated targets (Validation Target) and predicted targets (Prediction Target). The analysis server 130 may also provide the client device 110 with additional information about the target gene. The additional information may include the binding affinity of the matching miRNA for the target gene, a score indicating the degree of binding, and the like.

Analyzing gene expression patterns yields DEGs (Differentially Expressed Genes) that show high expression under a specific condition. In other words, the effect that the expression of one gene has on the expression of other genes can be analyzed. Gene ontology information can be prepared on the basis of such results. Categorizing all of the genes in an organism, as in the Gene Ontology (GO), and analyzing how the gene set is composed is one method of analyzing gene function.

FIG. 8(B) is an example of the gene ontology information identified from the target gene. FIG. 8(B) shows the target gene (Target gene), the ontology domain of the target gene (Domain), and the related disease (Disease). For example, if certain genes are related to the apoptotic process, the function of those genes can be classified as apoptosis. The domain is a broader category for classifying gene function. For example, domains may be divided into CC (cellular component), MF (molecular function), BP (biological process), and the like. The disease is the specific disorder in which the function of the gene is involved. FIG. 8(B) shows that Gene 1 is related to colorectal cancer and that Gene 2 and Gene 3 are related to breast cancer.

FIG. 8(C) is another example of the gene ontology information identified from the target gene. FIG. 8(C) shows the target gene (Target gene), the gene ontology (GO Term), and the P value (P-Value). The gene ontology represents information on the function of the gene. The P value is one of the criteria used in determining the ontology of a specific gene. The upper part of FIG. 8(C) shows an example in which genes and their ontologies are represented visually as a circular graph.

FIG. 9 is another example of the genetic information extracted by the analysis server. FIG. 9 is an example of outputting the pathway information identified from the target gene. For example, FIG. 9 may correspond to a screen and interface output by the client device 110. In FIG. 9, area A displays the types of pathways. Area B displays the entire related mechanism in the form of a tree. In area B, each rectangular box, which corresponds to a node of the tree, represents a specific mechanism, and nodes connected by an edge (shown as a solid line) represent mechanisms that are related to each other. Area C corresponds to an interface that displays a specific target gene or accepts a specific target gene as input. For example, when a specific target gene is entered, area A may mark the related pathway with a dotted box, and area B may mark the mechanisms in which the target gene is involved with bold solid lines and dotted boxes.

FIG. 10 is an example showing the structure of the analysis server 130. The analysis server 130 includes a computing device 131, a storage device 132, and a communication device 133.

The storage device 132 stores a program for analyzing the miRNA data described above. The storage device 132 may be a hard disk, flash memory, or the like connected to the analysis server 130. The storage device 132 may also store the target sequence information received from the client device, the query results received from the various DBs, and the like.

The computing device 131 executes the program stored in the storage device 132, queries the received target sequence information, and identifies the matching miRNA. The computing device 131 identifies the target gene based on the matching miRNA. The computing device 131 also identifies the gene ontology or pathway based on the target gene. Each of these processes is as described above.

The communication device 133 receives the target sequence information from the client device 110. The communication device 133 also receives query results and information related to those results from the various DBs. Furthermore, the communication device 133 may transmit the query results, the results of analyzing the query results, and various additional information to the client device 110.

In addition, the method of preprocessing miRNA data and the miRNA analysis method described above may be implemented as a program (or application) containing an executable algorithm that can be run on a computer. The program may be provided stored on a non-transitory computer readable medium.

A non-transitory readable medium is not a medium that stores data for a short moment, such as a register, cache, or memory, but a medium that stores data semi-permanently and can be read by a device. Specifically, the various applications or programs described above may be provided stored on a non-transitory readable medium such as a CD, DVD, hard disk, Blu-ray disc, USB device, memory card, or ROM.

The present embodiment and the drawings attached to this specification merely illustrate part of the technical idea included in the technology described above. It is evident that all modifications and specific embodiments that a person skilled in the art could readily infer within the scope of the technical idea contained in the specification and drawings fall within the scope of rights of the technology described above.

50: NGS analysis device
100: miRNA analysis system
110: client device
111: input device
112: computing device
113: storage device
114: communication device
115: output device
110A, 110B, 110C: client device
130: analysis server
131: computing device
132: storage device
133: communication device
150: genetic information DB
151: miRNA DB
152: target gene DB
153: Gene ontology DB
154: Pathway DB

Claims

A miRNA analysis system based on distributed processing, comprising:
a plurality of client devices that preprocess raw data analyzed based on NGS (Next Generation Sequencing) and transmit target sequence information;
a miRNA database including a miRNA identifier and the sequence of the miRNA identified by the miRNA identifier;
a target gene database including miRNA identifiers and target gene information matching the miRNA identifiers;
an analysis server that extracts a matching target miRNA from the miRNA database based on the received target sequence information, and extracts matching target gene information from the target gene database based on the target miRNA;
a gene ontology database including a gene identifier and gene ontology information matching the gene identifier; and
a pathway database including a gene identifier and pathway information matching the gene identifier,
wherein each of the plurality of client devices preprocesses raw data for one sample,
wherein the client device selects, from among a plurality of sequences present in the raw data, a plurality of candidate target sequences whose number of duplicates is equal to or greater than a first reference value, normalizes the number of duplicates of the plurality of sequences using a value determined based on the order in which the plurality of candidate target sequences are sorted, and determines, as the target sequence information, the sequence having the largest normalized number of duplicates or the sequences whose normalized number of duplicates is equal to or greater than a second reference value,
wherein the client device determines the first reference value based on a sample characteristic, a preliminary analysis result of the raw data, and a system specification, and selects the plurality of candidate target sequences based on the first reference value,
wherein the client device normalizes the number of duplicates by dividing the number of duplicates of the plurality of sequences by the sum of the numbers of duplicates of the sequences belonging to the upper quartile in the sorted order of the plurality of candidate target sequences, and multiplying the resulting value by the average number of duplicates of all the sequences,
wherein the analysis server extracts, from the gene ontology database and based on the identifier of the target gene included in the target gene information, gene ontology information including at least one of a biological function of the target gene, another gene related to the target gene, and a disease related to the target gene, and
wherein the analysis server identifies, in the pathway database, the pathway associated with the target gene based on the identifier of the target gene, and extracts the pathway information including at least one of the identifier of the associated pathway and the portion of a specific overall pathway on which the associated pathway acts.

delete

The miRNA analysis system based on distributed processing according to claim 1,
wherein the client device selects sequences whose normalized number of duplicates is greater than 10 as the candidate target sequences, reads the data for the candidate target sequences from the raw data at once, and performs count alignment on the candidate target sequences.

delete

The miRNA analysis system based on distributed processing according to claim 1,
wherein the analysis server determines, as the target miRNA, the miRNA that has the highest matching rate with the target sequence included in the target sequence information among the sequences stored in the miRNA database.

The miRNA analysis system based on distributed processing according to claim 1,
wherein the analysis server identifies, as the target miRNA, the miRNA that has the highest matching rate among the sequences stored in the miRNA database when at least one base of the target sequence included in the target sequence information is excluded from the comparison.

The miRNA analysis system based on distributed processing according to claim 1,
wherein the analysis server delivers to the miRNA database a query command in which at least one base of the target sequence included in the target sequence information is replaced with a wildcard.

delete