KR101253700B1

KR101253700B1 - High Speed Encoding Apparatus for the Next Generation Sequencing Data and Method therefor

Info

Publication number: KR101253700B1
Application number: KR1020100118435A
Authority: KR
Inventors: 전영준; 박상현; 안성민; 황희정
Original assignee: 가천대학교 산학협력단
Priority date: 2010-11-26
Filing date: 2010-11-26
Publication date: 2013-04-12
Also published as: KR20120056944A

Abstract

A high-speed compression apparatus for NGS data and a compression method therefor are disclosed. The high-speed compression apparatus for NGS data according to an embodiment of the present invention comprises: a data compression unit that, for NGS (Next Generation Sequencer) data, separates DNA sequence data into sequence identifier lines and base data lines and applies a different compression method to each; an interference minimization unit that, in order to minimize interference between the compression result of the current data block produced by the data compression unit and the next data block, converts the base data lines into binary data by referring to an encode map and then combines the binary data into byte data; and a post-processing partition storage unit that divides and stores the compressed data so that it can be decompressed in parallel.

Description

High Speed Encoding Apparatus for the Next Generation Sequencing Data and Method therefor

Embodiments of the present invention relate to a high-speed compression apparatus for NGS data (Next Generation Sequencing Data) and a method therefor. More specifically, they relate to a high-speed compression apparatus and method that can compress and partition NGS data at high speed so that the data is suitable for transmission of DNA sequence data in the NGS job preparation stage of cloud computing.

Cloud computing is a computing environment in which IT (Information Technology) services such as data storage, networking, and content use are available at once through servers on the Internet. Information is stored permanently on servers on the Internet and only temporarily on clients such as desktops, tablet computers, laptops, netbooks, and smartphones. In other words, all of a user's information is stored on servers on the Internet and can be used anytime and anywhere through various IT devices. Put differently, cloud computing is a computing service in which users borrow as much as they need of computing resources such as hardware and software, which exist in an intangible form like a cloud, and pay a usage fee for them. It is a technology that integrates computing resources located in different physical places by means of virtualization and provides them as one. Cloud computing, an innovative computing technology in which servers on the Internet (represented as a cloud) provide IT services such as data storage, processing, networking, and content use at once, is also defined as "an on-demand outsourcing service for IT resources using the Internet."

By adopting cloud computing, companies and individuals can reduce the enormous cost, time, and labor spent on maintaining, repairing, and managing computer systems, on purchasing and installing servers, on updates, and on purchasing software. It can also contribute to energy savings.

In addition, data kept on a PC (Personal Computer) may be lost through a hard disk failure or the like, whereas in a cloud computing environment the data is stored on external servers, so it can be kept safely, storage space limits can be overcome, and users can view and modify their documents anytime and anywhere. There are also drawbacks, however: personal information may be leaked if a server is hacked, and the data cannot be used if a server failure occurs.

Recently, cloud computing environments built by various portal sites such as Google, Daum, and Naver have made it easy to use a range of services even from portable IT devices such as tablet computers and smartphones. Cloud computing has drawn attention as a next-generation Internet service because of its ease of use and large industrial ripple effect, and it emerged in the late 2000s as a new integrated IT management model.

Meanwhile, the Human Genome Project, which began in 1988, was a grand project that set out to read the entire human genome sequence of 3 billion base pairs. Although at first it was regarded as reckless, it was valued highly enough to be called the "Apollo program" of biology, and in the course of the project a variety of technologies developed around the genome sequencer. As a result, completion of the project was declared in 2003, sooner than expected, and scientists around the world now share the complete human sequence information in their bioinformatics research. Life science has now reached the point of seriously considering sequencing each individual person in order to obtain information that can improve that person's health.

One of the biggest barriers here is the cost and time required to sequence 3 billion base pairs. This is evident from the fact that an international project that started in 1988 managed to sequence the genome of just one person only in 2003. Scientists nevertheless expect that in the near future it will be possible to sequence one person's genome for about $1000. The reason is that new technologies keep appearing and the performance of genome sequencers is improving dramatically. Products called next-generation genome sequencers have recently been appearing one after another.

Recently, ways of using cloud computing to process large volumes of NGS (Next Generation Sequencer) data effectively have been studied.

However, transferring NGS data amounting to hundreds of gigabytes to cloud storage such as Amazon's EC2 and EBS requires a great deal of time and money. When combining a cloud computing environment with NGS, the important factors are not only the compression ratio of the DNA (deoxyribonucleic acid) sequence for long-term storage but also time factors such as transfer speed and compression time. A way of processing DNA sequence data in as little working time as possible is therefore needed.

Embodiments of the present invention have been devised to meet the need described above. Their object is to provide a high-speed compression apparatus for NGS data, and a method therefor, that can compress NGS data at high speed so that it is suitable for transmission of DNA sequence data in the NGS job preparation stage of cloud computing, and that can partition the data so that decoding can be performed in parallel.

To achieve the above object, a high-speed compression apparatus for NGS data according to an embodiment of the present invention comprises: a data compression unit that, for NGS (Next Generation Sequencer) data, separates DNA sequence data into sequence identifier lines and base data lines and applies a different compression method to each; and an interference minimization unit that, in order to minimize interference between the compression result of the current data block produced by the data compression unit and the next data block, converts the base data lines into binary data by referring to an encode map and then combines the binary data into byte data.

Here, the sequence identifier line may be a portion that repeats at a frequency equal to or greater than a set value. In this case, the data compression unit may compress the sequence identifier line using a general-purpose compression method such as LZMA (Lempel-Ziv-Markov chain Algorithm), gzip, or bzip2.

In addition, the base data line may be a portion composed of ACGT (adenine, cytosine, guanine, thymine) with random pattern characteristics. In this case, the data compression unit may compress the base data line using compression at the level of bit operations.

According to an embodiment of the present invention, NGS data can be compressed at high speed so that it is suitable for transmission of DNA sequence data in the NGS job preparation stage of cloud computing. In addition, the compressed output can be partitioned so that it can be decoded in parallel in cloud computing.

FIG. 1 is a diagram schematically showing a high-speed compression apparatus for NGS data according to an embodiment of the present invention.
FIG. 2 is a diagram showing an example of compression processing using SOLiDzipper.
FIG. 3 is a diagram showing a comparison of compression efficiency and time for different compression methods.
FIG. 4 is a diagram showing the change in Rt according to data transfer speed and compression method.
FIG. 5 is a flowchart showing a high-speed compression method for NGS data according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. In the following description, detailed descriptions of techniques well known to those skilled in the art may be omitted.

In describing the components of the present invention, components with the same name may be given different reference numerals in different drawings, and the same reference numeral may be used across different drawings. Even so, this does not mean that the component has different functions in different embodiments, or that it has the same function in different embodiments. The function of each component should be determined from the description of that component in the corresponding embodiment.

In describing embodiments of the present invention, a detailed description of a related known configuration or function may be omitted when it is judged that it would obscure the gist of the present invention.

In describing the components of the present invention, terms such as first, second, A, B, (a), and (b) may be used. These terms serve only to distinguish one component from another and do not limit the nature, sequence, or order of the components. When a component is described as being "connected", "coupled", or "joined" to another component, it may be directly connected or joined to that other component, but it should be understood that yet another component may also be "connected", "coupled", or "joined" between them.

FIG. 1 is a diagram schematically showing a high-speed compression apparatus for NGS data according to an embodiment of the present invention. The high-speed compression apparatus 100 for NGS data according to an embodiment of the present invention includes a data compression unit 110 and an interference minimization unit 120. In addition, owing to the storage scheme of the interference minimization unit 120, the compressed output can be partitioned so that it can be decoded in parallel in cloud computing.

The data compression unit 110 separates the DNA sequence data of the NGS data into sequence identifier lines and base data lines and applies a different compression method to each. Here, the sequence identifier line may be a portion that repeats at a frequency equal to or greater than a set value. In this case, the data compression unit 110 may compress the sequence identifier line using a general-purpose compression method such as LZMA (Lempel-Ziv-Markov chain Algorithm), gzip, or bzip2. The base data line is a portion composed of ACGT (adenine, cytosine, guanine, thymine) or 0123 (color space) with random pattern characteristics. In this case, the data compression unit 110 may compress the base data line using compression at the level of bit operations. The compression methods are not limited to those described, however, and various compression methods are applicable.

The encoding follows two main compression strategies. The first strategy is to separate the parts of the DNA sequence data that compress easily from the parts that do not. In general, the parts that compress easily are those that repeat at high frequency, such as tag IDs and sequence numbers, while the parts that do not are the DNA sequence portions with random pattern characteristics. It is therefore desirable to apply a different encoding method to each region of the data. The plain text portion can be compressed with a general-purpose compression algorithm that compresses plain text efficiently, such as LZMA (Lempel-Ziv-Markov chain Algorithm) or gzip, while the base portion composed of ACGT (adenine, cytosine, guanine, thymine) and the quality value portion composed of a fixed numeric range can use encoding at the level of bit operations.
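The first strategy can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that identifier lines start with `>` (as in common CSFASTA files) and that every other line is base data. The function names `split_streams` and `compress_identifiers` and the sample records are hypothetical and do not come from the patent. Python's standard `lzma` module stands in for the general-purpose compressor.

```python
import lzma

def split_streams(lines):
    """Separate identifier lines (assumed to start with '>') from base data lines."""
    ids, bases = [], []
    for line in lines:
        (ids if line.startswith(">") else bases).append(line)
    return ids, bases

def compress_identifiers(ids, preset=1):
    """Compress the repetitive identifier text with a general-purpose method (LZMA)."""
    return lzma.compress("\n".join(ids).encode("ascii"), preset=preset)

# Hypothetical sample records, for illustration only.
records = [">1_1_1_F3", "T0113", ">1_1_2_F3", "T2301"]
ids, bases = split_streams(records)
id_blob = compress_identifiers(ids)
# The base lines would go to a separate bit-level encoder rather than to LZMA.
```

Keeping the two streams apart lets each one use the method that suits it: the dictionary-based compressor sees only highly repetitive text, and the base data never enters a dictionary at all.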

The second strategy is to minimize interference between the compression result of a data block read from the storage device and the processing of the next block. In general, compression efficiency can increase as the data dictionary used for compression grows. For example, G-SQZ, a recent DNA compression method, uses Huffman coding and can compress DNA sequences by about 80%. However, G-SQZ must build a Huffman tree during encoding, and the size of the Huffman tree grows as the alphabet of the analyzed data grows. Given the nature of DNA data, compressing data on the scale of hundreds of gigabytes requires physical memory in which to load a huge Huffman tree. Moreover, as the dictionary grows with the amount of data compressed, the dictionary lookup time for processing the next data block can grow along with the dictionary size.

To this end, in order to minimize interference between the compression result of the current data block produced by the data compression unit 110 and the next data block, the interference minimization unit 120 converts the base data lines into binary data by referring to an encode map and then combines the binary data into byte data.

High-speed compression of NGS data performs character-oriented bit operations that exploit the fact that the base data and quality values of a DNA sequence vary within a limited range. For example, in the csfasta file of the CSFASTA/QUAL format, the color space characters '0', '1', '2', and '3' take 2² distinct values, so 2 bits are enough to map each value to a bit pattern. Suppose the 1-byte color space character '0' is mapped to binary '00', '1' to '01', '2' to '10', and '3' to '11'. The 4-byte color space data '0113' is then, after the bit mapping, shifted into a single byte and encoded as the 1-byte value '00 01 01 11' = 0x17. Quality values in CSFASTA/QUAL range from a minimum of -1 to a maximum of 40. A quality value therefore fits in 6 bits, which allow 2⁶ distinct values. The 2 bits left over in each byte are used to store part of another quality value, which is split across bytes.
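The 2-bit mapping described above can be sketched in Python as follows. The mapping values and the '0113' example are taken from the text. The names `ENCODE_MAP`, `pack_colors`, and `unpack_colors`, and the choice to left-align a final group of fewer than four symbols, are assumptions made for this illustration.

```python
# Encode map from the description: each color space character maps to 2 bits.
ENCODE_MAP = {"0": 0b00, "1": 0b01, "2": 0b10, "3": 0b11}

def pack_colors(colors):
    """Pack a string of color space characters into bytes, four symbols per byte."""
    out = bytearray()
    for i in range(0, len(colors), 4):
        chunk = colors[i:i + 4]
        byte = 0
        for ch in chunk:
            byte = (byte << 2) | ENCODE_MAP[ch]   # shift in the next 2-bit code
        byte <<= 2 * (4 - len(chunk))             # assumption: left-align a short final group
        out.append(byte)
    return bytes(out)

def unpack_colors(data, count):
    """Recover `count` color space characters from packed bytes."""
    digits = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            digits.append("0123"[(byte >> shift) & 0b11])
    return "".join(digits[:count])

packed = pack_colors("0113")   # single byte 0x17, as in the description
```

Because every byte is computed from its own four symbols only, no state carries over from one block to the next, which is the property the interference minimization relies on.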

FIG. 2 is a diagram showing an example of compression processing using SOLiDzipper. In SOLiDzipper, the read block reads the data, and the preprocess block separates the DNA sequence data of the NGS data into sequence identifier lines and base data lines, as described above. The mainprocess block can compress plain text information, such as sequence IDs and plain text numbers, with a general-purpose compression method. At this stage, quality value compression (for example, converting a quality value to 1 byte or assigning 4 quality values to 3 bytes) and csfasta compression (for example, assigning 2 bits to each of ACGT or 0123, or combining 4 bases into 1 byte) can be performed. Unlike common compression methods, SOLiDzipper does not use a compression dictionary scheme, so computing resource requirements and dictionary lookup time can be minimized.
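Assigning 4 quality values to 3 bytes can be sketched as below, since 4 × 6 bits = 24 bits. The description gives only the value range (-1 to 40) and the 6-bit width. The +1 offset used to make values non-negative, the big-endian byte order, the zero padding of a short final group, and the function names are all assumptions of this illustration.

```python
def pack_quality(values):
    """Pack quality values in [-1, 40] into 6 bits each, four values per 3 bytes."""
    out = bytearray()
    for i in range(0, len(values), 4):
        group = values[i:i + 4]
        word = 0
        for q in group:
            if not -1 <= q <= 40:
                raise ValueError(f"quality value out of range: {q}")
            word = (word << 6) | (q + 1)          # assumption: offset by 1 so -1 maps to 0
        word <<= 6 * (4 - len(group))             # assumption: zero-pad a short final group
        out += word.to_bytes(3, "big")
    return bytes(out)

def unpack_quality(data, count):
    """Recover `count` quality values from 3-byte groups."""
    values = []
    for i in range(0, len(data), 3):
        word = int.from_bytes(data[i:i + 3], "big")
        for shift in (18, 12, 6, 0):
            values.append(((word >> shift) & 0x3F) - 1)
    return values[:count]

sample = [-1, 0, 40, 17, 5]    # hypothetical quality values
blob = pack_quality(sample)    # two 3-byte groups
```

Stored as one byte per value, the same data would take 5 bytes for the values alone. The packed form uses 3 bytes per complete group of four.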

FIG. 3 is a diagram showing a comparison of compression efficiency and time for different compression methods.

The questions that the DNA sequence data compression experiments with the NGS high-speed compression implementation were intended to answer can be summarized as follows.

a) When a format such as fasta, whose field values vary within a limited range, is compressed in a way that differs from traditional compression mechanisms, what difference does it make?

b) Does the compression result differ when the encoding method is combined with a traditional compression method?

To this end, the NGS high-speed compression was implemented as a Java 1.6 command-line program on a 64-bit Linux machine (Linux 2.6.29.4-167.fc11.x86_64, Fedora 11 64-bit, Intel(R) Core(TM)2 Duo CPU E8400 3.00 GHz, 4 GB memory), and the experiments were run in the same environment. Data in the CSFASTA/QUAL format generated on the SOLiD platform was chosen as the compression target. The reason is that, because of the structural characteristics of CSFASTA/QUAL, the data is much larger than FASTQ data with the same content, and the relative compression gain is also higher.

The compression comparison experiments used the general-purpose compression algorithms gzip (version 1.3.12) and LZMA (version 4.65) with the fast compression option (mx=1), LZMA with the maximum compression option (mx=9), and G-SQZ (version 0.6), a recent DNA sequence compression method.

FIG. 4 is a diagram showing the change in Rt according to data transfer speed and compression method.

The results show the difference in compression performance between the encoding method for DNA data and existing DNA compression methods or general-purpose compression methods.

The usual purpose of DNA compression is to reduce long-term storage costs by compressing large volumes of data, so compression efficiency can be an important factor. When a cloud computing environment is combined with NGS, however, time factors such as the transfer speed to the cloud computing environment and the compression time are as important as the compression ratio of the DNA sequence. The ready-to-job time (Rt) for cloud computing is the sum of the compression, transfer, and decompression times. If the time factor is ignored in DNA sequence compression, Rt increases as compression and decompression times increase. If Rt increases, the advantage of using cloud computing to shorten NGS working time disappears. The Rt of DNA sequence data for cloud computing can be expressed as in Equation 1.

[Equation 1]

Substituting the 133 gigabytes of data generated by SOLiD™ 3.5 and the compression results of FIG. 3 into Equation 1 gives the change in Rt with transfer speed shown in FIG. 4. In FIG. 4, results above 400 hours are cut off because of space constraints.

According to FIG. 4, except for research institutes or companies with a dedicated network capable of gigabit-class transfer speeds to the cloud computing environment, high-speed compression that takes the time factor into account is the best approach in a typical transfer environment.
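The definition of Rt as the sum of compression, transfer, and decompression time can be sketched as a small Python function. Equation 1 itself is not reproduced in this text, so the transfer-time model used here (compressed size divided by link speed) is an assumption, as are the function name, the parameter names, and the decimal unit conventions. The numbers in the usage lines are hypothetical placeholders, not the measurements of FIG. 3.

```python
def ready_to_job_hours(raw_gb, compressed_fraction, compress_h, decompress_h, link_mbps):
    """Rt = compression time + transfer time + decompression time, in hours.

    raw_gb              -- uncompressed size in gigabytes (10^9 bytes)
    compressed_fraction -- compressed size / raw size (1.0 means no compression)
    compress_h          -- compression time in hours
    decompress_h        -- decompression time in hours
    link_mbps           -- sustained transfer speed in megabits per second
    """
    compressed_bits = raw_gb * compressed_fraction * 8e9
    transfer_h = compressed_bits / (link_mbps * 1e6) / 3600
    return compress_h + transfer_h + decompress_h

# Hypothetical inputs, for illustration only: a fast encoder with a moderate
# ratio versus a slow encoder with a better ratio, over a 10 Mbit/s link.
fast = ready_to_job_hours(133, 0.30, 1.0, 1.0, 10)
slow = ready_to_job_hours(133, 0.25, 20.0, 5.0, 10)
```

With a model of this shape, the compression and decompression terms are fixed while the transfer term shrinks as the link gets faster, which is why the choice of method depends on the transfer environment.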

The representative characteristics of the encoding method are summarized as follows.

a) It supports only known text-based sequence formats.

b) When the size of the compression target grows linearly from several gigabytes to several terabytes, the processing time also grows linearly. In addition, when compressing the same format, the compression ratio changes little even if the data size changes.

c) Because the compressed file is stored in fixed-size blocks, data at a specific position can be restored without decompressing the entire file.
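Characteristic c) follows from the fixed-width encoding: with 2 bits per symbol, the position of any symbol in the packed output can be computed directly. The sketch below illustrates this for color space data. The block size, the in-memory list of blocks, and the names `locate` and `read_symbol` are hypothetical choices for this example and are not the storage layout of the patent.

```python
BLOCK_BYTES = 1  # hypothetical fixed block size, kept tiny for the example

def locate(symbol_index, block_bytes=BLOCK_BYTES):
    """Map a symbol index to (block index, byte offset in block, 2-bit slot in byte)."""
    byte_index, slot = divmod(symbol_index, 4)      # four 2-bit symbols per byte
    block_index, offset = divmod(byte_index, block_bytes)
    return block_index, offset, slot

def read_symbol(blocks, symbol_index, block_bytes=BLOCK_BYTES):
    """Decode one color space symbol, touching only the block that contains it."""
    block_index, offset, slot = locate(symbol_index, block_bytes)
    byte = blocks[block_index][offset]
    return "0123"[(byte >> (6 - 2 * slot)) & 0b11]

# 0x17 encodes '0113' and 0xE4 encodes '3210' under the 2-bit encode map.
blocks = [bytes([0x17]), bytes([0xE4])]
symbol = read_symbol(blocks, 4)   # first symbol of the second block
```

A dictionary-based stream generally has to be decoded from the start to reach a given position, whereas here only one block is read.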

Because the encoding is a byte-to-multi-bit encoding that exploits the limited range of values in DNA sequence data, its applicability to general plain text other than DNA sequence formats is limited. Even if quality values or color space symbols are extended to multi-byte representations in the future, however, the encoding can still be expected to be effective as long as the storage representation leaves unused space. The encoding was designed to support NGS research that uses cloud computing. Reducing the cost of long-term data retention after transfer to the cloud environment is outside the scope of the present invention. Embodiments of the present invention present a realistic alternative for compressing, transferring, and decompressing, at minimal cost, the enormous amounts of DNA sequence data generated at remote sites when sending them over a network to cloud computing. They thus show the possibility of a favorable trade-off when moving data to cloud computing: giving up a few percent of compression efficiency relative to existing DNA compression methods in exchange for a dramatic reduction in job waiting time.

FIG. 5 is a flowchart showing a high-speed compression method for NGS data performed by the high-speed compression apparatus of FIG. 1.

Referring to FIGS. 1 and 5, the data compression unit 110 separates the DNA sequence data of the NGS data into sequence identifier lines and base data lines (S501). The data compression unit 110 then compresses the separated sequence identifier lines and base data lines by applying a different compression method to each (S503). Here, the sequence identifier line may be a portion that repeats at a frequency equal to or greater than a set value. In this case, the data compression unit 110 may compress the sequence identifier line using a general-purpose compression method. The base data line may be a portion composed of ACGT (adenine, cytosine, guanine, thymine) with random pattern characteristics. In this case, the data compression unit 110 may compress the base data line using compression at the level of bit operations. The compression methods are not limited to those described, however, and various compression methods are applicable.

In order to minimize interference between the compression result of the current data block produced by the data compression unit 110 and the next data block, the interference minimization unit 120 converts the base data lines into binary data by referring to an encode map and then combines the binary data into byte data (S505). The post-processing storage unit 140 may divide and store the converted data of the interference minimization unit 120 using the partition storage unit 130.
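Because each stored block can be decoded without reference to any other block, decoding can be distributed across workers. The following Python sketch illustrates the idea with a thread pool from the standard library. The function names and the sample blocks are hypothetical, and the thread pool merely stands in for the separate cloud nodes that the description has in mind. In CPython, threads would not by themselves speed up this CPU-bound loop.

```python
from concurrent.futures import ThreadPoolExecutor

def decode_block(block):
    """Decode one block of 2-bit-packed color space symbols, independently of other blocks."""
    return "".join("0123"[(b >> s) & 0b11] for b in block for s in (6, 4, 2, 0))

def decode_parallel(blocks, workers=4):
    """Decode all blocks concurrently and join the results in their original order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return "".join(pool.map(decode_block, blocks))

# Hypothetical partitioned output: 0x17 encodes '0113', 0xE4 encodes '3210'.
decoded = decode_parallel([bytes([0x17]), bytes([0xE4])])
```

`pool.map` returns results in input order, so the decoded sequence is reassembled correctly regardless of which worker finishes first.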

Although all of the components constituting the embodiments of the present invention have been described above as being combined into one or operating in combination, the present invention is not necessarily limited to such embodiments. That is, within the scope of the object of the present invention, all of the components may operate by being selectively combined into one or more. In addition, although each of the components may be implemented as a single independent piece of hardware, some or all of the components may be selectively combined and implemented as a computer program having program modules that perform some or all of the combined functions on one or more pieces of hardware. Such a computer program may be stored in computer-readable media such as a USB memory, a CD disk, or a flash memory, and read and executed by a computer, thereby implementing the embodiments of the present invention. The storage media of the computer program may include magnetic recording media, optical recording media, carrier wave media, and the like.

Furthermore, all terms, including technical or scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs, unless otherwise defined in the detailed description. Commonly used terms, such as terms defined in a dictionary, should be interpreted as consistent with their contextual meaning in the related art and, unless explicitly defined in the present invention, are not to be interpreted in an idealized or excessively formal sense.

The foregoing description merely illustrates the technical idea of the present invention, and a person of ordinary skill in the art to which the present invention belongs may make various modifications and variations without departing from the essential characteristics of the present invention. Accordingly, the embodiments disclosed in the present invention are intended to describe, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present invention should be interpreted according to the claims, and all technical ideas within an equivalent scope should be construed as falling within the scope of rights of the present invention.

100: High speed compression apparatus for NGS data
110: Data compression unit 120: Interference minimizing unit
130: Data division storage unit 140: Post-processing storage unit

Claims

1. A high speed compression apparatus for NGS data, comprising:
a data compression unit that separates the DNA sequence data of NGS (Next Generation Sequencer) data into sequence identifier lines and base data lines and applies a different compression method to each;
an interference minimizing unit that, in order to minimize interference between the compression result of the current data block produced by the data compression unit and the next data block, converts the base data lines into binary data by referring to an encode map and then combines the binary data into byte data; and
a post-processing split storage unit that splits and stores the compressed data so that the compressed data can be decompressed in parallel.

2. The apparatus of claim 1,
wherein the sequence identifier line is a portion, such as a tag ID or a sequence number, that repeats at a frequency equal to or greater than a predetermined value.

3. The apparatus of claim 1 or 2,
wherein the data compression unit
compresses the sequence identifier line using a general-purpose compression method.

4. The apparatus of claim 1,
wherein the base data line is a portion composed of ACGT (Adenine, Cytosine, Guanine, Thymine) or 0123 (color space) having a random pattern characteristic.

5. The apparatus of claim 1 or 4,
wherein the data compression unit
compresses the base data line using compression at the level of bit operations.

6. A high speed compression method for NGS data, comprising:
separating the DNA sequence data of NGS (Next Generation Sequencer) data into sequence identifier lines and base data lines;
compressing the sequence identifier lines and the base data lines by applying a different compression method to each; and
converting the base data lines into binary data by referring to an encode map and then combining the binary data into byte data, in order to minimize interference between the compression result of the current data block produced by the data compression unit and the next data block.

7. The method of claim 6,
wherein the sequence identifier line is a portion, such as a tag ID or a sequence number, that repeats at a frequency equal to or greater than a predetermined value.

8. The method of claim 6 or 7,
wherein the compressing
comprises compressing the sequence identifier line using a general-purpose compression method.

9. The method of claim 6,
wherein the base data line is a portion composed of ACGT (Adenine, Cytosine, Guanine, Thymine) or 0123 (color space) having a random pattern characteristic.

10. The method of claim 6 or 9,
wherein the compressing
comprises compressing the base data line using compression at the level of bit operations.