KR100996443B1

KR100996443B1 - System and method of parallel distributed processing of gpu by dividing dense indexed data-files into parts of search and computation in query and database system thereof

Info

Publication number: KR100996443B1
Application number: KR1020100034001A
Authority: KR
Inventors: 정종선
Original assignee: (주)신테카바이오
Priority date: 2010-04-13
Filing date: 2010-04-13
Publication date: 2010-11-24

Abstract

PURPOSE: A GPU-based parallel distributed processing system and method, based on dividing the search and computation functions over a highly integrated index database and query data, are provided to perform large volumes of multiple search operations by dividing them among the cores of a GPU (Graphics Processing Unit). CONSTITUTION: A database (300) stores reference data for comparative analysis, divided into preset division units. A query file generator (100) generates a query file by dividing target data input by a user into the division units of the reference data. An operation unit (200) analyzes the correspondence between the target data and the reference data by comparing the query file with the reference data.

Description

System and method of parallel distributed processing of GPU by dividing dense indexed data-files into parts of search and computation in query and database system thereof

The present invention relates to a parallel distributed processing system and a parallel distributed processing method for data, and more particularly to a system, and an analysis method thereof, that compares input target data (the data to be checked) with reference data stored in a database and analyzes the correspondence between the target data and the reference data.

The applicant of the present invention has previously obtained Patent Registration No. 2009-0880531 for a method of converting large-scale data into RVR file and RAT file formats, storing it, and searching the stored data, in order to reduce the access time to specific search data contained in that data.

The present invention concerns the case where target data is input while the reference data is stored as the RVR/RAT files disclosed in Patent Registration No. 2009-0880531: the target data is divided, and the correspondence between the target data and the reference data is analyzed by parallel distributed computation.

FIG. 1 is an exemplary view showing a conventional parallel distributed processing method for base sequences using simple processing.

With reference to this figure, a conventional method of analyzing the correspondence between search data and reference data is described below.

First, as illustrated, consider the case where the reference data is a sequence of N bases and the search data is a sequence of M bases.

In this case, the start of the search data is aligned with the start of the reference data, and the two are checked for a match.

After the check has been carried out over the entire length of the search data, the start of the search data is aligned with the second position of the reference data, and the two are again checked for a match.

This is repeated sequentially until the start of the search data reaches the last position of the reference data.

Through this process, the position in the reference data to which the search data corresponds can be determined. In other words, one or more positions at which the search data exactly matches the reference data can be found.
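The sliding comparison described above can be sketched in Python as follows. This is a minimal illustration and not code from the patent; the function name `find_exact_matches` and the use of Python strings for base sequences are assumptions made for this example. The loop stops at the last offset where the search data still fits inside the reference data.

```python
def find_exact_matches(reference: str, query: str) -> list[int]:
    """Return every 0-based offset at which `query` matches `reference` exactly.

    Hypothetical helper illustrating the conventional method: the search data
    is slid along the reference data one position at a time.
    """
    n, m = len(reference), len(query)
    if m == 0 or m > n:
        return []
    positions = []
    # Align the start of the query with each candidate position in turn.
    for start in range(n - m + 1):
        # Compare base by base over the whole length of the query.
        if all(reference[start + i] == query[i] for i in range(m)):
            positions.append(start)
    return positions


if __name__ == "__main__":
    print(find_exact_matches("ACGTACGT", "CGT"))
```

The cost is on the order of N × M base comparisons, which is what makes the approach impractical for genome-scale inputs.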

However, if the reference data is a human genome base sequence, it contains about 3.2 billion bases, and if the search data is also a human genome base sequence to be compared, it likewise amounts to about 3.2 billion bases.

Therefore, to compare the two, a comparison of 3.2 billion data items must be performed 3.2 billion times.

Moreover, human genome base sequences are not exactly identical to one another but agree to about 99%. A probability is therefore calculated from the number of matching bases in each comparison, and the position with the highest match probability is taken as the position corresponding to the search data.
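A minimal Python sketch of this highest-match selection is given below. It assumes, for illustration only, that the "probability" is simply the fraction of matching bases at each alignment; the source does not specify the statistic, and the name `best_match` is hypothetical.

```python
def best_match(reference: str, query: str) -> tuple[int, float]:
    """Return (offset, identity) of the alignment with the most matching bases.

    Assumption: identity = matching bases / query length. The first offset
    with the highest identity wins when several alignments tie.
    """
    n, m = len(reference), len(query)
    if m == 0 or m > n:
        raise ValueError("query must be non-empty and no longer than reference")
    best_offset, best_identity = 0, -1.0
    for start in range(n - m + 1):
        # Count the positions at which the two sequences agree.
        matches = sum(1 for i in range(m) if reference[start + i] == query[i])
        identity = matches / m
        if identity > best_identity:
            best_offset, best_identity = start, identity
    return best_offset, best_identity


if __name__ == "__main__":
    print(best_match("AAAACGTTAAAA", "CGAT"))
```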

The conventional technique described above therefore has the following problems.

When the reference data and the search data are as voluminous as gene base sequences, an excessive amount of time is required even though the task is a simple comparison, and a system with expensive processing capability must be built to handle such data.

As a result, the Human Genome Project has progressed slowly, and building various human genome maps and carrying out analytical research with them has been difficult.

The present invention was devised to solve the conventional problems described above. Its object is to provide a GPU-based parallel distributed processing system and method, based on dividing the search and computation functions over a highly integrated index DB and Query data, that uses RVR files and parallel distributed processing to shorten the processing time of comparative analysis between large-scale data sets and allows the comparative analysis to be carried out on a relatively inexpensive system.

According to a feature of the present invention for achieving the above object, the present invention comprises: a database in which reference data, serving as the basis of comparative analysis, is stored divided into preset division units; a Query file generation unit that generates a Query file by dividing target data input by a user into the division units of the reference data; and an operation unit that compares the Query file with the reference data to analyze the correspondence between the target data and the reference data.

The operation unit comprises: a multiple search operation unit; a single search operation unit; and an operation controller. The multiple search operation unit includes a Query temporary search file generation unit that, when the Query file is a multiple Query file, generates Query temporary search files by splitting the Query file, already divided into the division units of the reference data, into split units smaller than the division units, and a parallel operation unit that includes a GPU processor and, in each core of the GPU, compares each Query of a Query temporary search file with the reference data and records the position of the division unit of the reference data containing that Query to generate a search result file. The single search operation unit includes a CPU processor and, when the Query file is a single Query file, compares each Query of the Query file with the reference data and records the position of the division unit of the reference data containing that Query to generate a search result file. The operation controller either assigns each Query temporary search file generated by the Query temporary search file generation unit to a GPU core of the parallel operation unit so that the multiple search operation is performed, or assigns the Query to the single search operation unit so that the single search operation is performed.

Here, the database may store a highly integrated index DB and a reference DB for the reference data.

The operation controller may store or output the search result file, and may calculate a comparative analysis result for the correspondence between the target data and the reference data by using the search result file, the order of the division units of the Query file, and the order of the reference data.

In addition, the reference data and the Query file may each be stored in the form of an RVR (rack of virtual RAM) file, in which a corresponding identifier is assigned to each division unit, and a RAT (record allocation table) file, in which the physical recording positions of the RVR file are stored. The Query temporary search file may likewise be stored as an RVR file, in which a corresponding identifier is assigned to each split unit, and a RAT file, in which the physical recording positions of the RVR file are stored.

Meanwhile, the present invention also includes a GPU-based parallel distributed processing method, based on dividing the search and computation functions over a highly integrated index DB and Query data, for processing data in a parallel distributed manner using a GPU-based parallel distributed processing system comprising: a database in which reference data is stored divided into preset division units; a Query file generation unit that generates a Query file from input target data; a multiple search operation unit equipped with a GPU processor to perform multiple search operations; a single search operation unit equipped with a CPU processor to perform single search operations; and an operation controller that controls the multiple search operation unit and the single search operation unit. The method comprises the steps of:
(A) receiving target data from a user;
(B) generating a Query file by dividing the target data into the division units of the reference data;
(C) when the Query file is a single Query file, the operation controller causing the single search operation unit to perform a single search operation and generate a search result file;
(D) when the Query file is a multiple Query file, generating Query temporary search files by splitting the Query file into split units smaller than the division units;
(E) the operation controller assigning the Query temporary search files to the cores of the GPU of the multiple search operation unit so that a multiple search operation is performed and a search result file is generated; and
(F) calculating the correspondence between each division unit of the Query file and the reference data by using the generated search result file.

Here, the multiple search operation of step (E) may be performed by comparing each Query of the Query temporary search file with the reference data and recording the position of the division unit of the reference data that contains the Query.

Step (F) may be performed by determining, from the search results, the position of the division unit of the reference data that contains all, or the largest number, of the Queries of the Query file, and thereby calculating the correspondence between the division unit of the Query file and the reference data.

Step (F) may further include an operation that calculates the correspondence between the target data and the reference data by using the analysis result for each division unit of the Query file, the physical order of the division units of the Query file, and the physical order of the division units of the reference data.

The reference data, the Query file, and the Query temporary search file may each be divided into preset division units, with a corresponding identifier assigned to each division unit, and formed as RAT/RVR files in which the physical recording position of each division unit is stored.

The GPU-based parallel distributed processing system and method according to the present invention, based on dividing the search and computation functions over a highly integrated index DB and Query data as described above, can be expected to provide the following effects.

First, compared with conventional computation using only a CPU, a large number of multiple search operations are divided among the cores of the GPU, so the comparative analysis runs faster.

Second, because the data is split into multiple parts for parallel distribution and the correspondence is calculated from those parts by statistical techniques, the accuracy of the comparative analysis is improved.

Third, because the reference data and the multiply split data are generated and stored in RVR form, access to the data is faster, which further increases the speed of the overall comparative analysis.

FIG. 1 is an exemplary view showing a conventional parallel distributed processing method for base sequences using simple processing.
FIG. 2 is a block diagram showing the configuration of a specific embodiment of the data parallel distributed processing system according to the present invention.
FIG. 3 is an exemplary view illustrating the operations performed by each block of the data parallel distributed processing system according to the present invention.
FIG. 4 is a flowchart showing a specific embodiment of the data parallel distributed processing method according to the present invention.
FIG. 5 is an exemplary view showing an example of document original text data according to the present invention.
FIG. 6 is an exemplary view showing an example of document reference data according to the present invention.
FIG. 7 is an exemplary view showing an example of target data according to the present invention.
FIG. 8 is an exemplary view showing an example of a document Query file according to the present invention.
FIG. 9 is an exemplary view showing an example of a document Query temporary search file according to the present invention.
FIG. 10 is an exemplary view showing an example of a document search result file according to the present invention.
FIG. 11 is an exemplary view showing an example of base sequence original data according to the present invention.
FIG. 12 is an exemplary view showing an example of base sequence reference data according to the present invention.
FIG. 13 is an exemplary view showing an example of target data according to the present invention.
FIG. 14 is an exemplary view showing an example of a base sequence Query file according to the present invention.
FIG. 15 is an exemplary view showing an example of a base sequence Query temporary search file according to the present invention.
FIG. 16 is an exemplary view showing an example of a base sequence search result file according to the present invention.

Hereinafter, a specific embodiment of the GPU-based parallel distributed processing system and method according to the present invention, based on dividing the search and computation functions over a highly integrated index DB and Query data, is described in detail with reference to the accompanying drawings.

FIG. 2 is a block diagram showing the configuration of a specific embodiment of the data parallel distributed processing system according to the present invention, and FIG. 3 is an exemplary view illustrating the operations performed by each block of that system.

As shown in these figures, a specific embodiment of the GPU-based parallel distributed processing system according to the present invention comprises a Query file generation unit (100), an operation unit (200), and a database (300). The database (300) comprises a reference DB (310) for comparison and a highly integrated index DB (320), and the operation unit (200) comprises a multiple search operation unit (220), a single search operation unit (230), and an operation controller (210).

The multiple search operation unit (220) comprises a Query temporary search file generation unit (221) and a parallel operation unit (222).

The Query file generation unit (100) divides the input target data in a preset manner to generate a Query file in RAT/RVR file form. Here, an RVR file is a file format obtained by dividing the original data into set division units and converting it, and a RAT file is a file that records the positions of, and relationships between, the recording units of the RVR file.
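The idea of a divided data file paired with a position table can be illustrated with the following Python sketch. It is only a hypothetical in-memory stand-in: the actual RVR and RAT file formats are defined in Patent Registration No. 2009-0880531 and are not reproduced here. The class name `DividedData`, the function `divide`, the fixed unit size, and the `ID` prefix are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class DividedData:
    """Hypothetical stand-in for an RVR/RAT pair; not the real file format."""
    units: dict[str, str]                 # RVR-like: identifier -> division unit content
    offsets: dict[str, tuple[int, int]]   # RAT-like: identifier -> (start, length) in the original


def divide(data: str, unit_size: int, prefix: str = "ID") -> DividedData:
    """Divide `data` into fixed-size division units and record where each one came from."""
    if unit_size <= 0:
        raise ValueError("unit_size must be positive")
    units: dict[str, str] = {}
    offsets: dict[str, tuple[int, int]] = {}
    for number, start in enumerate(range(0, len(data), unit_size), start=1):
        identifier = f"{prefix}{number}"
        chunk = data[start:start + unit_size]
        units[identifier] = chunk
        offsets[identifier] = (start, len(chunk))
    return DividedData(units, offsets)


if __name__ == "__main__":
    print(divide("ACGTACGTAC", 4))
```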

The structure and characteristics of the RVR and RAT files are disclosed in the publication of Patent Registration No. 2009-0880531 and are therefore not described in detail in this specification.

Meanwhile, if the target data is input in units smaller than the division unit of the reference data described later, or is input already divided into the same units as the reference data, the Query file generation unit (100) may be omitted.

As shown in FIG. 3, the reference DB (310) is a DB in which the reference data is divided into preset division units and each unit is given an identifier, and the highly integrated index DB (320) is a DB that indexes the division units of the reference DB by Query unit. In the subsequent operation of the single search operation unit (230) or the multiple search operation unit (220), each Query can be compared against the highly integrated index DB, which shortens the computation time.
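One way to picture such an index is an inverted index from each Query-sized substring to the identifiers of the reference division units that contain it, as in the Python sketch below. This is an assumption made for illustration; the source does not describe the internal layout of the highly integrated index DB, and the names `build_index` and `lookup` are hypothetical.

```python
from collections import defaultdict


def build_index(reference_units: dict[str, str], query_len: int) -> dict[str, set[str]]:
    """Map every substring of length `query_len` to the division units containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for identifier, content in reference_units.items():
        for start in range(len(content) - query_len + 1):
            index[content[start:start + query_len]].add(identifier)
    return dict(index)


def lookup(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the identifiers of the division units that contain `query`."""
    return index.get(query, set())


if __name__ == "__main__":
    reference = {"ID1": "ACGTAC", "ID2": "GTACGG"}
    idx = build_index(reference, 4)
    print(sorted(lookup(idx, "GTAC")))
```

With an index of this kind, each Query is resolved by a single lookup instead of a scan over every division unit, at the cost of building and storing the index in advance.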

Here, depending on the Query file, the Query file generation unit may generate a single Query file, or may generate a multiple Query file that includes a highly integrated index file.

Meanwhile, the operation unit (200) compares and analyzes the Query file generated by the Query file generation unit (100) against the reference data described later and, as stated above, comprises the multiple search operation unit (220), the single search operation unit (230), and the operation controller (210).

The operation controller (210) carries out the comparative analysis of the Query file generated by the Query file generation unit (100) through either the single search operation unit (230) or the multiple search operation unit (220), both described later, depending on the Query file.

When the Query file is a single file, the operation controller (210) performs the comparative analysis through the single search operation unit (230); when the Query file is a multiple file, it performs the comparative analysis through the multiple search operation unit (220).
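The routing decision can be sketched as follows in Python. The `QueryFile` class, its `is_multiple` flag, and the two callables are hypothetical; the source does not say how a Query file is identified as single or multiple, so the flag is simply assumed to be known.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class QueryFile:
    queries: list[str]
    is_multiple: bool  # assumption: the kind of Query file is already known


def run_comparison(
    query_file: QueryFile,
    single_search: Callable[[list[str]], Any],
    multi_search: Callable[[list[str]], Any],
) -> Any:
    """Route a Query file the way the operation controller (210) is described as doing."""
    if query_file.is_multiple:
        # Multiple Query file: hand over to the GPU-backed multiple search path (220).
        return multi_search(query_file.queries)
    # Single Query file: hand over to the CPU-backed single search path (230).
    return single_search(query_file.queries)


if __name__ == "__main__":
    cpu = lambda qs: ("cpu", qs)
    gpu = lambda qs: ("gpu", qs)
    print(run_comparison(QueryFile(["ACGT"], False), cpu, gpu))
```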

The multiple search operation unit (220), in turn, comprises the Query temporary search file generation unit (221) and the parallel operation unit (222).

The Query temporary search file generation unit (221) generates, for the Query file produced by the Query file generation unit (100), the Query temporary search files on which the comparative search is performed.

A Query temporary search file is a file in RVR/RAT form produced by splitting the Query file formed by the Query file generation unit (100) into split units smaller than the division unit.
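The splitting step can be sketched in Python as below. The sketch assumes fixed-size, non-overlapping pieces and plain dictionaries rather than RVR/RAT files; the function name `split_units` and the `Q1, Q2, …` labels per unit are illustrative choices, not details confirmed by the source.

```python
def split_units(query_units: dict[str, str], split_size: int) -> dict[str, dict[str, str]]:
    """Split each division unit of a Query file into smaller pieces.

    Returns, for every division unit identifier (e.g. K1), a mapping from
    piece identifier (Q1, Q2, ...) to the piece content.
    """
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    temp_files: dict[str, dict[str, str]] = {}
    for unit_id, content in query_units.items():
        temp_files[unit_id] = {
            f"Q{number}": content[start:start + split_size]
            for number, start in enumerate(range(0, len(content), split_size), start=1)
        }
    return temp_files


if __name__ == "__main__":
    print(split_units({"K1": "ACGTACGT", "K2": "GGCC"}, 4))
```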

Accordingly, when the operation controller (210) performs the comparative analysis through the multiple search operation unit (220), it has the Query temporary search file generation unit (221) generate Query temporary search files from each Query file and assigns those files to the GPU cores of the parallel operation unit (222), one per core, so that the cores perform the search operation in parallel.
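The work partitioning can be illustrated with the Python sketch below, in which a thread pool stands in for the GPU cores. This only shows how one task per Query temporary search file is handed out and collected; it is not a GPU implementation and says nothing about actual CUDA performance. The names `search_one` and `parallel_search` and the substring test used for matching are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def search_one(temp_file: list[str], reference_units: dict[str, str]) -> dict[str, list[str]]:
    """For each Query, record the identifiers of the reference division units containing it."""
    return {
        query: [uid for uid, content in reference_units.items() if query in content]
        for query in temp_file
    }


def parallel_search(
    temp_files: list[list[str]],
    reference_units: dict[str, str],
    workers: int = 4,
) -> list[dict[str, list[str]]]:
    """Run one search task per Query temporary search file; workers stand in for GPU cores."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input files.
        return list(pool.map(lambda f: search_one(f, reference_units), temp_files))


if __name__ == "__main__":
    reference = {"ID1": "ACGTAC", "ID2": "GTACGG"}
    print(parallel_search([["ACGT", "GTAC"], ["CGG"]], reference))
```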

The parallel operation unit (222) compares the Query temporary search file assigned to each core with the reference data stored in the database (300), described later, to generate a search result file, and includes one or more GPUs.

Here, CUDA-based parallel distributed computation on the GPU is used, in which massive data is divided and assigned to the individual GPU cores for processing.

A single recent GPU contains 240 cores and offers a memory interface speed about ten times higher. Accordingly, compared with processing data on a single CPU alone, processing data on a GPU by parallel distributed computation can reach a speedup of 200 times or more.

The parallel operation unit (222) includes an appropriate number of such GPUs and, through each GPU core assigned by the operation controller (210), compares the Query temporary search files with the reference data to produce the search results.

Meanwhile, the single search operation unit (230) compares a single Query file with the reference data stored in the database (300) to generate a search result file, and includes a CPU (232).

The database (300) stores the reference data on which the operations of the multiple search operation unit (220) and the single search operation unit (230) are based. The reference data is also stored in RVR/RAT file form, in order to improve access speed and provide a basis that is easy to process.

The functions of the components of the data parallel distributed processing system according to the present invention have been described above.

Below, the data parallel distributed processing method according to the present invention is first described in processing order with reference to FIGS. 3 and 4, and an example of data parallel distributed processing carried out according to the present invention is then examined with reference to FIGS. 5 to 16.

FIG. 4 is a flowchart showing a specific embodiment of the data parallel distributed processing method according to the present invention.

As shown in the figure, the data parallel distributed processing method using parallel distributed computation according to the present invention begins when target data is input by a user (S110).

When the target data is input, the Query file generation unit (100) divides it in a preset manner to generate a Query file in RAT/RVR file form (S120).

As shown in FIG. 3, the Query file generated by the Query file generation unit (100) is formed as RAT/RVR files by dividing the target data into a plurality of units and including the corresponding identifiers (Offset List) (K1, K2, …) and the recording positions.

The recording positions serve to speed up data access and are not illustrated or described separately below.

If the Query file is a single Query file, the operation controller (210) uses the CPU (232) of the single search operation unit (230) to compare each Query of the single Query file with the reference data stored in the database (300), and performs the single search operation by recording the corresponding identifiers (ID1, ID2, …) of the reference data that match the Query (S130). As noted, the Query file may be formed as a single Query file or as a multiple Query file.

If, on the other hand, the Query file is a multiple Query file, the Query temporary search file generation unit (221) generates a Query temporary search file for each split unit of the Query file (S140).

Here too, the Query temporary search file is formed as RAT/RVR files by splitting each division unit of the Query file into split units and including the corresponding identifiers (Offset List) (Q1, Q2, …).

The operation controller (210) then assigns the Query temporary search files to the respective cores of the GPU (222) of the parallel operation unit (222) (S150).

Once the Query temporary search files have been assigned to the cores of the GPU, the parallel operation unit (222) compares each Query of the Query temporary search files with the reference data stored in the database (300), and performs the multiple search operation by recording the corresponding identifiers (ID1, ID2, …) of the reference data that match the Query (S160).

After the search operation has been performed in the single search operation unit (230) and the multiple search operation unit (220), the operation controller (210) stores the search results (S170).

This storing step is optional.

Here, as shown in FIG. 3, the search result file is data recording the corresponding identifiers (ID1, ID2, …) of the reference data that match each Query, and is structured so that the corresponding identifiers (ID1, ID2, …) are listed for each Query (Q1, Q2, Q3, …).

Thereafter, the operation controller (210) derives and provides an analysis result using the stored search results. Here, the analysis determines, among the corresponding identifiers (ID1, ID2, …) of the reference data found for the individual Queries, the corresponding identifier that is included for all Queries or the one that is included for the largest number of Queries.
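The analysis step can be illustrated with a short Python sketch. The function name `analyze` and the sample search result are hypothetical; the sketch simply computes the identifiers present for every Query and the identifiers present for the largest number of Queries.

```python
from collections import Counter

def analyze(search_result):
    """Return (IDs found for every Query, IDs found for the most Queries)."""
    id_sets = [set(ids) for ids in search_result.values()]
    common = set.intersection(*id_sets) if id_sets else set()
    counts = Counter(rid for ids in id_sets for rid in ids)
    top = max(counts.values(), default=0)
    most_frequent = {rid for rid, n in counts.items() if n == top}
    return common, most_frequent

# Hypothetical search result file: Query -> matching reference identifiers.
search_result = {
    "Q1": ["ID1", "ID5"],
    "Q2": ["ID3", "ID5", "ID8"],
    "Q3": ["ID5", "ID6"],
}
common, most_frequent = analyze(search_result)
```

When no identifier is shared by all Queries, `common` is empty and `most_frequent` provides the fallback described in the text.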

As described above, when the data to be checked and the reference data are gene base sequences, the base sequences of different individuals do not match in their entirety, so the result must be derived on a probabilistic basis.

Here, the analysis result may be a comparative analysis result indicating the correspondence between each division unit (K1, K2, …) of the Query file and the reference data. Alternatively, it may be a comparative analysis result indicating the correspondence between the entire data to be checked and the reference data, derived from the correspondences between the individual division units (K1, K2, …) of the Query file and the reference data.

That is, the physical order within the Query file can be determined from its division units (K1, K2, …), and the order within the reference data can likewise be determined from its corresponding identifiers (ID1, ID2, …). The most probable correspondence between the data to be checked and the reference data can therefore be derived from the ordering of the Query file and the ordering of the reference data.

Hereinafter, examples in which data comparison analysis is performed using the parallel distributed computation of the present invention will be described with actual data samples.

First, the case that is easiest to explain, in which the data to be checked and the reference data are document data, will be described with reference to FIGS. 5 to 10.

FIG. 5 shows an example of original document data according to the present invention, FIG. 6 shows an example of document reference data, FIG. 7 shows an example of data to be checked, FIG. 8 shows an example of a Query file, FIG. 9 shows an example of a document Query temporary search file, and FIG. 10 shows an example of a document search result file.

Assuming that the original data is document data with the content shown in FIG. 5, the reference data stored in the database (300) is a RAT/RVR file as shown in FIG. 6.

Here, the division unit of the reference data is the sentence. The entire original data is therefore divided into sentences, and a corresponding identifier (ID1, ID2, …) is assigned to each sentence.

Suppose that the data to be checked, which the user wants compared and analyzed, is the text shown in FIG. 7: 'In recognition of these efforts, it was certified by WIPO in 2006 as an official international intellectual property education institution. During 2009 as well, it successfully held training courses for foreigners, such as ASEAN patent examiner training, and various international seminars on more than ten occasions.' In this case, the generated Query file is, as shown in FIG. 8, an RVR file in which the data to be checked is divided into sentences and corresponding identifiers (K1, K2) are assigned.

The Query temporary search file generated from the Query file is an RVR file in which the sentence-level Query file is further divided into words; the Query temporary search file for Query file K1 is shown in FIG. 9. Since one Query temporary search file is generated for K1 and another for K2, the result is a set of multiple Query temporary search files.
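The two-level division of the document example (sentences, then words) can be sketched in Python as below. The splitting rules are naive assumptions (sentences end at '.', '!' or '?'; words are runs of word characters), the sample text is hypothetical, and the RAT/RVR file layout itself is not modeled.

```python
import re

def split_sentences(text):
    """Divide text into sentence units (naive split after '.', '!' or '?')."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def make_query_file(text):
    """Assign K1, K2, ... to each sentence of the data to be checked."""
    return {f"K{i}": s for i, s in enumerate(split_sentences(text), start=1)}

def make_temp_search_files(query_file):
    """Split each sentence into word Queries Q1, Q2, ... (one file per sentence)."""
    return {
        kid: {f"Q{j}": w for j, w in enumerate(re.findall(r"\w+", sentence), start=1)}
        for kid, sentence in query_file.items()
    }

target = "It was certified in 2006. It held seminars in 2009."
query_file = make_query_file(target)
temp_files = make_temp_search_files(query_file)
```

Two sentences yield two Query temporary search files, which is what makes this a multiple Query case.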

Each Query of the Query temporary search file is then compared with the reference data, and the corresponding identifiers of the reference units that contain the Query are stored; the resulting search result file is shown in FIG. 10.

In the search results shown in FIG. 10, the corresponding identifier common to all of the Queries is ID5. The single search operation result therefore shows that Query file K1 corresponds to ID5 of the reference data.

Next, the case in which the data to be checked and the reference data are base sequence data will be described with reference to FIGS. 11 to 16.

The present invention becomes more effective as the reference data and the data to be checked grow larger. It also becomes more effective the simpler the composition of the reference data and the data to be checked is and the more identical repeated structures they contain.

In this sense, the present invention is particularly effective when the reference data and the data to be checked are base sequences.

FIG. 11 shows an example of original base sequence data according to the present invention, FIG. 12 shows an example of base sequence reference data, FIG. 13 shows an example of data to be checked, FIG. 14 shows an example of a base sequence Query file, FIG. 15 shows an example of a base sequence Query temporary search file, and FIG. 16 shows an example of a base sequence search result file.

Assuming that the original data is the human base sequence shown in FIG. 11, the total number of bases is about 3.2 billion.

Assuming that the division unit of the reference data stored in the database (300) is 12 bases, the reference data is divided into about 270 million units (ID1 to ID270,000,000), as shown in FIG. 12.
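As a rough check of the figures, and as a sketch of the division itself, the following Python snippet cuts a sequence into consecutive 12-base units and counts how many such units about 3.2 billion bases would produce. The function name `divide_reference` and the toy sequence are hypothetical.

```python
UNIT = 12  # division unit length assumed in the text

def divide_reference(sequence, unit=UNIT):
    """Cut a sequence into consecutive units and label them ID1, ID2, ..."""
    return {
        f"ID{i // unit + 1}": sequence[i:i + unit]
        for i in range(0, len(sequence), unit)
    }

# Toy 36-base sequence (hypothetical), which yields 3 units.
toy_sequence = (
    "ACGTACGTACGT"
    "TTTTACGTGGGG"
    "GGGGCCCCAAAA"
)
units = divide_reference(toy_sequence)

# Number of 12-base units in a sequence of about 3.2 billion bases (rounded up).
genome_units = (3_200_000_000 + UNIT - 1) // UNIT
```

The count comes to roughly 267 million units, which is in line with the identifier range ID1 to ID270,000,000 quoted above.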

Assume further that the input data to be checked consists of 24 bases, as shown in FIG. 13. In practice the data to be checked may also be an entire human gene base sequence, but for convenience of explanation this specification assumes a partial sequence.

Here, the Query file is divided into units of 12 bases, the same as the division unit of the reference data, and thus becomes a multiple Query file as shown in FIG. 14.

Among the Query temporary search files generated from the Query file, the one for Query file K1 is formed, as shown in FIG. 15, by dividing the K1 Query file into pieces of a preset length (assumed here to be 4) that overlap one another, with the starting point shifted by one base each time: the 1st to the 4th base, the 2nd to the 5th base, the 3rd to the 6th base, …, and the 9th to the 12th base.
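The overlapping division can be sketched in Python as a sliding window. The function name `sliding_queries` and the sample 12-base unit are hypothetical; the window length of 4 follows the assumption in the text.

```python
def sliding_queries(unit, length=4):
    """Overlapping windows of the given length, shifting the start by one base."""
    return {
        f"Q{i + 1}": unit[i:i + length]
        for i in range(len(unit) - length + 1)
    }

k1 = "ACGTTGCAAGCT"  # hypothetical 12-base Query file K1
queries = sliding_queries(k1)
```

A 12-base unit with a window of 4 yields 9 Queries, the last of which covers the 9th to the 12th base.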

Each Query of the Query temporary search file is then compared with the reference data, and the corresponding identifiers of the reference units that contain the Query are stored; the resulting search result file is shown in FIG. 16.

In the search results shown in FIG. 16, the corresponding identifier common to all of the Queries is ID5, so Query file K1 is found to correspond to ID5 of the reference data.

These search results are those of an actual comparative analysis on the assumed data. Even though the reference data is only 132 bases long and the comparison covers only 12 bases, half of the data to be checked (24 bases), the number of commonly included identifiers (ID1, ID3, ID5, ID6, ID8) is relatively large.

Accordingly, the effect of the present invention becomes pronounced when the comparative analysis is carried out on the entire base sequence of actual human genes.

Furthermore, the comparative analysis result may be the correspondence between the data to be checked as a whole and the reference data.

That is, in the example above, Query file K1 corresponds to reference data ID5. If the search result for Query file K2 is assumed to correspond to ID6 and ID9 of the reference data, then, judging by the order of K1 and K2 and the order of ID5, ID6, and ID9, Query file K2 is determined to correspond to reference data ID6.

The final comparative analysis result is therefore that the data to be checked is the base sequence corresponding to ID5 through ID6 of the reference data.
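One way to picture this order-based selection is the Python sketch below. It rests on a simplifying assumption that is stricter than the probabilistic judgment described in the specification: consecutive Query units (K1, K2, …) are required to map to consecutive reference identifiers. The function name `resolve_by_order` and the use of integer identifiers are hypothetical.

```python
def resolve_by_order(candidates):
    """Pick one reference ID per Query unit so that consecutive units map to consecutive IDs.

    candidates: list of lists of integer reference IDs, in Query-file order (K1, K2, ...).
    Returns the first chain found, or None if no consecutive chain exists.
    """
    def extend(chain, rest):
        if not rest:
            return chain
        nxt = chain[-1] + 1
        return extend(chain + [nxt], rest[1:]) if nxt in rest[0] else None

    for start in (candidates[0] if candidates else []):
        chain = extend([start], candidates[1:])
        if chain is not None:
            return chain
    return None

# K1 -> ID5; K2 -> ID6 or ID9 (the case assumed in the text).
mapping = resolve_by_order([[5], [6, 9]])
```

With K1 fixed at ID5, only ID6 continues the sequence, so K2 resolves to ID6 and the data to be checked maps to ID5 through ID6.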

The scope of the present invention is not limited to the embodiments described above but is defined by the claims, and it is evident that a person of ordinary skill in the art can make various modifications and adaptations within the scope of the claims.

The present invention relates to a system that analyzes the correspondence between input data to be checked and reference data stored in a database by processing them in a parallel, distributed manner. According to the present invention, a large volume of multiple search operations is divided among the cores of a GPU, so the comparative analysis runs faster than conventional CPU-based comparative analysis.

100: Query file generation unit 200: operation unit
210: operation controller 220: multiple search operation unit
221: Query temporary search file generation unit 222: parallel operation unit
230: single search operation unit 232: CPU
300: database 310: reference database
320: high-density index database

Claims

A graphics processor-based parallel distributed processing system that separates the search and computation functions of a high-density index database and query data, comprising:
a database in which reference data, serving as the basis for comparative analysis, is divided into predetermined division units and stored;
a Query file generation unit that generates a Query file by dividing data to be checked, input by a user, by the division unit of the reference data; and
an operation unit that compares the Query file with the reference data to analyze the correspondence between the data to be checked and the reference data,
wherein the operation unit comprises:
a multiple search operation unit including a Query temporary search file generation unit that, when the Query file is a multiple Query file, generates Query temporary search files by dividing the Query file, already divided by the division unit of the reference data, into sub-units smaller than the division unit, and a parallel operation unit having a GPU, each core of which compares each Query of a Query temporary search file with the reference data and records the position of the division unit of the reference data containing the Query to generate a search result file;
a single search operation unit having a CPU that, when the Query file is a single Query file, compares each Query of the Query file with the reference data and records the position of the division unit of the reference data containing the Query to generate a search result file; and
an operation controller that allocates each Query temporary search file generated by the Query temporary search file generation unit to the GPU cores of the parallel operation unit to perform a multiple search operation, or assigns the Query to the single search operation unit to perform a single search operation.

The system of claim 1,
wherein the database stores a high-density index DB and a reference DB for the reference data.

The system of claim 2,
wherein the operation controller stores or outputs the search result file, and
derives a comparative analysis result for the correspondence between the data to be checked and the reference data using the search result file, the order relation among the division units of the Query file, and the order relation within the reference data.

The system of any one of claims 1 to 3,
wherein the reference data and the Query file are stored in the form of a rack of virtual RAM (RVR) file in which a corresponding identifier is assigned to each division unit and a record allocation table (RAT) file storing the physical recording location of the RVR file; and
the Query temporary search file is stored in the form of an RVR file in which a corresponding identifier is assigned to each split unit and a RAT file storing the physical recording location of the RVR file.

A parallel distributed data processing method using a graphics processor-based parallel distributed processing system that comprises a database storing reference data divided into predetermined division units, a Query file generation unit that generates a Query file from input data to be checked, a multiple search operation unit having a GPU to perform multiple search operations, a single search operation unit having a CPU to perform single search operations, and an operation controller that controls the multiple search operation unit and the single search operation unit, the method comprising:
(A) receiving data to be checked from a user;
(B) generating, by the Query file generation unit, a Query file by dividing the data to be checked by the division unit of the reference data;
(C) if the Query file is a single Query file, performing, by the operation controller, a single search operation using the single search operation unit to generate a search result file;
(D) if the Query file is a multiple Query file, generating Query temporary search files by dividing the Query file into sub-units smaller than the division unit;
(E) allocating, by the operation controller, the Query temporary search files to the respective cores of the GPU of the multiple search operation unit and performing a multiple search operation to generate a search result file; and
(F) deriving the correspondence between each division unit of the Query file and the reference data using the generated search result file.

The method of claim 5,
wherein the multiple search operation of step (E)
compares each Query of the Query temporary search file with the reference data and records the position of the division unit of the reference data containing the Query.

The method of claim 6,
wherein step (F)
derives the correspondence between each division unit of the Query file and the reference data by determining, from the search results, the position of the division unit of the reference data that contains all or the largest number of the Queries in the Query file.

The method of claim 7,
wherein step (F)
further comprises deriving the correspondence between the data to be checked and the reference data using the analysis result for each division unit of the Query file, the physical order of the division units of the Query file, and the physical order of the division units of the reference data.

The method of any one of claims 5 to 8,
wherein the reference data and the Query file are stored in the form of a rack of virtual RAM (RVR) file in which a corresponding identifier is assigned to each division unit and a record allocation table (RAT) file storing the physical recording location of the RVR file; and
the Query temporary search file is stored in the form of an RVR file in which a corresponding identifier is assigned to each split unit and a RAT file storing the physical recording location of the RVR file.