KR20150025326A

KR20150025326A - Index-based data process system

Info

Publication number: KR20150025326A
Application number: KR20130102831A
Authority: KR
Inventors: 민덕기; 예쉬화
Original assignee: 건국대학교 산학협력단
Priority date: 2013-08-29
Filing date: 2013-08-29
Publication date: 2015-03-10
Also published as: KR101567861B1

Abstract

Provided is an index-based data processing system capable of increasing data processing speed by generating indexes for data blocks and filtering a block desired to be processed. According to an embodiment of the present invention, an index-based data processing system includes: a database unit configured to divide a received file into a plurality of blocks and store the divided blocks; an index generating unit included in the database unit to generate a plurality of indexes for the blocks; and a processing unit configured to process the blocks according to queries, wherein the processing unit extracts specified data from a specified block based on the indexes and processes the extracted specified data.

Description

INDEX-BASED DATA PROCESS SYSTEM

The present invention relates to an index-based data processing system, and more particularly to a data processing system that improves data processing speed by filtering for specific blocks.

With the advent of Web 2.0, the paradigm of Internet services has shifted from provider-centric to user-centric, and the market for Internet services such as UCC and personalization services is growing rapidly. As a result of this shift, the amount of data that is generated by users and must be collected, processed, and managed for Internet services is increasing quickly. To collect, process, and manage such large volumes of data, many Internet portals are researching techniques for building large clusters at low cost and using them for distributed data management and distributed parallel job processing. Among these techniques, the MapReduce model from Google has drawn attention as one of the representative distributed parallel processing methods.

The MapReduce model is a distributed parallel processing programming model proposed by Google to support distributed parallel computation over large volumes of data stored on clusters made up of many low-cost nodes. Distributed parallel processing systems based on the MapReduce model include Google's MapReduce system and the Hadoop MapReduce system of the Apache Software Foundation. Hadoop is currently the best-known framework supporting the MapReduce paradigm. Because the Hadoop framework provides a scalable and reliable distributed computing environment, many developers run successful open source projects on top of it.

For an Internet portal that provides Internet services, the ability to extract meaningful information as quickly as possible from the vast amounts of stream data collected at high speed, and to serve it to users, determines the company's competitiveness. However, big data processing systems such as existing MapReduce systems process all input data without filtering, so they perform many unnecessary operations, which has raised questions about the efficiency of their data processing.

The present invention was devised in view of the above problem. Its technical object is to provide a data processing system that improves data processing speed by generating indexes for data blocks and filtering for the blocks to be processed.

The technical objects of the present invention are not limited to those mentioned above, and other objects not mentioned will be clearly understood by those skilled in the art from the following description.

To solve the above technical problem, an index-based data processing system according to an embodiment of the present invention includes a database unit that divides a received file into a plurality of blocks and stores them, an index generating unit that is included in the database unit and generates a plurality of indexes for the plurality of blocks, and a processing unit that processes the plurality of blocks according to a query, wherein the processing unit extracts specific data from a specific block based on the plurality of indexes and processes the extracted data.

The plurality of indexes may include a first index and a second index.

The first index may provide information about a block, and the second index may provide information about the data in the block.

The plurality of blocks may have a structured or unstructured format.

The index generating unit may collect second indexes from the plurality of blocks, set a candidate group among the second indexes, and generate the first index from the candidate group.

The first index may be updated when a specific input waiting time has elapsed or when the second index changes.

The database unit may be configured as a distributed file system (DFS).

When a block is stored, the database unit may replicate the block, store the copies, and keep the location information of the block and its replicas.

The processing unit may include a first processing unit and a second processing unit.

The first processing unit may distribute the specific data and process it in parallel.

The second processing unit may aggregate the processed data to produce a result.

The processing unit may determine the amount of parallel distributed work for the first processing unit based on the first index.

If the index generating unit fails to generate the first index, the processing unit may determine the amount of work based on the second index.

A corresponding data processing method includes dividing a received file into a plurality of blocks, generating and storing a first index and a second index for the plurality of blocks, receiving a query from a client, selecting a specific block that includes the first index according to the query, and processing the specific block, wherein the processing includes extracting specific data from the specific block based on the second index and processing it.

When a specific input waiting time has elapsed or the second index changes, the first index may be updated.
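The update rule stated here (refresh the first index on a timeout or when a second index changes) can be sketched roughly as follows. This is only an illustration under assumptions: the class name `FirstIndexRefresher`, the `max_wait_seconds` parameter, and the use of a hash to detect changes in the second indexes do not come from the patent.

```python
import hashlib
import json


class FirstIndexRefresher:
    """Hypothetical helper that decides when the first index should be rebuilt."""

    def __init__(self, max_wait_seconds: float):
        self.max_wait_seconds = max_wait_seconds  # assumed form of the "input waiting time"
        self.last_update = 0.0
        self.last_fingerprint = None

    @staticmethod
    def fingerprint(second_indexes: dict) -> str:
        # Hash a canonical JSON form so that any change in a second index is detected.
        payload = json.dumps(second_indexes, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def needs_update(self, second_indexes: dict, now: float) -> bool:
        timed_out = now - self.last_update >= self.max_wait_seconds
        changed = self.fingerprint(second_indexes) != self.last_fingerprint
        return timed_out or changed

    def mark_updated(self, second_indexes: dict, now: float) -> None:
        # Record the state that the freshly built first index was derived from.
        self.last_update = now
        self.last_fingerprint = self.fingerprint(second_indexes)
```

Here `second_indexes` is assumed to map a block ID to a list of field names; a caller would rebuild the first index whenever `needs_update` returns `True` and then call `mark_updated`.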

According to the present invention as described above, data processing speed can be improved by generating indexes for block data and filtering for only the necessary blocks based on those indexes.

FIG. 1 is a diagram showing a schematic configuration of an index-based data processing system according to an embodiment of the present invention.
FIG. 2 is a diagram showing the indexes included in a block in an electrocardiogram (ECG) signal classification system according to an embodiment of the present invention.
FIG. 3 is a diagram showing an outline of the first index generation process in the index-based data processing system according to an embodiment of the present invention.
FIG. 4 is a diagram showing the first index generation process in detail in the index-based data processing system according to an embodiment of the present invention.
FIGS. 5 and 6 are diagrams showing the detailed data processing procedure of the index-based data processing system according to an embodiment of the present invention.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. The advantages and features of the present invention, and the methods of achieving them, will become apparent from the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. These embodiments are provided only so that the disclosure of the present invention is complete and so that those skilled in the art to which the invention belongs are fully informed of its scope; the invention is defined only by the scope of the claims. Like reference numerals refer to like elements throughout the specification.

Unless otherwise defined, all terms used in this specification (including technical and scientific terms) have the meanings commonly understood by those of ordinary skill in the art to which the present invention belongs. Terms defined in commonly used dictionaries are not to be interpreted in an idealized or excessive way unless explicitly and specifically defined.

The terminology used in this specification is for describing the embodiments and is not intended to limit the present invention. In this specification, the singular form also includes the plural unless specifically stated otherwise. As used in the specification, "comprises" and/or "comprising" do not exclude the presence or addition of one or more elements other than those stated.

Hereinafter, an index-based data processing system according to an embodiment of the present invention will be described with reference to the drawings.

FIGS. 1 and 2 show the basic configuration of an index-based data processing system according to an embodiment of the present invention.

The index-based data processing system according to this embodiment may include a database unit 100, an index generating unit 200, and a processing unit 300.

Specifically, the index-based data processing system according to this embodiment includes a database unit 100 that divides a received file into a plurality of blocks and stores them, an index generating unit 200 that is included in the database unit 100 and generates a first index 210 and a second index 220 for the plurality of blocks, and a processing unit 300 that processes the plurality of blocks according to a query, wherein the processing unit 300 extracts specific data from a specific block based on the second index 220 and processes it.

The web is a representative environment containing large-scale data that is difficult to process effectively with conventional methods. For web search, it is essential to find the desired content quickly and effectively in a large heap of heterogeneous data with no uniform format.

To this end, the database unit 100 according to an embodiment of the present invention may be configured as a distributed file system (DFS). A DFS is a technology that combines multiple computers into a large-scale storage system. A web search engine, for example, must store an enormous number of web pages from around the world. Because data on the Internet grows very quickly, large-scale data is stored across many hard disks in combination so that it can be kept safely and processed efficiently.

Because the database unit 100 uses large quantities of inexpensive hardware, the system can be designed on the premise that failures will occur. For this purpose, a distributed file system can always store multiple copies of each file, and can likewise keep multiple copies of the information about each file's contents and location. Because file contents and information are distributed across a plurality of clients, search time is shortened, and the workload does not concentrate on a particular client even when searches run from many places at once. For example, when a user in Korea searches for a specific word, the system locates and searches the copy of the stored information closest to that user. Even if a particular client fails, there is no concern about data loss because copies of the information exist elsewhere.

The database unit 100 also divides a received file into a plurality of blocks and can manage the blocks using a table structure. Blocks belonging to the same table have high similarity, so data processing speed can be improved by selecting only the tables needed for processing. To this end, the database unit 100 can handle data flexibly using structured or unstructured formats.

For example, a received document file has an unstructured format. Such a document file can be divided into data according to criteria such as natural language data and heading data, and then managed in a structured format.

The index generating unit 200 may generate a first index 210 and a second index 220 for the blocks stored in the database unit 100. The first index 210 provides information about a block, and the second index 220 provides information about the data contained in the block. Specifically, the database unit 100 can divide a received file into a plurality of blocks and manage them. Each block in turn consists of a plurality of data items, and these may be structured, unstructured, or semi-structured, an intermediate form between the two. Generating the second index 220 for the data makes it possible to filter out unnecessary data and improve processing speed. The detailed process by which the index generating unit 200 generates the first index 210 and the second index 220 is described later.
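One way to picture the two index levels is as a pair of small records attached to each block. The sketch below is an assumption-laden illustration rather than the patent's data layout: the class names, the choice of a field-name-to-column map for the second index, and the `keys`, `size_bytes`, and `location` attributes of the first index are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SecondIndex:
    """Describes the data inside one block: field name -> column position (assumed layout)."""
    fields: dict


@dataclass
class FirstIndex:
    """Describes the block itself; the attributes chosen here are assumptions."""
    keys: frozenset
    size_bytes: int
    location: str


@dataclass
class Block:
    block_id: str
    records: list
    second_index: SecondIndex
    first_index: Optional[FirstIndex] = None  # may be absent until it is generated


# A block of web log records whose second index says where each field lives.
block = Block(
    block_id="b0",
    records=[["example.com", "10.0.0.1", "12:00"]],
    second_index=SecondIndex(fields={"url": 0, "ip": 1, "time": 2}),
)
ip_column = block.second_index.fields["ip"]
ip_value = block.records[0][ip_column]
```

Keeping the first index optional mirrors the case, mentioned in the summary, where the first index could not be generated and only the second index is available.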

The processing unit 300 may process data according to a client's query and may include a first processing unit 310 and a second processing unit 320.

The first processing unit 310 distributes large-scale data across a plurality of clients and processes it in parallel to produce new data (intermediate results). The second processing unit 320 combines these intermediate results to produce the final desired result. The second processing unit also applies distributed processing that uses a plurality of clients simultaneously. The processing unit 300 may use the MapReduce framework for this, but is not limited to it, and may be implemented with various frameworks for processing large volumes of data, such as big data, quickly and safely.
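The division of labor between the two processing units resembles a map stage followed by a reduce stage. As a minimal sketch, assuming a word-count style job and using local threads in place of distributed nodes (the function names are hypothetical, and unlike the description the merge here runs in a single process):

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def first_stage(block: list) -> Counter:
    """First processing unit: produce an intermediate result for one block."""
    return Counter(block)


def second_stage(partials: list) -> Counter:
    """Second processing unit: combine intermediate results into the final result."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts instead of replacing them
    return total


def run(blocks: list, workers: int = 4) -> Counter:
    # Threads stand in for the parallel nodes described in the text (an assumption).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(first_stage, blocks))
    return second_stage(partials)
```

Calling `run([["a", "b"], ["a"]])` would yield a count of 2 for `"a"` and 1 for `"b"`.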

FIG. 3 shows an outline of the index generation process according to an embodiment of the present invention.

First, a file is uploaded from a client (S10), and the database unit performs a blocking step that divides the file into a plurality of blocks (S20). The resulting blocks are sent to the index generating unit, which generates the indexes (S30).
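The blocking step (S20) could look like the following in its simplest form. The description does not say how block boundaries are chosen, so fixed-size byte blocks and the function name `split_into_blocks` are assumptions made for illustration.

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Split uploaded file contents into consecutive fixed-size blocks (assumed policy)."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    # The final block may be shorter than block_size.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]
```

For record-oriented input such as log files, splitting at record boundaries instead of fixed byte offsets would avoid cutting a record in two.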

The index generating unit generates a second index for each block (S31). The second index provides information about the data contained in the block, and the processing unit can use it to select and process only the necessary data, which improves data processing speed.

Next, the index generating unit sets a candidate group among the generated second indexes (S32) and determines the first index from the candidate group (S33). The processing unit can determine the amount of work required for data processing based on the first index. In other words, blocks other than those that include a specific first index are filtered out, which improves the data processing speed of the first processing unit. The index generating unit generates the first index using key values in the second index data; the detailed process is described later with reference to FIG. 4.

The blocks, now including the first and second indexes generated by the index generating unit, are sent back to the database unit (S40). The returned blocks are reorganized into a table structure according to the similarity of their indexes.

Finally, the blocks are replicated and stored in a distributed manner, and the location information of the blocks is stored as well (S41, S42). As described above, a distributed file system uses large quantities of inexpensive hardware, so the system is designed on the premise that failures will occur, and the distributed file system can therefore always store multiple copies of each block. If an original block is deleted or altered because of a hardware failure or some other cause, the database unit can send a replica, using the replicas' location information, to minimize errors during processing.
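Replication with a location table (S41, S42) and the fallback to a replica can be sketched with an in-memory toy store. Everything here is an assumption for illustration: the class name `ReplicatedStore`, the round-robin placement policy, and the use of dictionaries as stand-ins for storage nodes.

```python
class ReplicatedStore:
    """Toy block store that keeps copies on several nodes plus a location table."""

    def __init__(self, node_names: list, replicas: int = 2):
        if not 1 <= replicas <= len(node_names):
            raise ValueError("replicas must be between 1 and the number of nodes")
        self.nodes = {name: {} for name in node_names}  # node -> {block_id: data}
        self.locations = {}                             # block_id -> [node, ...]
        self.replicas = replicas
        self._next = 0

    def put(self, block_id: str, data: bytes) -> None:
        names = list(self.nodes)
        # Round-robin placement (assumed); a real DFS would use a smarter policy.
        chosen = [names[(self._next + i) % len(names)] for i in range(self.replicas)]
        self._next = (self._next + 1) % len(names)
        for name in chosen:
            self.nodes[name][block_id] = data
        self.locations[block_id] = chosen  # remember where every copy lives

    def get(self, block_id: str) -> bytes:
        # Try each recorded location in turn so that one lost copy is not fatal.
        for name in self.locations[block_id]:
            data = self.nodes[name].get(block_id)
            if data is not None:
                return data
        raise KeyError(f"all replicas of {block_id} are unavailable")
```

If the node holding the first copy loses its data, `get` still returns the block from the next recorded location.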

FIG. 4 shows in detail how the index generating unit of the index-based data processing system according to an embodiment of the present invention generates the first index.

First, the index generating unit collects the second indexes from the blocks stored in the database unit and creates an index group (S100).

The index group is checked, and if it is not empty, an identifiable key is selected from the second indexes and keys containing block information are merged (S300). A selected key counts as identifiable when it meets the following conditions:

1. The key must not be contained in every one of the collected second indexes.

2. When a threshold greater than 0 and less than 1 is assigned (0 < threshold < 1), the number of second indexes containing the key must be less than the product of the total number of second indexes and the threshold.

The key's conditions are evaluated (S400). If the key is below the limit set by the threshold, the second indexes containing that key are set as the first index candidate group (S500). If the key exceeds that limit, the second indexes containing the key are excluded from the second index group (S410).

The first index can be generated from the candidate group using the keys obtained by repeating this process (S600).
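The key selection rule above can be sketched in a few lines. This is a simplified single-pass reading of steps S100 to S600, not the patent's exact procedure: it treats each second index as a set of keys per block, and the function name `build_first_index` and the choice to represent the first index as a map from key to block IDs are assumptions.

```python
from collections import defaultdict


def build_first_index(second_indexes: dict, threshold: float) -> dict:
    """Map each identifiable key to the blocks whose second index contains it."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must satisfy 0 < threshold < 1")
    if not second_indexes:  # empty index group: nothing to build
        return {}

    total = len(second_indexes)
    # Merge keys with block information: key -> blocks containing that key.
    blocks_by_key = defaultdict(set)
    for block_id, keys in second_indexes.items():
        for key in keys:
            blocks_by_key[key].add(block_id)

    first_index = {}
    for key, blocks in blocks_by_key.items():
        in_every_index = len(blocks) == total          # violates condition 1
        below_limit = len(blocks) < total * threshold  # condition 2
        if not in_every_index and below_limit:
            first_index[key] = blocks                  # key survives as a first-index key
    return first_index
```

With four blocks and a threshold of 0.6, a key present in all four blocks is rejected, while a key present in only two of them (2 < 4 × 0.6) is kept. Because the threshold is below 1, condition 2 already implies condition 1; both checks are kept to mirror the text.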

FIGS. 5 and 6 show the detailed data processing procedure of the index-based data processing system according to an embodiment of the present invention.

When a file is received, the database unit 100 may divide the file into blocks and send them to the index generating unit 200. The index generating unit 200 generates a first index and a second index for each block, and the database unit 100 may store the data in a table structure using the generated first and second indexes. At this point, the received blocks may be replicated and stored in a distributed manner, and the location information of the distributed blocks may also be stored.

When a user's query is received, the database unit 100 can select data based on the received query and the first index. As described above, an existing MapReduce system must process all of the data, whereas the index-based data processing system according to an embodiment of the present invention can use the first index to process only the desired blocks. Blocks other than those that include the first index are filtered out, and the remaining blocks may then be sent to the processing unit 300.
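Block selection by first index amounts to a lookup followed by a set intersection. In the sketch below, the first index is assumed to map a key to the set of block IDs carrying it, and a query is assumed to be a set of keys that must all match; neither assumption is stated in the description, and `select_blocks` is a hypothetical name.

```python
def select_blocks(first_index: dict, query_keys: set) -> set:
    """Return IDs of blocks whose first-index keys match every key in the query."""
    selected = None
    for key in query_keys:
        blocks = first_index.get(key, set())
        # Intersect so that only blocks matching all query keys remain.
        selected = set(blocks) if selected is None else selected & blocks
    return selected or set()
```

Only the returned blocks would be handed to the processing unit; all others are skipped without being read.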

The processing unit 300 may determine the workload based on the first indexes of the received blocks and distribute the work across a plurality of clients. The first index provides information about a block, which may include various details such as file size and location information.
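The description does not specify how the workload is derived from the first index. One plausible reading, shown here purely as an assumption, is to balance blocks across workers by the size recorded in each block's first index, using a greedy largest-first assignment (`assign_blocks` is a hypothetical name).

```python
import heapq


def assign_blocks(block_sizes: dict, workers: int) -> list:
    """Greedy balancing: give the next-largest block to the least-loaded worker."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    heap = [(0, w) for w in range(workers)]  # (assigned bytes, worker id)
    plan = [[] for _ in range(workers)]
    # Largest blocks first; ties are broken by block ID for a deterministic plan.
    for block_id, size in sorted(block_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        load, worker = heapq.heappop(heap)
        plan[worker].append(block_id)
        heapq.heappush(heap, (load + size, worker))
    return plan
```

A location-aware scheduler could additionally prefer the worker that already holds a replica of the block, using the location information the first index is said to carry.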

The first processing unit 310 can use the second index of each block to extract only the data needed for processing. As described above, the second index provides information about the data, which reduces unnecessary operations. The second processing unit 320 then aggregates the intermediate results produced by the first processing unit 310 to compute the final result.
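Extraction via the second index can be illustrated as a column projection. The sketch assumes that a second index maps field names to column positions within each record; that layout and the function name `extract_fields` are not taken from the patent.

```python
def extract_fields(records: list, second_index: dict, wanted: list) -> list:
    """Keep only the wanted fields of each record, located via the second index."""
    # A KeyError here means the block does not contain a requested field.
    columns = [second_index[name] for name in wanted]
    return [tuple(record[c] for c in columns) for record in records]
```

Downstream processing then touches only the projected fields instead of the full records.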

For example, when a query is received that counts the number of times a specific IP address appears on a specific website, the database unit 100 can search for the first index that matches the query and select the website log data from among the stored files. The website log data may have been divided into a plurality of blocks and stored in a distributed manner through the preprocessing performed by the database unit 100 and the index generating unit 200.

According to the first index included in the website log data, the processing unit 300 may allocate the amount of work required for data processing and send the blocks to the first processing unit 310, which distributes them and processes them in parallel.

According to the query, the first processing unit 310 searches the data contained in the blocks for the specific IP address and counts it. A block may contain various data such as website address, user IP address, and access time. Because this example needs only the data containing IP addresses, the second index can be used to extract and process just that data.

The plurality of clients arranged in parallel search the IP data of each block and, whenever the specific IP address appears, may send a count-of-1 signal to the second processing unit 320. The second processing unit 320 sums the intermediate results of the first processing unit 310 to obtain the final number of times the specific IP address appeared on the specific website.
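The IP counting example can be put together end to end as a rough sketch. The block layout used here (a dictionary with `first_index`, `second_index`, and `records` entries, where the first index carries a `site` label) is an assumption made for illustration, as are the function names; each worker also returns a per-block subtotal instead of one signal per hit, which gives the same sum.

```python
from concurrent.futures import ThreadPoolExecutor


def count_in_block(block: dict, target_ip: str) -> int:
    """First processing unit: count the target IP inside one block."""
    ip_col = block["second_index"]["ip"]  # locate the IP field via the second index
    return sum(1 for record in block["records"] if record[ip_col] == target_ip)


def count_ip(blocks: list, site: str, target_ip: str) -> int:
    # Filter by first index (assumed to hold a "site" label) before any processing.
    selected = [b for b in blocks if b["first_index"]["site"] == site]
    # Threads stand in for the parallel clients described in the text.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda b: count_in_block(b, target_ip), selected))
    return sum(partials)  # second processing unit: add up the intermediate counts
```

Blocks belonging to other sites are never scanned, which is the saving the description attributes to the first index.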

Although embodiments of the present invention have been described above with reference to the accompanying drawings, those of ordinary skill in the art to which the present invention belongs will understand that the invention may be implemented in other specific forms without changing its technical idea or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive.

100: database unit
200: index generating unit
210: first index
220: second index
300: processing unit
310: first processing unit
320: second processing unit

Claims

1. An index-based data processing system comprising:
a database unit that divides a received file into a plurality of blocks and stores them;
an index generating unit that is included in the database unit and generates a plurality of indexes for the plurality of blocks; and
a processing unit that processes the plurality of blocks according to a query,
wherein the processing unit extracts specific data from a specific block based on the plurality of indexes and processes it.

2. The system of claim 1,
wherein the plurality of indexes comprises a first index and a second index.

3. The system of claim 2,
wherein the first index provides information about a block and the second index provides information about the data in the block.

4. The system of claim 1,
wherein the plurality of blocks have a structured or unstructured format.

5. The system of claim 1,
wherein the index generating unit collects second indexes from the plurality of blocks, sets a candidate group among the second indexes, and generates the first index from the candidate group.

6. The system of claim 5,
wherein the first index is updated when a specific input waiting time has elapsed or when the second index changes.

7. The system of claim 1,
wherein the database unit is configured as a distributed file system (DFS).

8. The system of claim 7,
wherein the database unit replicates a block when the block is stored, stores the copies, and keeps the location information of the block and its replicas.

9. The system of claim 1,
wherein the processing unit includes a first processing unit and a second processing unit.

10. The system of claim 9,
wherein the first processing unit distributes the specific data and processes it in parallel.

11. The system of claim 9,
wherein the second processing unit aggregates the processed data to produce a result.

12. The system of claim 9,
wherein the processing unit determines the amount of parallel distributed work for the first processing unit based on the first index.

13. The system of claim 12,
wherein the processing unit determines the amount of work based on the second index when the index generating unit fails to generate the first index.

14. An index-based data processing method comprising:
dividing a received file into a plurality of blocks;
generating and storing a first index and a second index for the plurality of blocks;
receiving a query from a client;
selecting a specific block that includes the first index according to the query; and
processing the specific block,
wherein processing the specific block includes extracting specific data from the specific block based on the second index and processing it.

15. The method of claim 14,
wherein the first index provides information about a block and the second index provides information about the data in the block.

16. The method of claim 14,
wherein an index generating unit collects second indexes from the plurality of blocks, sets a candidate group among the second indexes, and generates the first index from the candidate group.

17. The method of claim 16,
wherein the first index is updated when a specific input waiting time has elapsed or when the second index changes.