KR102049420B1

KR102049420B1 - Method for parallel query processing of data comprising a replica in distributed database

Info

Publication number: KR102049420B1
Application number: KR1020180035203A
Authority: KR
Inventors: 최재용; 정태균; 백성인; 한혁; 진성일
Original assignee: 주식회사 리얼타임테크
Priority date: 2018-03-27
Filing date: 2018-03-27
Publication date: 2019-11-27
Also published as: KR20190113055A; WO2019189962A1

Abstract

The present invention relates to a technique that, in a distributed database, converts a query on data having replicas into a plurality of range-condition queries corresponding to the number of nodes on which the replicas are stored, and executes those range-condition queries simultaneously on multiple nodes, thereby shortening query execution time for large-volume distributed data that has replicas.
A query parallelization method for a table having replicas in a distributed database according to the present invention comprises: a first step in which a master server analyzes a query requested by a client and, when the target table is a distributed table and includes a column on which a range can be specified, determines the query to be a split target query; a second step in which the master server determines whether the number of records to be searched by the split target query exceeds a preset reference record count; a third step in which, when the number of records to be searched exceeds the preset reference record count in the second step, the master server identifies the slave servers on which the original and the copies exist; a fourth step in which the master server divides the record area based on the number of records to be searched and the number of slave servers on which copies exist, and generates a range query for each slave server; and a fifth step in which the master server transmits the range queries to the slave servers simultaneously in parallel, and collects and merges the range query execution results from the slave servers, thereby generating the result of the query requested by the client.

Description

Method for parallel query processing of data comprising a replica in distributed database

The present invention relates to a technique that, in a distributed database, converts a query on data having replicas into a plurality of range-condition queries corresponding to the number of nodes on which the replicas are stored, and executes those range-condition queries simultaneously on multiple nodes, thereby shortening query execution time for large-volume distributed data that has replicas.

With the development of wired and wireless communication technologies and of computer-related technologies, research has been conducted on techniques for managing data effectively.

With the advent of user-generated data, such as UCC and user-centric applications, the amount of data that must be managed at one time is also increasing rapidly.

In addition, as multimedia data grows in volume and computer processing speed improves, the size of each individual piece of data is also becoming very large.

Accordingly, there is an urgent need for techniques to manage large-volume data whose total is growing rapidly in both size and quantity.

Distributed database systems exist as systems for managing such large volumes of data.

In general, a distributed database system consists of a master server (10) and a plurality of slave servers (20), as shown in FIG. 1.

The master server (10) manages the slave servers (20) and tracks information such as the location of the slave server (20) to which each piece of data belongs. Each slave server (20) manages the partitions that hold the actual data, and the data are kept sorted sequentially by key.

In general, to improve data reliability and performance, a distributed database creates and manages multiple replicas of each file, distributed across several servers. Depending on the characteristics of a file, replicas may not be created.

That is, database replication is a distributed database technique in which objects stored in one database are copied to another, physically separate database so that they can be used on two or more database servers.

Such replication can improve performance by spreading the accesses of applications that use the same objects across several database servers, and it can satisfy differing operational requirements by allowing a replicated database server to be used for other purposes.

Furthermore, when a running database server fails, it can be quickly replaced by a replica database server, allowing an urgent response to the failure. Replication is therefore highly useful.

However, in the distributed database system described above, a query requested by a client is generally executed on one particular slave server that stores either the original data or replica data, and the result is obtained from that server.

Consequently, even when the number of records to be searched by a query exceeds a certain level, the query is executed on a single slave server, so obtaining the result takes longer, which ultimately degrades the overall performance of the system.

That is, although a distributed database stores data on multiple slave servers (nodes) for high availability, the replicas are in practice rarely used unless a failure occurs.

1. Korean Registered Patent No. 10-1666064 (Title: Data management apparatus and method using URL information in a distributed file system)

The present invention was conceived in view of the above circumstances. Its technical object is to provide a query parallelization method for replicated data in a distributed database that, for a table having replicas, converts a query into a plurality of range-condition queries and executes them simultaneously through multiple nodes, thereby shortening query execution time for large-volume distributed tables that have replicas.

According to one aspect of the present invention for achieving the above object, there is provided a query parallelization method for replicated data in a distributed database comprising a master server that distributes queries in response to query requests from clients and a plurality of slave servers that store data tables, execute queries, and return the results, the method comprising: a first step in which the master server parses the query requested by the client, determines the target table to be a distributed table when the query contains the "PARTITION BY" clause, and determines the query to be a split target query when the distributed table having a replica includes a column on which a range can be specified; a second step in which the master server determines whether the number of records to be searched by the split target query exceeds a preset reference record count, the reference record count being set differently according to the name of the range-condition column of the split target query on the basis of query execution time; a third step in which, when the number of records to be searched exceeds the preset reference record count in the second step, the master server identifies the slave servers on which the original and the copies exist; a fourth step in which the master server divides the record area based on the number of slave servers on which the original and the copies exist and the number of records to be searched, and generates range queries over different record areas to be executed on the respective slave servers, the master server setting a smaller record range for the range query given to a slave server the greater that server's current load is; and a fifth step in which the master server transmits the range queries to the slave servers simultaneously in parallel, and collects and merges the range query execution results from the slave servers, thereby generating the result of the query requested by the client.

(Deleted)

There is also provided a query parallelization method for replicated data in a distributed database in which, in the first step, the master server determines the query to be a split target query when the target table contains a column whose data type is numeric (INT) or date (DATE).

(Deleted)

There is also provided a query parallelization method for replicated data in a distributed database in which, in the fourth step, the master server converts the sharding-condition column in the WHERE condition of the query into a range-condition column and sets the range corresponding to each divided record area as the value of the range-condition column, thereby generating, for each slave server, a range query with a different range-condition column value.

According to the present invention, for a table having replicas in a distributed database, a query is converted into a plurality of range-condition queries, and the nodes holding the target table execute queries over different ranges at the same time, so that query execution time for large-volume distributed tables having replicas can be shortened.

FIG. 1 is a conceptual diagram illustrating a general distributed database configuration.
FIG. 2 is a diagram illustrating the configuration of a distributed database, to which the present invention is applied, having a query parallelization function for replicated data.
FIG. 3 is a diagram illustrating a query parallelization method for replicated data in a distributed database according to a first embodiment of the present invention.
FIG. 4 is a diagram illustrating the process of converting the original query into a plurality of range queries in FIG. 3.

Hereinafter, the present invention is described in more detail with reference to the accompanying drawings. Note that identical components are denoted by the same reference numerals wherever possible throughout the drawings. The terms and words used in this specification and the claims should not be construed as limited to their ordinary or dictionary meanings; on the principle that an inventor may appropriately define the concepts of terms in order to describe the invention in the best way, they should be interpreted with meanings and concepts consistent with the technical idea of the present invention. Accordingly, the embodiments described in this specification and the configurations shown in the drawings are merely the most preferred embodiment of the present invention and do not represent all of its technical ideas, so it should be understood that various equivalents and modifications capable of replacing them may exist at the time of filing.

FIG. 2 is a diagram illustrating the configuration of a distributed database, to which the present invention is applied, having a query parallelization function for replicated data.

As shown in FIG. 2, the distributed database having a query parallelization function for tables with replicas comprises a master server (100), which distributes queries in response to query requests from clients, merges the results provided by the slave servers (200), and returns them to the client, and a plurality of slave servers (200), which store data, execute queries, and return the results.

Here, the slave servers (200) include a database server that stores the original data and database servers that store replicas. Alternatively, the original data may be stored on the master server (100), and the slave servers (200) may serve as replica servers that store the replicas.

The master server (100) comprises a query analysis module (110), a query optimization module (120), a query splitting module (130), a query distribution module (140), and a result merging module (150).

The query analysis module (110) parses the query requested by the client and analyzes its command. For example, it identifies the command type, such as search (SELECT statement), store (INSERT statement), or join (JOIN clause).

The query optimization module (120) optimizes the client's query and analyzes whether the query is a split target query. Through syntax analysis, it checks whether the target table from which the result of the original query is to be obtained is a distributed table that has a replica table. When the "PARTITION BY" clause is present in the query, the table is judged to be a distributed table, and a query involving a distributed table is judged to be a split target query.

Here, all data are stored in the database in the form of tables. A table is the basic structure in which a database stores data, and one table consists of one or more records.

In addition, only when the target table is a distributed table that has a copy, the query optimization module (120) checks through syntax analysis whether the query includes a condition that can be divided by range, and if so, it finally determines the query to be a split target query. Here, the divisible condition of a query is one whose data type is numeric (INT) or date (DATE).

The query splitting module (130) generates, for a split target query, a plurality of range queries to be sent to the slave servers (200) on which copies exist. It converts the split target query into as many range queries as there are slave servers (200), that is, nodes, storing a copy of the target table. The range queries differ in the column range value of their WHERE condition, and this column is the field that corresponds to the divisible condition.

The query distribution module (140) creates threads and transmits the range queries to the slave servers (200) simultaneously.

The result merging module (150) receives the results of the range queries from the slave servers (200), combines them, and provides the combined result to the client that requested the query.

FIG. 3 is a diagram illustrating a query parallelization method for replicated data in a distributed database according to an embodiment of the present invention.

When a query request from a client is received by the master server (100), the master server (100) parses the original query, analyzes its syntax, and determines whether it is a split target query (ST100).

That is, the master server (100) determines whether the target table of the original query is a distributed table of which a copy exists on the slave servers (200). When the original query contains the "PARTITION BY" clause, the table is judged to be a distributed table.

Then, when the target table of the original query is found in step ST100 to be a distributed table, the master server (100) analyzes the target table, more specifically the WHERE condition, to determine whether it includes a column on which a range can be specified (ST200). Here, the master server (100) can determine whether the query is a range-specifying query by parsing it and checking whether it is a unique scan that retrieves a single record. The master server (100) may also judge the query to be a range-specifying query when a preset range-expression column condition is satisfied, for example when a column of numeric (INT) or date (DATE) data type is present.

In other words, when the original query targets a distributed table and includes a column on which a range can be specified, the master server (100) determines the original query to be a split target query.
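The eligibility checks of steps ST100 and ST200 can be sketched as follows. This is only an illustration under stated assumptions: the function name, its boolean inputs (a pre-parsed view of the query), and the example schema are hypothetical and are not defined in the specification.

```python
# Data types that the description names as usable for a range condition.
RANGE_TYPES = {"INT", "DATE"}


def is_split_candidate(has_partition_by, has_replica, column_types, unique_scan):
    """Return True when the checks described for ST100 and ST200 all pass.

    All parameters are a hypothetical, already-parsed view of the query:
    has_partition_by -- the query contains the "PARTITION BY" clause
    has_replica      -- a copy of the target table exists on slave servers
    column_types     -- mapping of column name to data type of the target table
    unique_scan      -- the query retrieves a single record
    """
    # ST100: the target must be a distributed table that has a replica.
    if not (has_partition_by and has_replica):
        return False
    # ST200: a unique scan fetches one record, so there is no range to divide.
    if unique_scan:
        return False
    # ST200: at least one column must have a range-capable data type.
    return any(t.upper() in RANGE_TYPES for t in column_types.values())


# Assumed schema for illustration; FIG. 4 (A) is not reproduced in the text.
schema = {"id": "INT", "loc": "VARCHAR"}
print(is_split_candidate(True, True, schema, unique_scan=False))
```

A real parser would derive these inputs from the query text and the catalog; the sketch only shows the order of the decisions.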

When the original query is determined through steps ST100 and ST200 to be a split target query, the master server (100) determines whether the total number of records affected by the query, that is, the number of records to be searched, exceeds a preset record reference value (ST300). The record reference value is used to decide whether to split the query into range queries: when the number of records to be searched is below the preset reference value, the original query is not split. The record reference value may be set differently for different query conditions, taking into account the query execution time under each condition. For example, the record reference value may be set smaller when the range condition is a date than when the range condition is an ID.
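A minimal sketch of the ST300 decision is given below. The reference values are invented for illustration; the specification states only that the value for a date condition may be smaller than the value for an ID condition, and gives no numbers. The column names and the function name are likewise assumptions.

```python
# Hypothetical per-column reference record counts (not from the specification).
REFERENCE_RECORDS = {"id": 1_000_000, "date": 100_000}


def should_split(records_to_search, range_column, reference=REFERENCE_RECORDS):
    """ST300: split only when the search volume exceeds the column's reference value."""
    return records_to_search > reference[range_column]


print(should_split(30_000_000, "id"))  # large scan on id
print(should_split(500_000, "id"))     # below the id reference value
print(should_split(500_000, "date"))   # same volume, but the date reference is lower
```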

Next, when the number of records to be searched exceeds the record reference value in step ST300, the master server (100) identifies the slave servers (200) on which the original table and the copy tables are stored (ST400).

Then, based on the state of the slave servers (200), the master server (100) generates a plurality of range queries with different condition ranges, corresponding to the number of slave servers (200) that are able to execute the query (ST500). Here, among the slave servers (200) storing a copy, the master server (100) may designate those whose current load is at or below a certain level as the slave servers (200) able to execute the query.
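The selection of slave servers able to execute the query might look like the sketch below. The load scale (a fraction between 0 and 1), the cutoff, and the function name are assumptions made for illustration; the specification does not say how load is measured.

```python
def executable_servers(replica_loads, max_load):
    """Keep the replica-holding servers whose current load is at or below max_load.

    replica_loads maps a node name to its current load; both the scale and
    the cutoff are hypothetical.
    """
    return [node for node, load in replica_loads.items() if load <= max_load]


loads = {"Node1": 0.2, "Node2": 0.5, "Node3": 0.9}
print(executable_servers(loads, max_load=0.8))
```

The number of servers returned then determines how many range queries are generated.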

The range queries are generated by dividing the number of records to be searched under the query condition by the number of slave servers (200) able to execute the query, so that each range query covers different records. Here, instead of dividing the query range equally, the master server (100) may divide it unequally according to the current load of each slave server (200).

FIG. 4 shows an example in which an original query is split into a plurality of range queries. In FIG. 4, (A) illustrates a table schema, and (B) illustrates the original query (300) on (A) being split into a plurality of range queries (310 to 330).

In FIG. 4, it is assumed that the original data whose sharding key value is "LA" is stored on the first slave server (Node 1), that replica data is stored on the second slave server (Node 2) and the third slave server (Node 3), that the stored id values run from 0 to 30000000, and that the number of records to be searched corresponds to the id values.

Because the target table is stored on a total of three slave servers (Node1 to Node3), the original query (300) can be converted into three range queries, the first to third range queries (310 to 330).

In the WHERE condition of the original query, the condition on the loc column, which is the sharding condition, is converted into a condition on id, which is the range-condition column, and the id range is set based on the number of records to be searched and the number of slave servers. In FIG. 4 (B), there are three slave servers in total, so the query is divided into three range queries, and the total number of records to be searched is 30 million, so each range query is generated with an id range that makes its node search 10 million records in a different area.
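The equal division described for FIG. 4 (B) can be sketched as follows. The queries of FIG. 4 are not reproduced in the text, so the table name `t`, the `SELECT *` shape, and the use of half-open ranges are all assumptions; only the column name `id`, the three nodes, and the 30 million records come from the description.

```python
def split_evenly(lo, hi, parts):
    """Divide the half-open id interval [lo, hi) into `parts` contiguous ranges."""
    total = hi - lo
    # Integer arithmetic keeps the boundaries exact and the ranges gap-free.
    bounds = [lo + total * i // parts for i in range(parts + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def range_queries(table, column, ranges):
    """Build one range query per interval; the SQL shape is illustrative only."""
    return [
        f"SELECT * FROM {table} WHERE {column} >= {a} AND {column} < {b}"
        for a, b in ranges
    ]


ranges = split_evenly(0, 30_000_000, 3)   # three nodes, 30 million records
for query in range_queries("t", "id", ranges):
    print(query)
```

Each generated query covers 10 million consecutive id values, and no id is covered by two queries.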

Here, the master server (100) may set different id ranges for the slave servers (Node1 to Node3) storing a copy, taking into account their state, for example their degree of load or whether they have failed. For example, a record area of 15 million records may be assigned to the first slave server (Node1), which currently has the lowest load, a record area of 10 million records to the second slave server (Node2), which has a medium load, and a record area of 5 million records to the third slave server (Node3), which has a relatively high load.
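The load-dependent division in this example can be sketched with per-node weights. The specification does not give a formula for turning load into a share of the range, so the weights below (3 : 2 : 1) are simply chosen to reproduce the 15 million, 10 million, and 5 million record areas of the example; the function name is hypothetical.

```python
def split_by_weight(lo, hi, weights):
    """Divide [lo, hi) into contiguous ranges proportional to `weights`.

    A larger weight means a larger share; a lightly loaded node would be
    given a larger weight. How weights are derived from load is an assumption.
    """
    total, weight_sum = hi - lo, sum(weights)
    bounds, running = [lo], 0
    for w in weights:
        running += w
        bounds.append(lo + total * running // weight_sum)
    return list(zip(bounds[:-1], bounds[1:]))


# Node1 (lowest load), Node2 (medium load), Node3 (relatively high load)
ranges = split_by_weight(0, 30_000_000, [3, 2, 1])
print([hi - lo for lo, hi in ranges])  # sizes of the three record areas
```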

Thereafter, the master server (100) creates threads and transmits the range queries simultaneously, in parallel, to the slave servers (200) storing the target table and its copies (ST600). Here, the master server (100) includes identification information for the original query, for example table schema information, in each range query sent to the slave servers (200).

Each slave server (200) then executes its range query, obtains the results stored in the corresponding table, and provides them to the master server (100). Each slave server (200) returns the original-query identification information along with its results, and the master server (100) uses this identification information to combine the results of the range queries received from the slave servers (200) and provides the combined result to the client that requested the query.
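The parallel transmission and merging of steps ST600 onward can be sketched with a thread pool from the Python standard library. The function `run_range_query` is a stand-in for a slave server and simply filters an in-memory id range; it, the `query_id` tag, and the node names are assumptions for illustration, not the patented implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def run_range_query(node, query_id, lo, hi):
    """Stand-in for a slave server: return matching ids in [lo, hi).

    The result is tagged with the original-query identifier so the
    master can match it to the request it belongs to.
    """
    rows = [i for i in range(lo, hi) if i % 7 == 0]
    return query_id, node, rows


def scatter_gather(query_id, assignments):
    """Send one range query per node in parallel and merge the results."""
    with ThreadPoolExecutor(max_workers=len(assignments)) as pool:
        futures = [
            pool.submit(run_range_query, node, query_id, lo, hi)
            for node, (lo, hi) in assignments.items()
        ]
        merged = []
        # Collecting in submission order keeps the merged rows ordered by range.
        for future in futures:
            qid, _node, rows = future.result()
            if qid == query_id:  # accept only results for this original query
                merged.extend(rows)
    return merged


assignments = {"Node1": (0, 10), "Node2": (10, 20), "Node3": (20, 30)}
print(scatter_gather("q-300", assignments))
```

Because the ranges do not overlap, concatenating the partial results is enough for a plain selection; queries with ordering or aggregation would need a more careful merge.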

That is, according to the above embodiment, a query on a table having replicas in a distributed database is converted into a plurality of range-condition queries, and these queries are executed simultaneously on the nodes where the replicas exist, so that the time taken by queries on large-volume distributed tables having replicas can be shortened.

100: master server,
110: query analysis module, 120: query optimization module,
130: query splitting module, 140: query distribution module,
150: result merging module,
200: slave server,
300: original query,
310 to 330: range queries.

Claims

A query parallelization method for replicated data in a distributed database comprising a master server that distributes queries in response to query requests from clients and a plurality of slave servers that store data tables, execute queries, and return the results, the method comprising:
a first step in which the master server parses the query requested by the client, determines the target table to be a distributed table when the query contains the "PARTITION BY" clause, and determines the query to be a split target query when the distributed table having a replica includes a column on which a range can be specified;
a second step in which the master server determines whether the number of records to be searched by the split target query exceeds a preset reference record count, the reference record count being set differently according to the name of the range-condition column of the split target query on the basis of query execution time;
a third step in which, when the number of records to be searched exceeds the preset reference record count in the second step, the master server identifies the slave servers on which the original and the copies exist;
a fourth step in which the master server divides the record area based on the number of slave servers on which the original and the copies exist and the number of records to be searched, and generates range queries over different record areas to be executed on the respective slave servers, the master server setting a smaller record range for the range query given to a slave server the greater that server's current load is; and
a fifth step in which the master server transmits the range queries to the slave servers simultaneously in parallel, and collects and merges the range query execution results from the slave servers, thereby generating the result of the query requested by the client.

(Deleted)

The method of claim 1,
wherein, in the first step, the master server determines the query to be a split target query when the target table contains a column whose data type is numeric (INT) or date (DATE).

(Deleted)

The method of claim 1,
wherein, in the fourth step, the master server converts the sharding-condition column in the WHERE condition of the query into a range-condition column and sets the range corresponding to each divided record area as the value of the range-condition column, thereby generating, for each slave server, a range query with a different range-condition column value.