WO2019189962A1

WO2019189962A1 - Query parallelizing method for data having copy existing in distribution database

Info

Publication number: WO2019189962A1
Application number: PCT/KR2018/003696
Authority: WO
Inventors: 최재용; 정태균; 백성인; 한혁; 진성일
Original assignee: 주식회사 리얼타임테크
Priority date: 2018-03-27
Filing date: 2018-03-29
Publication date: 2019-10-03
Also published as: KR102049420B1; KR20190113055A

Abstract

The present invention relates to a technique that converts a query on data that has replicas in a distributed database into a plurality of range condition queries corresponding to the number of nodes storing the replicas, and executes the range condition queries simultaneously on the plurality of nodes. The technique thereby reduces query execution time for large-volume distributed data that has replicas.

Description

Query Parallelization Method for Replicated Data in a Distributed Database

The present invention relates to a technique that converts a query on data that has replicas in a distributed database into a plurality of range condition queries corresponding to the number of nodes in which the replicas are stored, and executes the range condition queries simultaneously on multiple nodes, thereby reducing query execution time for large-volume distributed data that has replicas.

With advances in wired and wireless communication technologies and in computer-related technologies, research has been conducted on technologies for managing data effectively.

With the advent of user-generated data such as UCC and user-centric applications, the amount of data to be managed at once is also increasing rapidly.

In addition, with the growth in the capacity of multimedia data and in computer processing speed, the size of individual data items has also become very large.

Therefore, there is an urgent need for a management technique for large data, which is rapidly increasing in both size and quantity.

A distributed database system exists as a system for managing such a large amount of data.

In general, a distributed database system includes a master server 10 and a plurality of slave servers 20, as shown in FIG. 1.

The master server 10 manages the slave servers 20 and tracks which slave server 20 holds each piece of data. Each slave server 20 manages the partition to which the actual data belongs, and the data is arranged and managed sequentially by key.

In general, a distributed database creates and manages, for each file, a plurality of replicas distributed across the servers in order to improve data stability and performance. Depending on the characteristics of a file, a replica may not be created.

In other words, database replication is one of the distributed database technologies: it copies an object stored in one database to another, physically separate database so that the object can be used on two or more database servers.

This replication technology can improve performance by distributing access to applications that use the same object across multiple database servers, or by allowing the replicated database server to be used for other purposes to meet different operational requirements.

In addition, when a failure occurs in the operating database server, it can be quickly replaced by a replica database server.

However, in a distributed database system, when a client requests a query, the result is generally obtained by executing the query on one specific slave server that stores the original data or a replica.

Accordingly, when the number of records to be searched by a query is at or above a certain number, the result is still obtained from a single slave server, so the time required to obtain the result increases, which in turn degrades the overall performance of the system.

In other words, although the data is actually stored on multiple slave servers (nodes) for high availability in a distributed database, the replicas are rarely used unless a failure occurs.

Accordingly, the present invention was created in view of the above circumstances. Its technical purpose is to provide a query parallelization method for data that has replicas in a distributed database, which converts a query on a table that has replicas into a plurality of range condition queries and executes them simultaneously on a plurality of nodes, thereby reducing query execution time for large distributed tables that have replicas.

According to an aspect of the present invention for achieving the above object, there is provided a query parallelization method for data that has replicas in a distributed database, the distributed database including a master server that distributes queries in response to query requests from clients and a plurality of slave servers that store data tables, execute queries, and return the results, the method comprising: a first step in which the master server analyzes a query requested from a client and determines that the query is a split target query when the target table is a distributed table and includes a column for which a range can be specified; a second step in which the master server determines whether the number of search target records for the split target query exceeds a preset reference record number; a third step in which, when the number of search target records exceeds the preset reference record number in the second step, the master server identifies the slave servers on which the original and the replicas exist; a fourth step in which the master server generates a range query for each slave server by dividing the record area based on the number of search target records and the number of slave servers on which a replica exists; and a fifth step in which the master server transmits the range queries to the slave servers simultaneously and in parallel, and collects and merges the range query execution result data from each slave server to generate a result for the query requested by the client.

In addition, there is provided a query parallelization method for data that has replicas in a distributed database, wherein in the first step the master server parses the query and determines that it is a split target query when it is not a unique scan that searches for a single record.

Further, there is provided a query parallelization method for data that has replicas in a distributed database, wherein in the first step the master server determines that the query is a split target query when the target table includes a column whose data type is a number (INT) or a date (DATE).

Further, there is provided a query parallelization method for data that has replicas in a distributed database, wherein in the second step the master server sets the reference record number differently according to the query condition, based on query execution time.

In addition, there is provided a query parallelization method for data that has replicas in a distributed database, wherein in the fourth step the master server sets the record range of the range query to be provided to each slave server based on the current load of the slave server in which the replica is stored.

In addition, there is provided a query parallelization method for data that has replicas in a distributed database, wherein in the fourth step the master server converts the sharding condition column in the WHERE condition of the query into a range condition column and sets a range corresponding to each divided record area as the range condition column value, thereby generating range queries having a different range condition column value for each slave server.

According to the present invention, a query on a table that has replicas in a distributed database is split into queries covering different ranges, and these queries are executed simultaneously on the plurality of nodes where the target table exists, so that query execution time for large distributed tables that have replicas can be shortened.

FIG. 1 is a conceptual diagram illustrating a general distributed database configuration.

FIG. 2 is a diagram for explaining the configuration of a distributed database, to which the present invention is applied, having a query parallelization function for data that has replicas.

FIG. 3 is a diagram for explaining a query parallelization method for data that has replicas in a distributed database according to a first embodiment of the present invention.

FIG. 4 is a diagram illustrating the process of converting an original query into a plurality of range queries in FIG. 3.

Hereinafter, the present invention will be described in detail with reference to the accompanying drawings. It should be noted that the same elements are denoted by the same reference signs throughout the figures wherever possible. The terms and words used in this specification and the claims should not be construed as limited to their ordinary or dictionary meanings; based on the principle that an inventor may properly define the concepts of terms in order to explain the invention in the best way, they should be interpreted as having meanings and concepts consistent with the technical idea of the present invention. Therefore, the embodiments described in this specification and the configurations shown in the drawings are only the most preferred embodiments of the present invention and do not represent all of its technical ideas, and it should be understood that various equivalents and modifications that could replace them may exist at the time of filing this application.

FIG. 2 is a diagram for explaining the configuration of a distributed database, to which the present invention is applied, having a query parallelization function for data that has replicas.

As shown in FIG. 2, a distributed database to which the present invention is applied, having a query parallelization function for a table that has replicas, includes a master server 100, which distributes queries in response to a query request from a client and merges the results provided by the slave servers 200 for delivery to the client, and a plurality of slave servers 200, which store the data, execute queries, and return the results.

Here, the slave servers 200 include a database server that stores the original data and database servers that store the replicas. Alternatively, the original data may be stored in the master server 100, and the slave servers 200 may be configured as replica servers that store the replicas.

The master server 100 includes a query analysis module 110, a query optimization module 120, a query partitioning module 130, a query distribution module 140, and a result merging module 150.

The query analysis module 110 parses the query requested from the client to analyze the command. For example, the query analysis module 110 analyzes command types such as search (SELECT syntax), storage (INSERT syntax), join (JOIN syntax), and the like.

The query optimization module 120 optimizes the client's query and analyzes whether the query is a split target query. Through syntax analysis, the query optimization module 120 determines whether the target table from which the result of the original query is obtained is a distributed table that has a replica table. When the "PARTITION BY" syntax exists in the query, the target table is determined to be a distributed table, and a query involving the distributed table is treated as a candidate split target query.

Here, all data is stored in the database in the form of tables. A table is the basic structure for storing data in a database, and one table is composed of one or more records.

In addition, only when the target table is a distributed table that has replicas, the query optimization module 120 parses the query to check whether it includes a condition on which the range can be partitioned. If it does, the query is finally determined to be a split target query. The partitionable condition may be a column of numeric type (INT) or date type (DATE).

The query partitioning module 130 generates, for the split target query, a plurality of range queries to be sent to the slave servers 200 in which the copies exist. The query partitioning module 130 converts the split target query into as many range queries as there are slave servers 200, that is, nodes, in which a copy of the target table is stored. The range queries differ in the column range of their WHERE condition, and that column is the field corresponding to the partitionable condition.

The query distribution module 140 creates a thread and simultaneously transmits a range query to each slave server 200.

The result merging module 150 receives the results of the range query from each slave server 200, collects them, and provides them to the query requesting client.

FIG. 3 is a diagram illustrating a query parallelization method for data that has replicas in a distributed database according to an embodiment of the present invention.

When the master server 100 receives a query request from the client, it parses and analyzes the original query to determine whether the query is a split target query (ST100).

That is, the master server 100 determines whether the target table of the original query is a distributed table that has copies in the slave servers 200. If the original query includes the phrase "PARTITION BY", the target table is determined to be a distributed table.

When the target table of the original query is a distributed table in step ST100, the master server 100 analyzes the query condition and determines whether the target table includes a column for which a range can be specified (ST200). In this case, the master server 100 may parse the query and determine that it is a range-specifiable query when it is not a unique scan that searches for a single record. In addition, the master server 100 may determine that the query is a range-specifiable query when a preset range expression column condition is satisfied, for example, when the data type of the column is a number (INT) or a date (DATE).

That is, the master server 100 determines that the original query is a split target query when its target table is a distributed table and includes a column for which a range can be specified.
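The decision made in steps ST100 and ST200 can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the implementation described in the specification: the `ParsedQuery` record, its field names, and the `is_split_target` function are hypothetical, and the sketch assumes a parser has already extracted whether the target table is distributed, whether the query is a unique scan, and the data types of the table's columns.

```python
from dataclasses import dataclass
from typing import Dict

# Column data types for which a range can be specified, per the text (INT, DATE).
RANGE_TYPES = {"INT", "DATE"}


@dataclass
class ParsedQuery:
    """Hypothetical summary of what the query analysis step extracted."""
    is_distributed_table: bool    # e.g. "PARTITION BY" syntax was found
    is_unique_scan: bool          # query searches for a single record
    column_types: Dict[str, str]  # column name -> data type


def is_split_target(query: ParsedQuery) -> bool:
    """Return True when the query qualifies as a split target query."""
    if not query.is_distributed_table:
        return False  # ST100: only distributed tables with replicas qualify
    if query.is_unique_scan:
        return False  # ST200: a single-record lookup cannot be range-split
    # ST200: at least one column must allow a range condition.
    return any(t.upper() in RANGE_TYPES for t in query.column_types.values())
```

For example, a non-unique scan over a distributed table with columns `loc` (VARCHAR) and `id` (INT) would qualify, while the same query against a non-distributed table would not.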

When the master server 100 determines through steps ST100 and ST200 that the original query is a split target query, it determines whether the total number of records affected by the query, that is, the number of search target records, exceeds a predetermined record reference value (ST300). The record reference value is used to decide whether to divide the query into range queries. If the number of search target records is less than the preset reference value, the original query is not divided. The record reference value may be set differently according to the query condition, in consideration of the query execution time under that condition. For example, the record reference value may be set smaller when the range condition is a date than when the range condition is an ID.
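The check in step ST300 can be illustrated with a short Python sketch. The concrete reference values and the names `REFERENCE_RECORDS` and `should_split` are assumptions made for illustration; the text only states that the value for a date condition may be smaller than the value for an ID condition.

```python
# Hypothetical reference record numbers per range-condition type.
# Only the ordering (DATE smaller than ID) comes from the text;
# the numbers themselves are illustrative assumptions.
REFERENCE_RECORDS = {"DATE": 100_000, "ID": 1_000_000}


def should_split(search_target_records: int, range_condition: str) -> bool:
    """Return True only when the record count exceeds the reference value."""
    return search_target_records > REFERENCE_RECORDS[range_condition]
```

With these assumed values, a query touching 500,000 records would be split when its range condition is a date but executed unsplit when its range condition is an ID.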

Subsequently, when the number of records to be searched exceeds the record reference value in step ST300, the master server 100 checks the slave server 200 in which the original table and the copy table are stored (ST400).

The master server 100 generates a plurality of range queries having different condition ranges, one for each slave server 200 that can execute the query, based on the state of the slave servers 200 (ST500). In this case, the master server 100 may treat as query-executable those slave servers 200, among the slave servers 200 in which a copy is stored, whose current load is less than or equal to a predetermined level.

In addition, the master server 100 generates a range query to execute a query for different search target records by dividing the number of search target records corresponding to the query condition by the number of slave servers 200 that can execute the query. Here, the master server 100 may divide the query range differently based on the current load amount of each slave server 200 without equally dividing the query range.
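The server selection and the equal division described for step ST500 might look like the following Python sketch. The function names, the representation of load as a number between 0 and 1, and the use of half-open `[start, end)` record ranges are assumptions for illustration only.

```python
from typing import Dict, List, Tuple


def executable_slaves(loads: Dict[str, float], max_load: float) -> List[str]:
    """Keep only the slave servers whose current load is at or below the limit."""
    return [name for name, load in loads.items() if load <= max_load]


def split_evenly(total_records: int, parts: int) -> List[Tuple[int, int]]:
    """Divide total_records into `parts` contiguous half-open ranges."""
    base, extra = divmod(total_records, parts)
    ranges = []
    start = 0
    for i in range(parts):
        # The first `extra` ranges take one additional record each.
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges
```

The ranges are contiguous and non-overlapping, so every search target record is covered by exactly one range query.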

FIG. 4 is a diagram illustrating an example of dividing an original query into a plurality of range queries. In FIG. 4, (A) illustrates a table schema, and (B) illustrates that the original query 300 for (A) is divided into a plurality of range queries 310 to 330.

In FIG. 4, for the data whose sharding key is "LA", the original data is stored in the first slave server (Node1) and replica data is stored in the second slave server (Node2) and the third slave server (Node3). The currently stored id values range from 0 to 30000000, and the number of records to be searched corresponds to this id range, that is, 30 million records.

In this case, since the target table is stored in a total of three slave servers, Node1 to Node3, the original query 300 may be converted into three range queries, namely the first to third range queries 310 to 330.

The loc column condition, which is the sharding condition in the WHERE condition of the original query, is converted to a condition on id, which is the range condition column, and the id column range is set based on the number of records to be searched and the number of slave servers. In FIG. 4 (B), since there are three slave servers in total, the query is divided into three range queries. Since the total number of records to be searched is 30 million, the id ranges are set so that each node searches a different area of 10 million records, and the range queries are generated accordingly.
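The conversion in this example can be sketched in Python as below. The SQL text in FIG. 4 is not reproduced in this description, so the table name `t`, the `SELECT *` projection, and the function `build_range_queries` are hypothetical; the sketch also assumes the 30 million ids are numbered 0 to 29,999,999 and uses half-open ranges.

```python
from typing import Dict, List


def build_range_queries(table: str, column: str, total: int,
                        nodes: List[str]) -> Dict[str, str]:
    """Build one range query per node, each covering a distinct id range."""
    size, remainder = divmod(total, len(nodes))
    queries = {}
    start = 0
    for i, node in enumerate(nodes):
        end = start + size + (1 if i < remainder else 0)
        # The sharding condition (loc) is replaced by a range on `column`.
        queries[node] = (
            f"SELECT * FROM {table} "
            f"WHERE {column} >= {start} AND {column} < {end}"
        )
        start = end
    return queries


queries = build_range_queries("t", "id", 30_000_000, ["Node1", "Node2", "Node3"])
for node, sql in queries.items():
    print(node, sql)
```

String interpolation keeps the sketch short; a real system would more likely bind the range bounds as query parameters.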

In this case, the master server 100 may set different id ranges for the slave servers Node1 to Node3 in consideration of the state of each slave server in which a copy is stored, for example, its load level or whether it has failed. For example, a record area of 15 million records may be assigned to the first slave server Node1, which has the lowest load, 10 million records to the second slave server Node2, which has a medium load, and 5 million records to the third slave server Node3, which has a relatively heavy load.
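One way to obtain an uneven division such as 15, 10, and 5 million records is to split the record area in proportion to per-node weights. The Python sketch below assumes capacity weights of 3, 2, and 1 for Node1 to Node3; how such weights would be derived from the measured load is not specified in the text, and the function name `split_by_weight` is hypothetical.

```python
from typing import List, Tuple


def split_by_weight(total_records: int, weights: List[int]) -> List[Tuple[int, int]]:
    """Split [0, total_records) into half-open ranges proportional to weights."""
    total_weight = sum(weights)
    ranges = []
    start = 0
    accumulated = 0
    for weight in weights:
        accumulated += weight
        # Cumulative boundaries avoid gaps or overlaps from integer rounding.
        end = total_records * accumulated // total_weight
        ranges.append((start, end))
        start = end
    return ranges


# Assumed weights: the less loaded a node is, the larger its weight.
ranges = split_by_weight(30_000_000, [3, 2, 1])
```

With these assumed weights, Node1 receives ids 0 to 14,999,999, Node2 receives 15,000,000 to 24,999,999, and Node3 receives 25,000,000 to 29,999,999.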

Thereafter, the master server 100 creates threads and transmits the range queries simultaneously, in parallel, to each slave server 200 in which the target table or a copy thereof is stored (ST600). At this time, the master server 100 includes identification information on the original query, for example, table schema information, in each range query it transmits to the slave servers 200.

Each slave server 200 executes its range query against the corresponding table and provides the obtained result to the master server 100. At this time, each slave server 200 returns the original query identification information together with the result. Based on that identification information, the master server 100 collects the range query results received from the slave servers 200 that belong to the same original query, merges them, and provides the merged result to the client that requested the query.
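The parallel transmission and merging can be sketched with Python's standard thread pool. This is an illustrative sketch only: `run_parallel`, `fake_slave`, the in-memory `TABLE`, and the use of a string as the original query identification information are all assumptions, and a real slave server would be reached over the network rather than through a local function call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

Range = Tuple[int, int]


def run_parallel(query_id: str,
                 assignments: List[Tuple[str, Range]],
                 execute: Callable[[str, str, Range], Tuple[str, list]]) -> list:
    """Send one range query per node on its own thread and merge the results."""
    merged = []
    with ThreadPoolExecutor(max_workers=len(assignments)) as pool:
        futures = [pool.submit(execute, node, query_id, rng)
                   for node, rng in assignments]
        for future in futures:
            returned_id, rows = future.result()
            # Only results tagged with the same original query id are merged.
            if returned_id != query_id:
                raise ValueError("result does not belong to this query")
            merged.extend(rows)
    return merged


TABLE = list(range(30))  # stand-in for a table replicated on every node


def fake_slave(node: str, query_id: str, rng: Range) -> Tuple[str, list]:
    """Hypothetical slave: scan its local copy for ids in [lo, hi)."""
    lo, hi = rng
    return query_id, [row for row in TABLE if lo <= row < hi]


result = run_parallel(
    "q-1",
    [("Node1", (0, 10)), ("Node2", (10, 20)), ("Node3", (20, 30))],
    fake_slave,
)
```

Results are merged in the order the range queries were submitted, so the merged list here follows the id order of the ranges.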

That is, according to the above embodiment, a query on a table that has replicas in a distributed database is converted into several range condition queries, which are then executed simultaneously on the plurality of nodes where the replicas exist, so that query time for large tables that have replicas can be shortened.

Claims

A query parallelization method for data that has replicas in a distributed database, the distributed database including a master server that distributes queries in response to query requests from clients and a plurality of slave servers that store data tables, execute queries, and return the results, the method comprising:

A first step in which the master server analyzes a query requested from a client and determines that the query is a split target query when the target table is a distributed table and includes a column for which a range can be specified;

A second step in which the master server determines whether the number of search target records for the split target query exceeds a preset reference record number;

A third step in which, when the number of search target records exceeds the preset reference record number in the second step, the master server identifies the slave servers on which the original and the replicas exist;

A fourth step in which the master server generates a range query for each slave server by dividing the record area based on the number of search target records and the number of slave servers on which a replica exists; and

A fifth step in which the master server transmits the range queries to the slave servers simultaneously, and collects and merges the corresponding range query execution result data from each slave server to generate a result for the query requested by the client.
The method of claim 1,

wherein in the first step, the master server parses the query and determines that it is a split target query if it is not a unique scan that searches for a single record.
The method according to claim 1 or 2,

wherein in the first step, the master server determines that the query is a split target query when the target table includes a column whose data type is a number (INT) or a date (DATE).
The method of claim 1,

wherein in the second step, the master server sets the reference record number differently according to the query condition, based on the query execution time.
The method of claim 1,

wherein in the fourth step, the master server sets the record range of the range query to be provided to each slave server based on the current load of the slave server in which the copy is stored.
The method of claim 1,

wherein in the fourth step, the master server converts the sharding condition column in the WHERE condition of the query into a range condition column and sets a range corresponding to each divided record area as the range condition column value, thereby generating range queries having a different range condition column value for each slave server.