CN110909077A

CN110909077A - Distributed storage method

Info

Publication number: CN110909077A
Application number: CN201911071974.8A
Authority: CN
Inventors: 康俊忠; 蒲思羽
Original assignee: Sichuan Zhongxun Yike Technology Co Ltd
Current assignee: Sichuan Zhongxun Yike Technology Co Ltd
Priority date: 2019-11-05
Filing date: 2019-11-05
Publication date: 2020-03-24

Abstract

The invention discloses a distributed storage method for storing and querying big data in a cloud storage system that comprises a main node, distributed computing nodes and data nodes. A data management engine runs on the main node; it receives user queries, compiles, converts and optimizes them, generates a query execution plan and executes the query, while also performing metadata management and node monitoring. A server process runs on the distributed computing nodes and executes distributed computing tasks. A worker process of the distributed computing framework and a single-node database are deployed on each data node, and the data tables are stored in the databases of the data nodes. The sub-queries into which a user query is converted are executed either in a database or in the distributed computing framework. A hybrid data repository architecture that combines a database and a distributed computing framework is thereby presented.

Description

Distributed storage method

Technical Field

The invention relates to the technical field of data processing, in particular to a distributed storage method.

Background

With the rapid development of applications such as the mobile internet and the internet of things, the global data volume has grown explosively, signalling that the era of big data has arrived. Network operators have very large user bases and control both the terminals and the users' internet access channels, so they have a good data foundation for user behavior analysis. Analyzing the characteristics and patterns of users' traffic behavior in depth, and discovering their potential consumption needs, is an effective means of improving value and the level of operation. However, it is not only the growing data size but also the variety of data types and the real-time processing requirements that greatly increase the complexity of big data processing. Big data poses technical challenges for traditional data analysis technologies (e.g., parallel databases and data warehouses). Traditional technologies cannot meet the scalability and volume requirements of big data, and they typically handle only a single type of data, whereas big data is characterized by huge volume, complex structure and numerous types; this raises new challenges for its storage, processing and analysis. Because of their efficiency and stability, parallel databases have been the first choice for high-performance data analysis. For cost reasons, however, and with the spread of cloud computing service platforms, large-scale data analysis tasks are moving from the high-end servers on which parallel databases are deployed to cheaper, lower-end server clusters with a shared-nothing architecture; this is the cost bottleneck that mass data analysis currently needs to solve.
No effective solution to the above problems has yet been proposed in the related art.

Disclosure of Invention

The invention aims to provide a distributed storage method for realizing storage and query of big data in a cloud storage system, wherein the cloud storage system comprises a main node, distributed computing nodes and data nodes,

the method comprises the steps that a data management engine runs on a main node, receives user query, compiles, converts and optimizes the query, generates a query execution plan, executes the query, and simultaneously performs metadata management and node monitoring; running a server process on the distributed computing nodes and executing a distributed computing task;

deploying a work process of distributed computation and a single-node database in a data node, and storing a data table in the database of the data node;

executing the sub-queries converted from the user query in a database or in a distributed computing framework;

the data table adopts a two-dimensional relational table structure and is stored using either independent partitioned storage or combined partitioned storage; when a table is partitioned independently, the number of partitions n, the partition key attribute column AP on which the partitioning is based, and a redundancy coefficient k are specified; for each tuple of the table to be partitioned, the partition ID to which the tuple belongs is calculated from the value of its partition key AP, and the tuple is then stored in the databases of the one or more nodes corresponding to that partition; if the partition key AP of the fact table A is a foreign key of table A that references the primary key BP of the dimension table B, that is, the partition key AP of table A is also the join key used when table A is joined with table B, the cross-node join operation is converted into a local join operation and pushed down to the database for execution, and in this case the data of the two tables is partitioned in combination;

when performing combined partitioning on a table, using hash-based partitioning or range-based partitioning to divide data into p independent partitions, each partitioned data being stored on k different nodes;

if the table B is subjected to combined division depending on the table A, the division number of the table B is equal to that of the table A, the redundancy coefficient kB of the table B is equal to that of the table A, and the storage node of each division of the table B is the storage node of the corresponding division of the table A;

if the redundancy coefficient kB of the table B is smaller than the redundancy coefficient kA of the table A, the storage nodes of each partition of the table B are the first kB nodes among the storage nodes of the corresponding partition of the table A;

if the redundancy coefficient kB of the table B is greater than the redundancy coefficient kA of the table A, the storage nodes of each partition of the table B comprise the storage nodes of the corresponding partition of the table A and are extended by a further (kB-kA) nodes, which are the nodes immediately following the original node chain.

Preferably, when the tuples of a table are partitioned independently, either hash-based partitioning or range-based partitioning is used; hash-based partitioning applies a suitable hash function to the partition key AP of the tuple and takes the resulting hash value modulo the number of partitions n to obtain the partition ID of the tuple, with different hash functions applied to different data types; range-based partitioning divides the candidate value interval of the attribute column AP into a plurality of contiguous ranges in advance, each range corresponding to one partition, and the range in which the value of the tuple's attribute column AP falls determines the partition of the tuple.

Preferably, the query execution further comprises: 1) a user submits a query through a client, and a data management engine receives the user query; 2) performing lexical and syntactic analysis on the query statement to generate a syntax tree, then converting the syntax tree into a standard relational algebra tree, and performing semantic inspection; converting the relational algebra tree into a logic query plan, and applying heuristic rules to preliminarily optimize the logic query plan; selecting an optimal query path according to the cost model to generate an actual query plan; converting an actual query plan into a task scheduling graph, wherein each task in the task scheduling graph is a sub-query and corresponds to a distributed computing task, and each task can be executed only after the task on which the task depends is executed; 3) scheduling and monitoring the execution of tasks, sequentially submitting the tasks to a distributed computing server according to execution dependency relations among the tasks, reporting the execution state of each task, storing an intermediate result or a final result generated after the execution of a single task into a table of a database or writing the intermediate result or the final result into a distributed file system, and realizing the transmission of input and output data among different tasks in a data materialization mode; 4) and returning the finally generated result to the user.

Preferably, the data management engine further comprises: the metadata management module is used for storing metadata information of a database, wherein the metadata comprises a mode of a data table, a dividing and storing method of table data and data node information;

the query compiling module is used for compiling the query submitted by the user to generate a logic query plan; the query optimization module is used for optimizing the logic query plan by using a rule-based and cost-based method to obtain an actual query plan, then converting the actual query plan into a task scheduling graph consisting of distributed computing tasks, and submitting the task scheduling graph to the query execution module for execution;

the query execution module is used for scheduling the distributed computing tasks, synchronously executing the scheduling according to the dependency relationship of each task and monitoring the execution state of the tasks, wherein each task can be started only after all the tasks which depend on the task are successfully executed;

and the node monitoring and load balancing module is used for polling the state of each data node at regular time, updating corresponding metadata after the node is found to be invalid, adding new redundancy to the data when the redundancy value is lower than a predefined threshold value, periodically checking the data distribution state, and redistributing the data when the load of the node is found to be unbalanced.

Compared with the prior art, the invention has the following beneficial effects: a hybrid data repository architecture that combines a database and a distributed computing framework is presented. The distributed storage method for big data is improved, the opportunities for pushing queries down to the database for execution are increased, and the data transmission cost caused by cross-node joins is avoided as far as possible. The queue-based task scheduling algorithm improves query parallelism; a lightweight response mode for simple queries is also supported; and the method has good loading performance, query performance and fault tolerance.

Drawings

FIG. 1 is a flow chart of a distributed storage method.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

Example 1

One aspect of the invention provides a distributed storage method and system. Fig. 1 is a flowchart of a distributed storage method according to an embodiment of the present invention. As shown in FIG. 1, the embodiment for carrying out the present invention is as follows:

The cloud storage system is deployed on a shared-nothing cluster and is implemented as middleware, with Hadoop as the computing layer and single-node databases as the storage layer. The system consists of three parts: a main node, distributed computing nodes (Hadoop nodes) and data nodes. The main node runs the engine of the invention; it is responsible for receiving user queries, compiling, converting and optimizing them, generating a query execution plan and executing the query, and it also handles metadata management and node monitoring. The Hadoop server process runs on the Hadoop nodes and is responsible for executing Hadoop tasks. Each data node hosts a Hadoop worker process together with a single-node database. The data tables are stored in the databases of the data nodes, and a user query is converted into several sub-queries, each of which is executed either in a database or by the Hadoop distributed computing framework.

According to a further aspect of the present invention, the data management engine for managing big data storage and queries provides the following modules:

Metadata management module. It stores the metadata of the database, including the schemas of the data tables, the partitioning and storage method of the table data, data node information and the like; the metadata is kept in a dedicated database.

Query compiling and optimizing module. The query compiling module compiles the query submitted by the user to generate a logical query plan; the query optimization module optimizes the logical query plan using rule-based and cost-based methods to obtain an actual query plan, then converts the actual query plan into a task scheduling graph consisting of Hadoop tasks and submits it to the query execution module for execution.

Query execution module. Its main task is to schedule the Hadoop tasks in order. Scheduling follows the dependency relationships between tasks: each task can be started only after all the tasks it depends on have executed successfully. The module is responsible for scheduling the tasks and monitoring their execution state.

Node monitoring and load balancing module. It polls the state of each data node at regular intervals and updates the corresponding metadata promptly once a node is found to have failed. When the redundancy of a piece of data falls below a predefined threshold, the module adds new redundant copies of that data. It also checks the data distribution periodically and redistributes data when the node load is found to be unbalanced.

The data table of the present invention uses a two-dimensional relationship table structure to represent entities and connections between entities. Each row of the relational table represents a tuple and each column is called an attribute. In the relational model, both entities and relationships between entities are represented using relational tables. The system has a fact table and a plurality of dimension tables, all of which are directly connected to the fact table. For a table storing large data, a single node cannot store all data, and thus the data must be divided and then distributed to be stored in a plurality of nodes. Because of the many join operations involving fact tables and dimension tables in the database, these join operations entail a large amount of network traffic. To improve the efficiency of the query, consideration must be given to how to minimize the network transmission of data, such as making the connection operation as local as possible without having to be performed across nodes.

The invention provides two data table storage methods, which comprise the following steps:

1. Independent partitioned storage of tables

"Independent" means that the data distribution strategy of the table is not affected by other tables; the method is best suited to fact tables with large data volumes. When a table is partitioned independently, the number of partitions n, the partition key attribute column AP on which the partitioning is based, and the redundancy coefficient k need to be specified. For each tuple of the table to be partitioned, the partition ID to which the tuple belongs is calculated from the value of its partition key AP, and the tuple is then stored in the databases of the one or more nodes corresponding to that partition.

The present invention supports two ways of partitioning the tuples of a table: hash-based partitioning and range-based partitioning. Hash-based partitioning applies a suitable hash function to the partition key AP of the tuple and takes the resulting hash value modulo the number of partitions n to obtain the partition ID of the tuple.

Hash-based partitioning requires a hash function to be specified, and an unsuitable hash function easily leads to an uneven distribution and hence to data skew. The system of the invention therefore applies different hash functions to different data types in order to avoid data skew as far as possible.

Range-based partitioning divides the candidate value interval of the attribute column AP into a plurality of contiguous ranges in advance, each range corresponding to one partition, and the range in which the value of the tuple's attribute column AP falls determines the partition of the tuple. Range-based partitioning is generally suitable for date-typed data; storing the data of different time ranges on different nodes can effectively improve query efficiency.
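The two partitioning modes described above can be illustrated with a minimal Python sketch. The function names, the multiplicative constant used for integers, the use of SHA-256 for other types, and the convention that each range is given by an exclusive upper bound are all assumptions made for illustration; the source does not specify which hash functions or range boundaries are used.

```python
import bisect
import hashlib
from datetime import date


def hash_partition_id(key, n):
    """Hash-based partitioning: hash the partition key AP, then take it modulo n."""
    if isinstance(key, int):
        # Assumed choice for integer keys: a simple multiplicative hash.
        hashed = (key * 2654435761) & 0xFFFFFFFF
    else:
        # Assumed choice for strings, dates and other types: hash the text form.
        digest = hashlib.sha256(str(key).encode("utf-8")).digest()
        hashed = int.from_bytes(digest[:8], "big")
    return hashed % n


def range_partition_id(key, upper_bounds):
    """Range-based partitioning over contiguous, pre-defined ranges.

    upper_bounds must be sorted; upper_bounds[i] is the exclusive upper bound
    of partition i, and the last partition (index len(upper_bounds)) is open-ended.
    """
    return bisect.bisect_right(upper_bounds, key)


# Usage: three yearly ranges for a date-typed partition key.
bounds = [date(2019, 1, 1), date(2020, 1, 1)]
print(range_partition_id(date(2019, 6, 30), bounds))
print(hash_partition_id("user-1001", 8))
```

Both functions are deterministic, so any node can recompute the partition of a tuple from its key alone.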

Further, the present invention uses an improved chain distribution rule for the distributed redundant storage of data. Specifically, in a cluster with n nodes, where table A uses a partitioning method to divide its data into p partitions, p nodes are selected as storage nodes; the data of partition i is stored on node i, and its k backup copies are stored on nodes i+1, i+2, …, i+k (modulo p). The data of partition i is lost only if all of the nodes holding its copies fail at the same time.
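As a rough illustration of the chain distribution rule, the sketch below computes which nodes hold a given partition. It takes the description literally, with the primary copy on node i plus k backup copies on the following nodes, and numbers the p storage nodes 0 to p-1; the function names and the requirement k < p are assumptions for the example.

```python
def chain_nodes(partition, p, k):
    """Nodes holding one partition: the primary node first, then k backups along the chain."""
    if not 0 <= partition < p:
        raise ValueError("partition must be in the range 0..p-1")
    if not 0 <= k < p:
        raise ValueError("need 0 <= k < p so that all copies land on distinct nodes")
    return [(partition + j) % p for j in range(k + 1)]


def is_lost(partition, p, k, failed_nodes):
    """A partition is lost only when every node holding a copy of it has failed."""
    return all(node in failed_nodes for node in chain_nodes(partition, p, k))


# Usage: 5 storage nodes, 2 backups per partition.
print(chain_nodes(4, 5, 2))          # the chain wraps around to nodes 0 and 1
print(is_lost(4, 5, 2, {0, 1}))      # primary node 4 is still alive
```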

2. Combined partitioned storage of tables

With independent partitioning, the number of partitions and the node distribution of a data table are completely independent of other tables. A large part of the cost of a join operation, however, comes from network transmission. If the partition key AP of the fact table A is also a foreign key of table A that references the primary key BP of the dimension table B, that is, the partition key AP of table A is also the join key used when table A is joined with table B and the join condition is AP = BP, then the cross-node join can be converted into a local join and pushed down to the database for execution, provided that the matching data of the two tables is placed together on the same nodes.

To this end, the present invention designs combined partitioning, which uses either hash-based or range-based partitioning to divide the data into p independent partitions, with the data of each partition stored on k different nodes. "Combined" means that the data distribution strategy of a table depends on another table, so the number of partitions and the data distribution of a table partitioned in combination are constrained. If table B is partitioned in combination with, and in dependence on, table A, the number of partitions of table B equals that of table A, and the storage nodes of each partition of table B are determined by those of the corresponding partition of table A. The following 3 cases are distinguished:

1) If the redundancy coefficient kB of table B is equal to the redundancy coefficient kA of table A, the storage nodes of each partition of table B are exactly the storage nodes of the corresponding partition of table A.

2) If the redundancy coefficient kB of table B is smaller than the redundancy coefficient kA of table A, the storage nodes of each partition of table B are the first kB nodes among the storage nodes of the corresponding partition of table A.

3) If the redundancy coefficient kB of table B is greater than the redundancy coefficient kA of table A, the storage nodes of each partition of table B include the storage nodes of the corresponding partition of table A and must be extended further; the additional (kB-kA) nodes are the nodes immediately following the original node chain, forming an extended chain.

It can be shown that the data distribution obtained by combined partitioning still satisfies the chain distribution rule in each of the three cases for the redundancy coefficient of table B. Combined partitioning and storage increases the opportunities for local join operations and avoids, as far as possible, the data transmission cost of cross-node joins. Joins can then be conveniently pushed down to the database for execution, and the database's own query optimization yields higher query efficiency.
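The three cases for the storage nodes of table B can be sketched as a single function. In this sketch kA and kB are taken to be the number of storage nodes per partition, following the statement that each partition is stored on k different nodes; the function name, the list representation of a node chain and the `total_nodes` parameter are assumptions made for illustration.

```python
def combined_nodes(chain_a, k_b, total_nodes):
    """Storage nodes for one partition of table B, derived from table A's chain.

    chain_a     -- ordered node IDs holding the corresponding partition of table A
    k_b         -- number of nodes that should hold the partition of table B
    total_nodes -- number of storage nodes in the chain (IDs 0..total_nodes-1)
    """
    k_a = len(chain_a)
    if not 1 <= k_b <= total_nodes:
        raise ValueError("k_b must be between 1 and total_nodes")
    if k_b <= k_a:
        # Cases 1 and 2: reuse the first k_b nodes of table A's chain.
        return chain_a[:k_b]
    # Case 3: extend the chain with the (k_b - k_a) nodes that follow it.
    extra = [(chain_a[-1] + j) % total_nodes for j in range(1, k_b - k_a + 1)]
    return chain_a + extra


# Usage: table A's partition lives on nodes 4, 5 and 0 of a 6-node chain.
print(combined_nodes([4, 5, 0], 3, 6))
print(combined_nodes([4, 5, 0], 2, 6))
print(combined_nodes([4, 5, 0], 5, 6))
```

In every case the nodes of table B's partition overlap those of table A's partition, which is what allows a join on AP = BP to run locally on the shared nodes.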

According to another aspect of the invention, a query method based on the above architecture and storage method is provided.

The present invention supports a subset of the standard SQL language, supports join operations for multiple tables and common aggregation functions such as SUM, COUNT, AVG, etc.

The invention also supports a simple distributed computing extension: users can define their own Map and Reduce functions. The input data of the Mapper is supplied by the underlying framework of the invention, and the user can specify which table the Mapper input comes from and the SQL statement used to fetch the data from that table.

According to the preferred embodiment of the present invention, the query execution process mainly comprises the following steps, which are divided into query submission, compilation and optimization, execution and result return:

1) the user submits the query through the client, and the data management engine hands the query to the query compiling and optimizing module.

2) The query compiling and optimizing module compiles and then optimizes the query. The query compiling module first performs lexical and syntactic analysis on the query statement to generate a syntax tree and then converts the syntax tree into a standard relational algebra tree; this process also involves semantic checks, such as checking whether a table exists and whether data types match. The query optimization module first converts the relational algebra tree into a logical query plan and performs a preliminary optimization of it by applying heuristic rules, such as pushing down projections and selection predicates; it then selects an optimal query path according to the cost model to generate an actual query plan. Finally, the actual query plan is converted into a task scheduling graph and submitted to the query execution module for execution. Each task in the task scheduling graph is a sub-query and corresponds to a Hadoop task. Execution dependencies exist between the tasks: each task can be executed only after the tasks it depends on have completed, and circular dependencies are not allowed, so the task scheduling graph is a directed acyclic graph. The metadata database is accessed throughout the process to obtain the required metadata.
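The requirement that the task scheduling graph be a directed acyclic graph can be checked with a topological sort. The following is a minimal sketch using Python's standard `graphlib` module (Python 3.9 or later); the task names and the dictionary representation of the graph are hypothetical and are not taken from the source.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(depends_on):
    """Return one valid execution order for a task scheduling graph.

    depends_on maps each task to the set of tasks it depends on.
    Raises ValueError if the graph contains a circular dependency.
    """
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as exc:
        # The detected cycle is carried in the second element of exc.args.
        raise ValueError(f"circular dependency between tasks: {exc.args[1]}") from exc


# Usage: two scans feed a join, whose output feeds an aggregation.
plan = {"join": {"scan_a", "scan_b"}, "aggregate": {"join"}}
print(execution_order(plan))
```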

3) The query execution module is responsible for scheduling and monitoring the execution of tasks, orderly submits the tasks to the Hadoop server according to the execution dependency relationship among the tasks, and reports the execution state of each task. Multiple tasks may be performed concurrently. Intermediate results or final results generated after the execution of a single task are stored in a table of a database or written into a Hadoop distributed file system. The transmission of input and output data is realized among different tasks in a data materialization mode.

4) The final result is returned to the user, who can choose to have it output to the terminal or stored in the database.

It can be seen that the present invention seamlessly combines the underlying database storage with the upper distributed computing framework and flexibly utilizes and combines various execution paths to obtain an optimal query execution scheme.

As regards task scheduling, the task scheduling graph received by the query execution module is a directed acyclic graph whose nodes are individual Hadoop tasks and whose directed edges represent the dependency relationships between tasks. Scheduling must respect the dependency order among tasks while parallelizing their execution as far as possible: at any moment there may be several mutually independent executable tasks, and running them serially would leave resources underused. The invention preferably uses a queue-based task scheduling algorithm with 5 queues (waiting, ready, running, success and failure), each corresponding to a task state. Initially, all tasks are in the waiting queue. The execution module traverses the tasks in the waiting queue and moves a task to the ready queue once all the tasks it depends on have executed successfully. Tasks in the ready queue are submitted to the Hadoop server and moved to the running queue; submission is asynchronous, that is, the query execution module does not block waiting for a task to complete. The execution module periodically checks the state of each task in the running queue, moving it to the success queue if it has succeeded and to the failure queue if it has failed. The execution module repeats this process until all tasks have executed successfully or any task has failed; the query as a whole succeeds only if all tasks execute successfully.

The parallelism of query execution comes from asynchronous task submission: the execution module is not blocked waiting for a task to complete, so when several tasks become executable at the same time, it submits them to the Hadoop server almost simultaneously, and their executions overlap and share resources.

Hadoop tasks also carry a start-up cost. For a simple query that is converted into Hadoop tasks following the normal query execution process, the start-up time of the Hadoop task is likely to account for most of the overall response time. The invention therefore provides another query execution mode, a lightweight response mode, in which SQL requests are handled by the query interpreter, the query optimization module and the query execution module without launching Hadoop tasks. When a query is simple enough to be executed without a Hadoop task, the query execution module connects directly to the databases of the nodes to execute the query, merges the results of all nodes locally, performs any necessary aggregation, and returns the final result. This avoids the start-up cost of the Hadoop task and greatly shortens the response time of the whole query.
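The five-queue scheduling loop can be sketched as follows. This is only an illustration under stated assumptions: `submit` and `poll` are hypothetical callbacks standing in for asynchronous submission to the Hadoop server and for the periodic status check, and a real scheduler would wait between polling rounds instead of looping immediately.

```python
from collections import deque


def run_query(depends_on, submit, poll):
    """Run all tasks of a scheduling graph; return True only if every task succeeds.

    depends_on -- maps every task to the set of tasks it depends on
    submit     -- hypothetical non-blocking callback that starts a task
    poll       -- hypothetical callback returning "running", "success" or "failed"
    """
    waiting = deque(depends_on)
    ready, running, success, failed = deque(), deque(), deque(), deque()

    while not failed:
        # waiting -> ready: all dependencies have executed successfully.
        for _ in range(len(waiting)):
            task = waiting.popleft()
            if all(dep in success for dep in depends_on[task]):
                ready.append(task)
            else:
                waiting.append(task)
        # ready -> running: submission is asynchronous and does not block.
        while ready:
            task = ready.popleft()
            submit(task)
            running.append(task)
        if not running:
            break  # nothing in flight: either finished or stuck on unmet dependencies
        # running -> success / failure, based on the reported task state.
        for _ in range(len(running)):
            task = running.popleft()
            state = poll(task)
            if state == "success":
                success.append(task)
            elif state == "failed":
                failed.append(task)
            else:
                running.append(task)

    return not failed and not waiting


# Usage with stand-in callbacks that record submissions and always succeed.
submitted = []
graph = {"scan_a": set(), "scan_b": set(), "join": {"scan_a", "scan_b"}}
print(run_query(graph, submitted.append, lambda task: "success"), submitted)
```

Because both scans become ready in the same round, they are submitted back to back before either is polled, which mirrors the overlap of independent tasks described above.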
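The local merge step of the lightweight response mode can be illustrated for an AVG query. In this sketch, in-memory SQLite databases stand in for the single-node databases, and the table and column names are hypothetical; the source does not say which database product is used. Each node returns a partial SUM and COUNT, and the average is computed only after the partial results are combined.

```python
import sqlite3


def lightweight_avg(connections, table, column):
    """Compute AVG(column) over a table partitioned across several node databases.

    The table and column names are interpolated directly into the SQL text,
    so this sketch assumes they come from trusted metadata, not user input.
    """
    total, count = 0.0, 0
    for conn in connections:
        query = f"SELECT SUM({column}), COUNT({column}) FROM {table}"
        node_sum, node_count = conn.execute(query).fetchone()
        total += node_sum or 0.0  # SUM is NULL when a node holds no rows
        count += node_count
    return total / count if count else None


def make_node(values):
    """Create a stand-in node database holding one partition of a sales table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?)", [(v,) for v in values])
    return conn


nodes = [make_node([1.0, 2.0, 3.0]), make_node([10.0])]
print(lightweight_avg(nodes, "sales", "amount"))
```

Averaging the per-node averages would weight the two partitions equally and give a different result, which is why the partial sums and counts are merged first.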

According to another aspect of the present invention, there is provided a distributed storage system including a master node, distributed computing nodes, and data nodes, wherein:

the main node is used for operating a data management engine which is configured to receive user query, compile, convert and optimize the query, generate a query execution plan, execute the query, and simultaneously perform metadata management and node monitoring;

the distributed computing nodes are used for running server processes and executing distributed computing tasks;

the data node is used for deploying work processes of distributed computing and a single-node database, data tables are stored in the database,

wherein the sub-queries converted from the user query are executed in a database or in a distributed computing framework.

The data management engine further comprises:

the metadata management module is used for storing metadata information of a database, wherein the metadata comprises a mode of a data table, a dividing and storing method of table data and data node information;

the query compiling module is used for compiling the query submitted by the user to generate a logic query plan;

the query optimization module is used for optimizing the logic query plan by using a rule-based and cost-based method to obtain an actual query plan, then converting the actual query plan into a task scheduling graph consisting of distributed computing tasks, and submitting the task scheduling graph to the query execution module for execution;

In summary, the present invention proposes a hybrid data repository architecture that combines a database and a distributed computing framework. The distributed storage method is improved, the opportunities for pushing queries down to the database for execution are increased, and the data transmission cost caused by cross-node joins is avoided as far as possible. The queue-based task scheduling algorithm improves query parallelism; a lightweight response mode for simple queries is also supported; and the method has good loading performance, query performance and fault tolerance.


The above is only a preferred embodiment of the present invention, and the scope of protection of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solutions and inventive concept, shall fall within the scope of protection of the present invention.

Claims

1. A distributed storage method is used for realizing storage and query of big data in a cloud storage system, the cloud storage system comprises a main node, distributed computing nodes and data nodes,

2. A distributed storage method according to claim 1, wherein: when the tuples of a table are partitioned independently, either hash-based partitioning or range-based partitioning is used; hash-based partitioning applies a suitable hash function to the partition key AP of the tuple and takes the resulting hash value modulo the number of partitions n to obtain the partition ID of the tuple, with different hash functions applied to different data types; range-based partitioning divides the candidate value interval of the attribute column AP into a plurality of contiguous ranges in advance, each range corresponding to one partition, and the range in which the value of the tuple's attribute column AP falls determines the partition of the tuple.

3. A distributed storage method according to claim 1, wherein: the query execution further comprises: 1) a user submits a query through a client, and a data management engine receives the user query; 2) performing lexical and syntactic analysis on the query statement to generate a syntax tree, then converting the syntax tree into a standard relational algebra tree, and performing semantic inspection; converting the relational algebra tree into a logic query plan, and applying heuristic rules to preliminarily optimize the logic query plan; selecting an optimal query path according to the cost model to generate an actual query plan; converting an actual query plan into a task scheduling graph, wherein each task in the task scheduling graph is a sub-query and corresponds to a distributed computing task, and each task can be executed only after the task on which the task depends is executed; 3) scheduling and monitoring the execution of tasks, sequentially submitting the tasks to a distributed computing server according to execution dependency relations among the tasks, reporting the execution state of each task, storing an intermediate result or a final result generated after the execution of a single task into a table of a database or writing the intermediate result or the final result into a distributed file system, and realizing the transmission of input and output data among different tasks in a data materialization mode; 4) and returning the finally generated result to the user.

4. A distributed storage method according to claim 1, wherein: the data management engine further comprises: the metadata management module is used for storing metadata information of a database, wherein the metadata comprises a mode of a data table, a dividing and storing method of table data and data node information;