CN105824957B

CN105824957B - Query engine system and query method for a distributed in-memory columnar database

Info

Publication number: CN105824957B
Application number: CN201610193220.XA
Authority: CN
Inventors: 段翰聪; 王瑾; 闵革勇; 聂晓文; 郑松; 张博
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2016-03-30
Filing date: 2016-03-30
Publication date: 2019-09-03
Anticipated expiration: 2036-03-30
Also published as: CN105824957A

Abstract

The invention discloses a query engine system and a query method for a distributed in-memory columnar database. The query method includes: a resource management module designates a master query engine to be responsible for the session with the user; the master query engine converts the SQL statement sent by the user into a query plan; the resource management module allocates slave query engines to the master query engine; the master query engine divides the query plan into at least two subtasks and assigns a slave query engine to each subtask; each slave query engine executes the current subtask once all predecessor subtasks of the current subtask have finished, transmits the intermediate data produced by the current subtask to the slave query engine hosting the successor subtask, and sends the completion status of the current subtask to the master query engine; and the master query engine notifies the client to obtain the final result data from the slave query engine. The query engine system and query method provided by the invention achieve good query efficiency.

Description

Query engine system and query method for a distributed in-memory columnar database

Technical field

The present invention relates to the field of database technology, and in particular to a query engine system and a query method for a distributed in-memory columnar database.

Background

NewSQL is a collective name for a variety of new scalable, high-performance databases. Such databases not only offer the mass-data storage and management capability of NoSQL, but also retain traditional database features such as ACID and SQL support. NewSQL systems fall roughly into three classes: new architectures, which adopt a different design on a completely new database platform, such as Google Spanner, Clustrix, VoltDB and MemSQL; SQL engines, which are highly optimized SQL storage engines that provide the same programming interface as MySQL but scale better than the built-in InnoDB engine; and transparent sharding, which provides a sharding middleware layer so that the database is automatically split across multiple nodes. Over time these three classes of NewSQL database have gradually merged, giving rise to large-scale distributed in-memory columnar databases oriented toward online analytical processing (OLAP, Online Analytical Processing).

The query engine is the core of a database system and is responsible for scheduling the execution of all query computation tasks in the system. An SQL statement entered by a user first undergoes lexical and syntactic parsing in the database system to generate a syntax tree; the query optimizer then transforms the syntax tree, which is finally converted into a query plan that the query engine can recognize. The query plan tells the query engine how to execute the query: how to extract data from the underlying storage engine and how to transform that data into the result the user wants.

HIVE is a Hadoop-based data warehouse tool that provides a simple SQL query capability by converting SQL statements into MapReduce tasks for execution. For the SQL statement SELECT c_custkey FROM customer JOIN nation ON customer.C_NATIONKEY = nation.N_NATIONKEY JOIN lineitem ON lineitem.L_PARTKEY = customer.C_CUSTKEY, the query plan and task execution flow of HIVE are shown in Fig. 1. What HIVE actually executes are MapReduce tasks, so the query plan is converted into a set of MapReduce tasks; here the original query plan is converted into two MapReduce tasks. JOB1 computes Join1, that is, the join of the lineitem table and the customer table; JOB2 computes Join2, that is, the join of the Join1 result and the nation table, and outputs the final result. JOB2 can start only after JOB1 has finished and written its intermediate result data to the external storage system; JOB2 then reads the intermediate result produced by JOB1 from the external storage system and carries out its computation. The shortcoming of HIVE is evident: because its bottom layer uses the MapReduce computing model, data can be shared between two MapReduce tasks only by having the first task write its result to an external storage system (a distributed file system or a local file system) and having the second task read it back. This causes a large amount of disk I/O and makes the latency of the whole query process high.

Spark-SQL is another data warehouse tool with functionality similar to HIVE, but its bottom layer uses the Spark computing model rather than the MapReduce computing model. For the SQL statement SELECT c_custkey FROM customer JOIN nation ON customer.C_NATIONKEY = nation.N_NATIONKEY JOIN lineitem ON lineitem.L_PARTKEY = customer.C_CUSTKEY, the query plan and task execution flow of Spark-SQL are shown in Fig. 2. Stage1 mainly handles ScanTable(lineitem) and ScanTable(customer) in the query plan, which correspond to RDD1 and RDD2 respectively. Because an RDD is a resilient distributed dataset that spans multiple physical nodes, each of which can execute the corresponding task, one RDD is obtained from multiple Tasks executed in parallel; for example, RDD1 is computed by Task1-1 and Task1-2. After the contents of the lineitem and customer tables have been read, Stage2 mainly handles the Join1 operation and the ScanTable(nation) operation, generating RDD3 and RDD4 respectively. Finally, Stage3 completes the Join2 operation. Spark-SQL has much lower computation latency than HIVE, but it still has some shortcomings.

First, the bottom layer of Spark-SQL is implemented in Scala and runs entirely on the Java virtual machine, so its memory management depends on the Java virtual machine. The memory management mechanism of the Java virtual machine is general-purpose and is not tailored to database query engines, which causes Spark-SQL to consume a large amount of memory during computation.

Second, Spark-SQL executes tasks stage by stage: Stage2 can start only after Stage1 has finished, and Stage3 can start only after Stage2 has finished. Each Stage contains several Tasks that can be executed in parallel, and the execution latency of a Stage is determined by the longest-running Task in that Stage. This creates a problem: a Task that finishes quickly must wait for the unfinished Tasks in the same Stage, and the Tasks of the next Stage can start only after every Task of the current Stage has finished. For example, suppose Stage1 contains Task1-1, Task1-2, Task2-1 and Task2-2, Stage2 contains Task3-1 and Task3-2, and Task3-1 depends on the results of Task1-1, Task1-2 and Task2-1. If Task1-1, Task1-2 and Task2-1 have finished but Task2-2 has not, then even though Task3-1 satisfies its execution condition, under the constraints of the Spark computing framework it still cannot start and must wait until Task2-2 finishes. If Task2-2 runs for too long, the latency of the whole computation suffers.
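The waiting problem described above can be made concrete with a small timing sketch. The durations are invented for illustration, unlimited parallelism is assumed, and the names `stage_barrier_start` and `dependency_start` are hypothetical helpers, not part of Spark. The sketch compares the earliest start time of the task that depends only on Task1-1, Task1-2 and Task2-1 (called Task3-1 here) under a stage barrier and under pure dependency-driven scheduling.

```python
# Illustrative task durations in seconds (made up for this sketch).
durations = {"Task1-1": 2, "Task1-2": 3, "Task2-1": 2, "Task2-2": 10, "Task3-1": 4}

# Stage membership: Stage1 holds the four scan tasks, Stage2 holds Task3-1.
stages = [["Task1-1", "Task1-2", "Task2-1", "Task2-2"], ["Task3-1"]]

# Direct predecessors of each task; tasks without an entry have none.
predecessors = {"Task3-1": ["Task1-1", "Task1-2", "Task2-1"]}


def stage_barrier_start(task):
    """Earliest start when a task must wait for every task of all earlier stages."""
    start = 0
    for stage in stages:
        if task in stage:
            return start
        # The next stage begins only when the slowest task of this stage ends.
        start += max(durations[t] for t in stage)
    raise KeyError(task)


def dependency_start(task):
    """Earliest start when a task waits only for its own predecessors."""
    return max(
        (dependency_start(p) + durations[p] for p in predecessors.get(task, [])),
        default=0,
    )


barrier = stage_barrier_start("Task3-1")   # waits for the slow Task2-2
eager = dependency_start("Task3-1")        # waits only for its three inputs
print(barrier, eager)
```

With these made-up numbers, the stage barrier delays Task3-1 by the full running time of Task2-2, although Task3-1 never reads its output.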

Summary of the invention

The problem to be solved by the present invention is the low computational efficiency of existing database query engines.

The present invention is achieved through the following technical solutions:

A query engine system for a distributed in-memory columnar database includes a resource management module, at least one master query engine and at least one slave query engine. The master query engine is configured to convert an SQL statement into a query plan, divide the query plan into at least two subtasks, and monitor and schedule the execution of the query plan. The slave query engine is configured to execute the subtasks distributed by the master query engine. The resource management module is responsible for managing and allocating system resources.

Optionally, the system resources include CPU computing resources and memory resources.

Based on the above query engine system, the present invention also provides a query method for a distributed in-memory columnar database, comprising: a resource management module designates a master query engine to be responsible for the session with the user; the master query engine converts the SQL statement sent by the user into a query plan; the resource management module allocates slave query engines to the master query engine and establishes communication between the slave query engines and the master query engine; the master query engine divides the query plan into at least two subtasks and assigns a slave query engine to each subtask; each slave query engine adds its subtasks to a task queue, executes the current subtask after all predecessor subtasks of the current subtask have finished, transmits the intermediate data produced by the current subtask to the slave query engine hosting the successor subtask, and sends the completion status of the current subtask to the master query engine; and after the entire query plan has completed, the master query engine notifies the client to obtain the final result data from the slave query engine.

The present invention divides the query plan into several subtasks with dependencies among them and distributes each subtask to the task queue of the corresponding slave query engine, which executes the subtasks in its task queue in turn. This avoids the shortcoming of Spark-SQL, in which a task of a later stage that already satisfies its execution condition still cannot start because of the limits of the Spark-SQL execution framework. The query method for a distributed in-memory columnar database provided by the present invention therefore achieves good query efficiency.

Optionally, each subtask is represented by a physical operator, and the physical operators include at least one of a column data extraction operation, a join operation, a condition filter operation, a grouping operation, an aggregate function operation, a sort operation, and an operation that restores the final result data into a row table.

Optionally, the master query engine assigns a slave query engine to each subtask according to a cost model. Using the cost model, each subtask can be assigned the slave query engine with the smallest execution cost, which further improves query efficiency.

Optionally, assigning a slave query engine to each subtask according to the cost model includes: obtaining, from the metadata of the slave query engines, the IP of the node where each slave query engine resides and the database table and column information stored on that node; assigning the execution node IP of each column data extraction operation in the query plan according to the data locality principle; and choosing the execution nodes of the operations other than column data extraction using a greedy algorithm.

Optionally, the states of each subtask include a waiting-to-execute state, a computing state, a distributing-data state, an execution-finished state and an execution-failed state.

Optionally, the initial state of the current subtask is the waiting-to-execute state. After the slave query engine hosting the current subtask has received the intermediate data produced by all predecessor subtasks of the current subtask, the state of the current subtask is changed to the computing state. After the current subtask finishes computing, its state is changed to the distributing-data state, and the intermediate data produced by the computation is sent to the slave query engine hosting the successor subtask. If the intermediate data is sent successfully, the state of the current subtask is changed to the execution-finished state. If a failure occurs between the waiting-to-execute state and the computing state, between the computing state and the distributing-data state, or between the distributing-data state and the execution-finished state, the state of the current subtask is changed to the execution-failed state. Whenever the state of the current subtask changes, the master query engine is notified asynchronously.
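The five states and their transitions can be sketched as a small state machine. This is a minimal illustration, not the patented implementation: the class and member names are hypothetical, and a plain synchronous callback stands in for the asynchronous notification to the master query engine.

```python
from enum import Enum


class State(Enum):
    WAITING = "waiting to execute"
    COMPUTING = "computing"
    DISTRIBUTING = "distributing data"
    FINISHED = "execution finished"
    FAILED = "execution failed"


# Normal progression; FINISHED and FAILED are terminal states.
NEXT = {
    State.WAITING: State.COMPUTING,
    State.COMPUTING: State.DISTRIBUTING,
    State.DISTRIBUTING: State.FINISHED,
}


class Subtask:
    def __init__(self, name, notify):
        self.name = name
        self.state = State.WAITING
        self._notify = notify  # hypothetical stand-in for notifying the master

    def _set(self, state):
        self.state = state
        self._notify(self.name, state)  # every state change is reported

    def advance(self):
        """Move to the next state of the normal progression."""
        if self.state not in NEXT:
            raise ValueError(f"{self.name} is already in a terminal state")
        self._set(NEXT[self.state])

    def fail(self):
        """A fault between any two consecutive states leads to FAILED."""
        if self.state not in NEXT:
            raise ValueError(f"{self.name} is already in a terminal state")
        self._set(State.FAILED)


events = []
ok = Subtask("Join1-1", lambda name, state: events.append((name, state)))
for _ in range(3):
    ok.advance()

broken = Subtask("Join1-2", lambda name, state: events.append((name, state)))
broken.advance()
broken.fail()
```

Keeping the legal transitions in one table makes it easy to reject any change out of a terminal state.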

Optionally, the intermediate data transmitted between slave query engines is compressed column data. In a traditional database execution engine, intermediate data takes the form of tables and is stored by row. In most analytical workloads, however, users care about only a few attributes of a relation table; row storage loads attribute data the user does not care about during computation and thus wastes memory. Column storage solves this problem well.

Optionally, the compression processing includes bit compression and dictionary compression. Using dictionary compression together with bit compression further reduces memory overhead and improves memory utilization.

Compared with the prior art, the present invention has the following advantages and beneficial effects:

The query engine system and query method for a distributed in-memory columnar database provided by the present invention improve overall computational efficiency by scheduling the execution of each subtask asynchronously: the query plan is divided into several subtasks with dependencies among them, each subtask is distributed to the task queue of the corresponding slave query engine, and the slave query engine executes the subtasks in its task queue in turn. Further, the data transmitted between slave query engines is compressed column data, which solves the memory waste caused by row storage loading attribute data that the user does not care about during computation.

Brief description of the drawings

The drawings described here provide a further understanding of the embodiments of the present invention and form a part of this application; they do not limit the embodiments of the present invention. In the drawings:

Fig. 1 is a schematic diagram of an SQL query plan and the task execution flow of HIVE;

Fig. 2 is a schematic diagram of an SQL query plan and the task execution flow of Spark-SQL;

Fig. 3 is a partial structural diagram of the query engine system of the distributed in-memory columnar database according to an embodiment of the present invention;

Fig. 4 is a schematic diagram of an SQL query plan according to an embodiment of the present invention;

Fig. 5 is a schematic diagram of the task execution flow according to an embodiment of the present invention;

Fig. 6 is the execution state transition diagram of a subtask according to an embodiment of the present invention;

Fig. 7 is a schematic diagram of the data transmitted between slave query engines according to an embodiment of the present invention.

Detailed description

To make the objectives, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to an embodiment and the drawings. The exemplary embodiment and its description serve only to explain the invention and do not limit it.

Embodiment

This embodiment provides a query engine system for a distributed in-memory columnar database. The query engine system includes a resource management module, at least one master query engine and at least one slave query engine.

Specifically, the master query engine parses the SQL statement and converts it into a query plan, divides the query plan into at least two subtasks and distributes them to the slave query engines for execution, and is responsible for monitoring and scheduling the execution of the query plan and for fault-tolerant processing. As in the prior art, the query plan is represented as a tree structure. The slave query engines execute the subtasks distributed by the master query engine, and the resource management module is responsible for managing and allocating system resources. Further, the system resources include CPU computing resources and memory resources. Fig. 3 is a partial structural diagram of the query engine system of this embodiment, in which master query engine 31 corresponds to three slave query engines: slave query engine 32, slave query engine 33 and slave query engine 34.

This embodiment also provides a query method for a distributed in-memory columnar database, based on the above query engine system, comprising:

Step S1: the resource management module designates a master query engine to be responsible for the session with the user. Specifically, when a user has a query request, the resource management module creates a master query engine in the resource pool to be responsible for the session with that user.

Step S2: the master query engine converts the SQL statement sent by the user into a query plan. The master query engine converts the SQL statement into a query plan through lexical parsing, syntactic parsing and rule-based query optimization. As in the prior art, the query plan is represented as a tree structure.

Step S3: the resource management module allocates slave query engines to the master query engine and establishes communication between the slave query engines and the master query engine. After the SQL statement has been converted into a query plan, the master query engine applies to the resource management module for computing resources; the resource management module allocates slave query engines to the master query engine and establishes network connections between the slave query engines and the master query engine.
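The bookkeeping behind steps S1 and S3 can be sketched as follows. This is only an illustration under stated assumptions: the `ResourceManager` class, its method names and the `Master-QE` naming scheme are hypothetical, and the network connections are reduced to entries in a dictionary.

```python
class ResourceManager:
    """Toy resource manager: tracks sessions and hands out slave query engines."""

    def __init__(self, slave_ids):
        self.free_slaves = list(slave_ids)  # slave query engines not yet allocated
        self.sessions = {}                  # user -> master query engine id
        self.connections = {}               # master id -> slave ids connected to it
        self._masters_created = 0

    def open_session(self, user):
        """Step S1: create a master query engine for the session with this user."""
        self._masters_created += 1
        master = f"Master-QE{self._masters_created}"  # hypothetical naming scheme
        self.sessions[user] = master
        self.connections[master] = []
        return master

    def allocate_slaves(self, master, count):
        """Step S3: grant slave query engines and record the master-slave links."""
        if count > len(self.free_slaves):
            raise RuntimeError("not enough free slave query engines")
        granted = self.free_slaves[:count]
        self.free_slaves = self.free_slaves[count:]
        self.connections[master].extend(granted)
        return granted


manager = ResourceManager(["Slave-QE1", "Slave-QE2", "Slave-QE3"])
master = manager.open_session("alice")
slaves = manager.allocate_slaves(master, 2)
```

A real resource manager would also account for the CPU and memory of each node; the sketch tracks only which engines are free.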

Step S4: the master query engine divides the query plan into at least two subtasks and assigns a slave query engine to each subtask. Because the query engine of this embodiment targets a distributed in-memory columnar database, data tables are stored by column in the distributed columnar database, and each column is cut into several slices according to value range. For this characteristic, this embodiment abstracts several physical operators, each representing one specific subtask of the query plan. The physical operators include at least one of a column data extraction operation, a join operation, a condition filter operation, a grouping operation, an aggregate function operation, a sort operation, and an operation that restores the final result data into a row table.

Column data extraction operation: the GetColumn operator extracts the data of a given column from the columnar database. A GetColumn operator may carry an additional restriction; for example, GetColumn(Teacher.age) with the condition Teacher.age > 1 extracts the age column of the Teacher table and keeps only the values greater than 1.

Join operation: the Join operator executes Join operations, including Left Join, Right Join, Full Join and so on.

Condition filter operation: the Filter operator executes condition filtering, mainly logical operations such as AND and OR.

Grouping operation: the GroupBy operator executes grouping and implements the GroupBy keyword of SQL statements.

Aggregate function operation: the AGG operator covers common database operations such as Max (maximum) and Avg (average).

Sort operation: the Order operator sorts the columns that need to be sorted.

Operation that restores the final result data into a row table: the BuildRow operator restores the final result data of the columnar database into a row table that users can understand, so that the final result is presented to the user as a relation table.
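How three of these operators could cooperate on column-wise data is sketched below. It is a simplified illustration, not the patented implementation: the function names, the representation of a column as a list of (row id, value) pairs, and the sample Teacher data are all assumptions.

```python
def get_column(table, column, predicate=None):
    """GetColumn: return (row_id, value) pairs of one column, optionally filtered."""
    return [
        (row_id, value)
        for row_id, value in enumerate(table[column])
        if predicate is None or predicate(value)
    ]


def filter_and(left, right):
    """Filter with AND: keep the row ids that appear in both inputs."""
    right_ids = {row_id for row_id, _ in right}
    return [row_id for row_id, _ in left if row_id in right_ids]


def build_row(table, row_ids, columns):
    """BuildRow: turn the selected row ids back into row tuples."""
    return [tuple(table[c][row_id] for c in columns) for row_id in row_ids]


# Column-wise storage of a small Teacher table (sample data is made up).
teacher = {
    "name": ["Ann", "Bob", "Cy"],
    "age": [0, 35, 42],
    "dept": ["cs", "cs", "ee"],
}

ages = get_column(teacher, "age", lambda age: age > 1)
depts = get_column(teacher, "dept", lambda dept: dept == "cs")
rows = build_row(teacher, filter_and(ages, depts), ["name", "age"])
print(rows)
```

Only the age and dept columns are scanned, and the name column is touched just once, when BuildRow assembles the final rows.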

For example, for the SQL statement SELECT c_custkey FROM customer JOIN nation ON customer.C_NATIONKEY = nation.N_NATIONKEY JOIN lineitem ON lineitem.L_PARTKEY = customer.C_CUSTKEY, the query plan generated by the master query engine's parsing is shown in Fig. 4, and the subtasks into which it is divided are shown in Fig. 5, which involves six slave query engines: Slave-QE1, Slave-QE2, Slave-QE3, Slave-QE4, Slave-QE5 and Slave-QE6.

Assume each column has two slices; then there is one GetColumn operator for each slice of each column. Because each slice of a column covers a value range, a Join operator based on that range can also be generated for each slice. Referring to Fig. 5, the Join1 node represents the equi-join of columns L_PARTKEY and C_CUSTKEY. In the actual subtasks, Join1 is split into two concrete physical operators, Join1-1 and Join1-2, which are responsible for the equi-join over the value ranges 1-100 and 101-150 respectively. Likewise, Join2 in the query plan is split into two concrete Join operators.
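The splitting of a logical Join into one physical operator per slice can be sketched as below. The pairing rule (each slice of the left column is joined with every right-hand slice whose value range overlaps it) is an assumption made for this sketch, as are the function name `split_join` and the second right-hand slice (151, 300).

```python
def split_join(name, left_slices, right_slices):
    """Create one physical join per left slice, paired with overlapping right slices."""
    operators = []
    for number, (low, high) in enumerate(left_slices, start=1):
        # Two closed ranges overlap when each one starts before the other ends.
        partners = [r for r in right_slices if r[0] <= high and low <= r[1]]
        operators.append(
            {"name": f"{name}-{number}", "range": (low, high), "right_slices": partners}
        )
    return operators


# Value ranges of the L_PARTKEY slices, and assumed ranges for C_CUSTKEY.
l_partkey_slices = [(1, 100), (101, 150)]
c_custkey_slices = [(1, 150), (151, 300)]

join1_ops = split_join("Join1", l_partkey_slices, c_custkey_slices)
for op in join1_ops:
    print(op)
```

Under these assumptions the sketch yields Join1-1 for the range 1-100 and Join1-2 for the range 101-150, and neither operator needs the slice that starts at 151.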

Further, in this embodiment the master query engine assigns a slave query engine to each subtask according to a cost model. Specifically, the master query engine obtains, from the metadata of the slave query engines, the IP of the node where each slave query engine resides and the database table and column information stored on that node. The execution node IP of each column data extraction operation in the query plan is assigned according to the data locality principle. For example, in Fig. 5 the physical node of Slave-QE1 stores a slice of the L_PARTKEY column, so the GetColumn operator for that slice is assigned to the physical node of Slave-QE1. By analogy, the GetColumn operator of every slice is executed on the node that holds the corresponding data. The execution nodes of non-GetColumn operators are chosen with a greedy algorithm: the execution node of a non-GetColumn operator is chosen from among the execution nodes of its child operators, by computing the execution cost on the physical node of each child operator and selecting the physical node with the smallest cost. The basic cost formula is: execution cost = network cost + computation cost = inter-node network load × transmitted data volume + node task load × computed data volume. In Fig. 5, the Join1-1 operator can execute either on Slave-QE1 or on Slave-QE3. Slave-QE1 is selected because the execution cost of Join1-1 computed for the Slave-QE1 node is smaller than that computed for the Slave-QE3 node, so the final physical execution node is Slave-QE1.
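The greedy choice can be sketched directly from the cost formula. The load factors and data volumes below are invented for illustration, and the function names are hypothetical; the sketch also assumes that data already on the candidate node incurs no network cost.

```python
def execution_cost(node, inputs, network_load, task_load):
    """Execution cost = network cost + computation cost for running on `node`.

    inputs: list of (source_node, data_volume) pairs, one per child operator.
    """
    network_cost = sum(
        network_load[(source, node)] * volume
        for source, volume in inputs
        if source != node  # local data needs no transfer (assumption)
    )
    compute_cost = task_load[node] * sum(volume for _, volume in inputs)
    return network_cost + compute_cost


def choose_node(inputs, network_load, task_load):
    """Greedy step: pick the cheapest node among the child operators' nodes."""
    candidates = sorted({source for source, _ in inputs})
    return min(
        candidates,
        key=lambda node: execution_cost(node, inputs, network_load, task_load),
    )


# Join1-1 reads a large slice held by Slave-QE1 and a smaller one on Slave-QE3.
inputs = [("Slave-QE1", 100), ("Slave-QE3", 60)]
network_load = {("Slave-QE1", "Slave-QE3"): 1.0, ("Slave-QE3", "Slave-QE1"): 1.0}
task_load = {"Slave-QE1": 0.5, "Slave-QE3": 0.5}

cost_qe1 = execution_cost("Slave-QE1", inputs, network_load, task_load)
cost_qe3 = execution_cost("Slave-QE3", inputs, network_load, task_load)
chosen = choose_node(inputs, network_load, task_load)
print(cost_qe1, cost_qe3, chosen)
```

With equal task loads the decision reduces to moving the smaller input, so the node that already holds the larger slice is chosen.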

Step S5: each slave query engine adds its subtasks to a task queue, executes the current subtask after all predecessor subtasks of the current subtask have finished, transmits the intermediate data produced by the current subtask to the slave query engine hosting the successor subtask, and sends the completion status of the current subtask to the master query engine. Specifically, each subtask has five states: waiting to execute, computing, distributing data, execution finished and execution failed. Each subtask also maintains its list of predecessor subtasks and its list of successor subtasks. The execution state transition diagram of a subtask is shown in Fig. 6.

Taking the Join1-1 operator shown in Fig. 4 as an example, its predecessor operator list is GetColumn(L_PARTKEY Slice 1 [1-100]) and GetColumn(C_CUSTKEY Slice 1 [1-150]), and its successor operator list is the Join2-1 operator. The initial state of Join1-1 is waiting to execute. After the physical node hosting Join1-1 has received the data sent by all its predecessor operators, the state of Join1-1 changes to computing. After Join1-1 finishes computing, its state changes to distributing data, and the result data is sent over the network to the slave query engine hosting the successor operator. Once the data has been sent successfully, the operator's task is complete. If a failure occurs at any step, that is, between waiting to execute and computing, between computing and distributing data, or between distributing data and execution finished, the operator state is set to execution failed. Every time the state of Join1-1 changes, the current operator state is reported to the master query engine in real time. The operators execute independently of one another: whenever the state of an operator changes during execution, the master query engine is notified asynchronously, and the result data is pushed to the physical execution node hosting the successor operator. In this way, whether a subtask executes depends entirely on whether its predecessor subtasks have completed, and tasks need not be executed stage by stage as in Spark or MapReduce.
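The dependency-driven execution described above can be simulated in a few lines. This is a single-threaded sketch under stated assumptions: `run_plan` and `compute` are hypothetical names, the key values are made up, one shared queue replaces the per-engine task queues, and the `log` list stands in for the status reports sent to the master query engine.

```python
from collections import deque


def run_plan(predecessors, compute):
    """Run each subtask as soon as all of its predecessors have delivered data."""
    successors = {task: [] for task in predecessors}
    for task, preds in predecessors.items():
        for pred in preds:
            successors[pred].append(task)

    inbox = {task: {} for task in predecessors}  # data pushed by predecessors
    ready = deque(task for task, preds in predecessors.items() if not preds)
    results, log = {}, []

    while ready:
        task = ready.popleft()
        log.append((task, "computing"))
        results[task] = compute(task, inbox[task])
        log.append((task, "distributing data"))
        for successor in successors[task]:
            inbox[successor][task] = results[task]  # push the intermediate data
            if len(inbox[successor]) == len(predecessors[successor]):
                ready.append(successor)             # all inputs have arrived
        log.append((task, "execution finished"))
    return results, log


predecessors = {
    "GetColumn(L_PARTKEY Slice 1)": [],
    "GetColumn(C_CUSTKEY Slice 1)": [],
    "Join1-1": ["GetColumn(L_PARTKEY Slice 1)", "GetColumn(C_CUSTKEY Slice 1)"],
}
column_data = {
    "GetColumn(L_PARTKEY Slice 1)": [5, 17, 42],
    "GetColumn(C_CUSTKEY Slice 1)": [17, 42, 99],
}


def compute(task, received):
    if task in column_data:                  # leaf operator: read local data
        return column_data[task]
    left, right = (received[p] for p in predecessors[task])
    return sorted(set(left) & set(right))    # equi-join on the key values


results, log = run_plan(predecessors, compute)
print(results["Join1-1"])
```

No stage boundary appears anywhere in the loop: a subtask becomes runnable the moment its own inbox is complete.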

In step S5, the intermediate data transmitted between slave query engines is compressed column data, and the compression includes bit compression and dictionary compression. Taking the data structure shown in Fig. 7 as an example, the intermediate data consists of three vectors: a dictionary vector, an offset vector and a position vector. The dictionary vector is obtained by sorting the original data and then removing duplicates; discarding the redundant data saves memory. The offset vector and the position vector both store integers, so bit compression is applied to them. On a computer, an INT occupies four bytes, that is, 32 bits, and can represent values from -2147483648 to 2147483647. For the offset vector and position vector shown in Fig. 7, the maximum integer in each vector can be determined, so in most cases storing one number does not need 32 bits. Assuming the maximum integer in the offset vector or position vector is A, the number of bits used to store one number is log2 A rounded up. Compared with the conventional use of INT or LONG variables to store integers, this saves more memory.
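The two compression steps can be sketched as follows. The sketch does not reproduce the three-vector layout of Fig. 7; it uses a single vector of dictionary codes, and all function names and sample values are assumptions. The width is computed with Python's `int.bit_length`, which equals log2 A rounded up except when A is an exact power of two, where one additional bit is needed to represent A itself.

```python
def dictionary_encode(values):
    """Dictionary compression: sorted distinct values plus one code per input value."""
    dictionary = sorted(set(values))
    code_of = {value: code for code, value in enumerate(dictionary)}
    return dictionary, [code_of[value] for value in values]


def bit_width(max_value):
    """Bits needed for integers in the range 0..max_value (at least one bit)."""
    return max(1, max_value.bit_length())


def pack(codes, width):
    """Bit compression: store every code in `width` bits of one large integer."""
    packed = 0
    for position, code in enumerate(codes):
        packed |= code << (position * width)
    return packed


def unpack(packed, width, count):
    """Recover `count` codes of `width` bits each."""
    mask = (1 << width) - 1
    return [(packed >> (position * width)) & mask for position in range(count)]


column = ["CN", "US", "CN", "DE", "US", "CN"]  # sample column values (made up)
dictionary, codes = dictionary_encode(column)
width = bit_width(max(codes))
packed = pack(codes, width)
restored = [dictionary[code] for code in unpack(packed, width, len(codes))]

print(dictionary, codes, width)
print(len(codes) * width, "bits instead of", len(codes) * 32)
```

For this sample the six codes occupy 12 bits in total, compared with 192 bits when each code is held in a 32-bit INT.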

Step S6: after the entire query plan has completed, the master query engine notifies the client to obtain the final result data from the slave query engine. At this point the whole query is complete.

The specific embodiment described above explains the purpose, technical solutions and beneficial effects of the present invention in further detail. It should be understood that the foregoing is only a specific embodiment of the present invention and is not intended to limit its scope of protection. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A query method for a distributed in-memory columnar database, characterized in that the method is based on a query engine system of the distributed in-memory columnar database, wherein:

the query engine system of the distributed in-memory columnar database includes a resource management module, at least one master query engine and at least one slave query engine;

the master query engine is configured to convert an SQL statement into a query plan, divide the query plan into at least two subtasks, and monitor and schedule the execution of the query plan;

the slave query engine is configured to execute the subtasks distributed by the master query engine;

the resource management module is responsible for managing and allocating system resources;

the method includes: the resource management module designates a master query engine to be responsible for the session with the user;

the master query engine converts the SQL statement sent by the user into a query plan;

the resource management module allocates slave query engines to the master query engine and establishes communication between the slave query engines and the master query engine;

the master query engine divides the query plan into at least two subtasks and assigns a slave query engine to each subtask;

each slave query engine adds its subtasks to a task queue, executes the current subtask after all predecessor subtasks of the current subtask have finished, transmits the intermediate data produced by the current subtask to the slave query engine hosting the successor subtask, and sends the completion status of the current subtask to the master query engine;

after the entire query plan has completed, the master query engine notifies the client to obtain the final result data from the slave query engine;

each subtask is represented by a physical operator, and the physical operators include at least one of a column data extraction operation, a join operation, a condition filter operation, a grouping operation, an aggregate function operation, a sort operation, and an operation that restores the final result data into a row table;

the states of each subtask include a waiting-to-execute state, a computing state, a distributing-data state, an execution-finished state and an execution-failed state;

the initial state of the current subtask is the waiting-to-execute state; after the slave query engine hosting the current subtask has received the intermediate data produced by all predecessor subtasks of the current subtask, the state of the current subtask is changed to the computing state; after the current subtask finishes computing, its state is changed to the distributing-data state, and the intermediate data produced by the computation is sent to the slave query engine hosting the successor subtask; if the intermediate data is sent successfully, the state of the current subtask is changed to the execution-finished state; if a failure occurs between the waiting-to-execute state and the computing state, between the computing state and the distributing-data state, or between the distributing-data state and the execution-finished state, the state of the current subtask is changed to the execution-failed state; whenever the state of the current subtask changes, the master query engine is notified asynchronously; the master query engine assigns a slave query engine to each subtask according to a cost model; and assigning a slave query engine to each subtask according to the cost model includes:

obtaining, from the metadata of the slave query engines, the IP of the node where each slave query engine resides and the database table and column information stored on that node;

assigning the execution node IP of each column data extraction operation in the query plan according to the data locality principle;

choosing the execution nodes of the operations other than column data extraction using a greedy algorithm.

2. The query method for a distributed in-memory columnar database according to claim 1, characterized in that the intermediate data transmitted between slave query engines is compressed column data.

3. The query method for a distributed in-memory columnar database according to claim 2, characterized in that the compression processing includes bit compression and dictionary compression.