
CN105183901A - Method and device for reading database table through data query engine

Info

Publication number: CN105183901A
Application number: CN201510640513.3A
Authority: CN
Inventors: 郭李明
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Priority date: 2015-09-30
Filing date: 2015-09-30
Publication date: 2015-12-23

Abstract

An embodiment of the invention discloses a method and device for reading a database table through a data query engine. The method comprises: after a logical execution plan is generated, setting segmentation information for the database table corresponding to the execution plan; splitting the corresponding database table into at least two data fragments according to the segmentation information; and packaging the at least two data fragments into respective execution tasks and dispatching the execution tasks to different computing nodes for data reading. With this technical scheme, once a query request has been answered, the data query engine processes the data reading phase in parallel across multiple threads. This shortens the data reading phase, improves the query performance and overall query speed of the data query engine, and raises its throughput and the user experience.

Description

Method and device for reading a database table through a data query engine

Technical field

Embodiments of the present invention relate to interactive data query engine technology, and in particular to a method and device for reading a database table through a data query engine.

Background art

Presto is an open-source product from Facebook and a currently popular distributed interactive query engine, suited to interactive analysis and querying of big data. The Presto data query engine defines a general-purpose JDBC connection module that, with suitable adaptation, can connect to various relational databases and query them quickly; common relational databases include Oracle, SQL Server and MySQL. Fig. 1 is a schematic diagram of the query flow after a Presto client 10 submits a query request to the server side. As Fig. 1 shows, the Presto server side consists of a management node 11 and computing nodes 12. The query process is as follows: when the Presto data query engine receives a query request sent by the client 10, the management node 11 first parses the request and generates a logical execution plan; the scheduling node 13 then schedules the plan and assigns it to specific computing nodes; and the computing nodes 12 then carry out the query processing.

It follows that the data reading phase of Presto is the initial phase of the whole query process and the source of the data for all subsequent processing, so the speed of data reading largely determines the speed of the whole query. In the existing Presto data query engine, a database table corresponds to only one computing node when it is read; that is, only one computing node runs the execution task that reads the corresponding database table. If the table holds a large volume of data, the computing node takes a long time to read it, and the node's own memory, which stores the data, is at risk of overflowing. Presto is a query engine designed to respond quickly enough for interactive use, so if the data reading phase takes too long, the query response time of Presto as a whole suffers. Clearly, when a relational database is connected as the data source, having a single computing node process a single table departs sharply from Presto's design intent and degrades the user experience.

Summary of the invention

The invention provides a method and device for reading a database table through a data query engine, so as to shorten the data reading phase of the data query engine, improve query performance, and thereby increase the overall query speed of the data query engine while also raising its throughput.

In one aspect, an embodiment of the present invention provides a method for reading a database table through a data query engine, comprising:

after a logical execution plan is generated, setting segmentation information for the database table corresponding to the execution plan;

splitting the corresponding database table into at least two data fragments according to the segmentation information; and

packaging the at least two data fragments into respective execution tasks, and dispatching the execution tasks to different computing nodes for data reading.

In another aspect, an embodiment of the present invention provides a device for reading a database table through a data query engine, the device being integrated in the data query engine and comprising:

a segmentation information setting module, configured to set, after a logical execution plan is generated, segmentation information for the database table corresponding to the execution plan;

a data fragment generation module, configured to split the corresponding database table into at least two data fragments according to the segmentation information; and

a data fragment scheduling module, configured to package the at least two data fragments into respective execution tasks and dispatch the execution tasks to different computing nodes for data reading.

With the method and device provided in the embodiments of the present invention, when the data query engine responds to a query request, it parses the syntax and generates a logical execution plan, sets segmentation information for the database table corresponding to the execution plan, splits that table into at least two data fragments according to the segmentation information, and finally dispatches the data fragments to different computing nodes for data reading. The method gives the data query engine multi-threaded parallel processing of the data reading phase and shortens that phase, which improves the query performance and overall query speed of the data query engine and raises its throughput and the user experience.

Brief description of the drawings

Fig. 1 is a schematic structural diagram of the Presto data query engine in the prior art;

Fig. 2 is a schematic flowchart of a method for reading a database table through a data query engine according to Embodiment 1 of the present invention;

Fig. 3 is a schematic flowchart of a method for reading a database table through a data query engine according to Embodiment 2 of the present invention;

Fig. 4A is a schematic flowchart of a method for reading a database table through a data query engine according to Embodiment 3 of the present invention;

Fig. 4B is a schematic structural diagram of the Presto application scenario in the method for reading a database table through a data query engine according to Embodiment 3 of the present invention;

Fig. 5 is a structural block diagram of a device for reading a database table through a data query engine according to Embodiment 4 of the present invention.

Detailed description of the embodiments

The present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present invention rather than the entire structure.

Embodiment 1

Fig. 2 is a schematic flowchart of a method for reading a database table through a data query engine according to Embodiment 1 of the present invention. The method can be performed by a device for reading a database table through a data query engine, the device being integrated in the data query engine as a part of it. As shown in Fig. 2, the method of Embodiment 1 comprises the following operations:

Step 201: after a logical execution plan is generated, set segmentation information for the database table corresponding to the execution plan.

In this embodiment, the logical execution plan may be generated as follows: after the server side of the data query engine responds to a query request from the client, the management node of the server side first parses the Structured Query Language (SQL) statement of the query request and generates a logical execution plan from the parsed content. The logical execution plan can then be distributed as query tasks in the scheduling step of the management node. By way of example, the content of the logical execution plan can be understood as the query task to be carried out, such as querying the number of goods orders submitted in Beijing between 2015.08.23 and 2015.08.25, or querying the number of successful transactions for a given commodity in August 2015.

In this embodiment, the database table corresponding to the execution plan is the table in the relational database that the execution plan is to query; this table contains the data the execution plan needs. Setting the segmentation information for the corresponding database table means setting the information on which the splitting of that table is based. The segmentation information can be set according to the attributes that the execution plan is to query.

Further, setting the segmentation information for the database table corresponding to the execution plan comprises:

setting, as the segmentation information, the split field of the corresponding database table, the overall split-field minimum, the overall split-field maximum, and the number of computing nodes that read data.

In this embodiment, the split field is a particular attribute of the corresponding database table. The attribute is of integer type and can be generated by auto-increment from the data in the table; examples are numeric values that are easy to compute with, such as a time, a month or a day count. The number of computing nodes that read data is the number of computing nodes taking part in data reading for the execution plan.
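The four items that make up the segmentation information can be pictured as a small record. The following Python sketch is only an illustration: the class name `SegmentationInfo`, its field names, the validation rules and the example values are assumptions made here and do not come from the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentationInfo:
    """Hypothetical container for the four items of segmentation information."""

    split_field: str        # name of the integer-typed column used for splitting
    min_value: int          # overall split-field minimum
    max_value: int          # overall split-field maximum
    reader_node_count: int  # number of computing nodes that read data

    def __post_init__(self):
        # Assumed sanity checks; the source text does not specify any validation.
        if self.reader_node_count < 1:
            raise ValueError("at least one reading node is required")
        if self.min_value > self.max_value:
            raise ValueError("minimum must not exceed maximum")


# Example values are invented for illustration.
info = SegmentationInfo("order_id", 1, 100, 4)
```

Keeping the four values together in one immutable record makes it straightforward to hand the same information to whichever component later performs the split.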

In this embodiment, the overall split-field minimum and maximum are the start and end values, along the split field, of the range to be queried in the database table corresponding to the execution plan. When the segmentation information is set, these two values are obtained mainly through communication between the data query engine and the database: once the data query engine has established a connection with the database, it obtains them with an SQL statement. By way of example, the statement that obtains the split-field maximum and minimum can be written as: SELECT MAX(SPLITFIELD), MIN(SPLITFIELD) FROM table_name, where SPLITFIELD is the attribute selected as the split field, table_name is the name of the database table corresponding to the execution plan, MAX(SPLITFIELD) yields the overall split-field maximum, and MIN(SPLITFIELD) yields the overall split-field minimum.
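As a rough illustration of this bounds query, the sketch below runs the same MAX/MIN statement against an in-memory SQLite database. SQLite merely stands in for the relational database here, and the table `orders`, the column `order_id`, and the helper `fetch_split_bounds` are hypothetical names chosen for the example.

```python
import sqlite3


def fetch_split_bounds(conn, table_name, split_field):
    """Return (minimum, maximum) of the split field using one MAX/MIN query.

    Identifiers cannot be bound as SQL parameters, so this sketch assumes
    table_name and split_field come from trusted configuration.
    """
    sql = f"SELECT MAX({split_field}), MIN({split_field}) FROM {table_name}"
    max_value, min_value = conn.execute(sql).fetchone()
    return min_value, max_value


# Build a small stand-in table with an auto-increment style integer key.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, city TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(i, "Beijing") for i in range(1, 101)],
)

bounds = fetch_split_bounds(conn, "orders", "order_id")
```

With the 100 rows inserted above, `bounds` holds the pair of values that would be stored as the overall split-field minimum and maximum.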

Step 202: split the corresponding database table into at least two data fragments according to the segmentation information.

In existing data query engines, the management node schedules an execution plan as a single reading task on one computing node. That computing node is responsible for reading the data of the entire database table corresponding to the execution plan and stores the data it reads in its own memory. If the table holds too much data, the computing node spends a long time reading it, and its memory is also at risk of overflowing.

In this embodiment, the database table corresponding to the execution plan can be divided into at least two data fragments according to the segmentation information. A data fragment is a partial table within the database table and carries part of the table's data.

Step 203: package the at least two data fragments into respective execution tasks, and dispatch the execution tasks to different computing nodes for data reading.

In a data query engine, the tasks a computing node performs are generally data reading tasks and data processing tasks. All execution tasks are processed by computing nodes, and the specific tasks each computing node is responsible for are assigned mainly by the scheduling step of the management node.

In this embodiment, the execution task is the data reading task assigned to a computing node. What a data reading task encapsulates is no longer the whole database table of the execution plan but a single data fragment. That is, the data fragments of one database table are packaged into separate data reading tasks, which are then dispatched to different computing nodes for reading. The read operations on the different computing nodes proceed concurrently, which saves time in the data reading phase of the whole query. Furthermore, while it is running a reading task, a computing node can still receive other execution tasks scheduled by the management node; a computing node can take part in several tasks at once, so using many computing nodes for parallel reading does not hold up the processing of other tasks.
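The idea of one reading task per fragment can be sketched in a few lines of Python. This is a simplified model under stated assumptions: worker threads stand in for computing nodes, an in-memory list stands in for the database table, and the names `TABLE`, `read_fragment` and `parallel_read` are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a database table whose split field is order_id (1..100).
TABLE = [{"order_id": i} for i in range(1, 101)]


def read_fragment(fragment):
    """One reading task: return only the rows inside this fragment's range."""
    start, end = fragment
    return [row for row in TABLE if start <= row["order_id"] <= end]


def parallel_read(fragments):
    """Dispatch one reading task per fragment; threads model computing nodes."""
    with ThreadPoolExecutor(max_workers=len(fragments)) as pool:
        # map() returns results in the same order as the input fragments.
        return list(pool.map(read_fragment, fragments))


fragments = [(1, 25), (26, 50), (51, 75), (76, 100)]
results = parallel_read(fragments)
```

Each task holds only its own fragment in memory, which mirrors the point that no single node has to load the whole table.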

With the method provided in Embodiment 1 of the present invention, when the data query engine responds to a query request, it parses the syntax and generates a logical execution plan, sets segmentation information for the database table corresponding to the execution plan, splits that table into at least two data fragments according to the segmentation information, and finally dispatches the data fragments to different computing nodes for data reading. The method gives the data query engine multi-threaded parallel reading and shortens the read time of the data reading phase, which improves query performance and thereby raises the overall query speed and throughput of the data query engine as well as the user experience.

Embodiment 2

Fig. 3 is a schematic flowchart of a method for reading a database table through a data query engine according to Embodiment 2 of the present invention. Embodiment 2 is an optimization of the embodiment above. In this embodiment, the step of splitting the corresponding database table into at least two data fragments is refined as: calculating the difference between the overall split-field maximum and the overall split-field minimum as the total record count of the corresponding database table; dividing the total record count by the number of computing nodes that read data to obtain the data fragment offset; and splitting the database table according to the data fragment offset.

Accordingly, as shown in Fig. 3, the method of this embodiment comprises the following steps:

Step 301: after a logical execution plan is generated, set segmentation information for the database table corresponding to the execution plan.

In this embodiment, the segmentation information is set in the management node on the server side of the data query engine, which determines how the corresponding database table is to be split mainly on the basis of the generated logical execution plan.

Step 302: calculate the total record count of the corresponding database table according to the segmentation information.

In this embodiment, the total record count of the corresponding database table is the total number of rows in the table that hold the data the execution plan needs to query. Because the split field is of integer type, arithmetic can be performed directly on its values, so the total record count is obtained as the difference between the overall split-field maximum and the overall split-field minimum. Obtaining the total record count allows unneeded data in the relevant database table to be discarded before the management node performs the scheduling step, and at the same time fixes the range of the effective data set.

Step 303: calculate the data fragment offset according to the segmentation information and the total record count.

In this embodiment, the data fragment offset is the maximum number of database table rows that a single data fragment contains. Put another way, the data fragment offset is the total record count divided evenly among the computing nodes that read data, as set in the segmentation information. This keeps the amount of data in each fragment roughly equal, so the read workload of each computing node in the data reading phase is roughly equal too.

In this embodiment, the data fragment offset is obtained by dividing the total record count by the number of computing nodes that read data; the quotient is the data fragment offset. The calculation does not need to consider whether the division is exact.
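The two calculations can be written out as a short Python sketch. It follows the text literally in taking the total record count as maximum minus minimum, and it assumes that the quotient is truncated to an integer (floor division), which the text implies but does not state. The function name `fragment_offset` is hypothetical.

```python
def fragment_offset(min_value, max_value, reader_node_count):
    """Compute the data fragment offset from the segmentation information."""
    # Total record count as defined in the text: difference of the bounds.
    total_record_count = max_value - min_value
    # Assumption: the quotient is truncated, so the remainder is ignored.
    return total_record_count // reader_node_count


# Worked example with invented values: bounds 1..100 read by 4 nodes.
offset_even = fragment_offset(1, 100, 4)
# Bounds 0..10 read by 3 nodes; 10 is not divisible by 3.
offset_uneven = fragment_offset(0, 10, 3)
```

In the first case the total record count is 99 and the offset is 24; in the second the total record count is 10 and the offset is 3, with the remainder left to be absorbed by the final fragment.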

Step 304: split the database table according to the data fragment offset.

In this embodiment, once the data fragment offset has been calculated, the database table corresponding to the execution plan can be split on the basis of it.

Further, splitting the database table according to the data fragment offset comprises:

starting from the overall split-field minimum and adding the data fragment offset to obtain the first data fragment;

starting from the end split-field value of the previous data fragment plus 1 and adding the data fragment offset to obtain the current data fragment; and

determining whether the current data fragment contains the overall split-field maximum; if so, stopping the splitting and taking the current data fragment as the final data fragment; if not, repeating the step of determining the current data fragment.

In this embodiment, a data fragment can be regarded as the range of the data set that one computing node may read, and setting these ranges is what splits the database table. The splitting proceeds as follows: starting from the overall split-field minimum, adding the data fragment offset gives the first data fragment; adding 1 to the end position value of the first data fragment and then adding the data fragment offset gives the second data fragment; and so on, each time adding 1 to the end position value of the previous data fragment and then adding the data fragment offset to obtain the current data fragment, until a data fragment contains the overall split-field maximum, at which point splitting stops. The data fragment that contains the overall split-field maximum is the final data fragment. The end position value of a data fragment is its end split-field value.

Further, the overall split-field maximum is less than or equal to the end split-field value of the final data fragment.

In this embodiment, the data fragment offset is the quotient of the total record count divided by the number of computing nodes that read data, and the division may not be exact. In that case the overall split-field maximum is not necessarily the end split-field value of the final data fragment and may be smaller than it. Handling the remainder this way not only simplifies the calculation of the data fragment offset but also prevents data from being missed when the data in the database table grows.
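The splitting steps above can be sketched as a short loop. This is a minimal Python illustration, assuming that the offset is computed with floor division and that each fragment is represented as an inclusive (start, end) pair of split-field values; the function name `split_range` is hypothetical.

```python
def split_range(min_value, max_value, reader_node_count):
    """Split [min_value, max_value] into inclusive (start, end) fragments."""
    # Offset as described in the text; floor division is an assumption.
    offset = (max_value - min_value) // reader_node_count

    fragments = []
    start = min_value
    while True:
        end = start + offset  # the fragment covers start..end inclusive
        fragments.append((start, end))
        if end >= max_value:  # this fragment contains the overall maximum
            break
        start = end + 1       # next fragment begins one past the previous end
    return fragments


# Bounds 1..100 with 4 reading nodes divide evenly into four fragments.
even_split = split_range(1, 100, 4)
# Bounds 0..10 with 3 reading nodes: the last fragment overshoots the maximum.
uneven_split = split_range(0, 10, 3)
```

For the second call the final fragment ends at 11 although the maximum is 10, which is the case the text describes where the overall split-field maximum is smaller than the end split-field value of the final data fragment.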

Step 305: package the at least two data fragments into respective execution tasks, and dispatch the execution tasks to different computing nodes for data reading.

In this embodiment, the database table is split into data fragments before the scheduling step in the management node. In effect, splitting a database table amounts to adding a set of filter conditions to the SQL statement submitted to the database for execution. The filter condition takes the form: the start position of the data fragment is not less than the overall split-field minimum, and the end position of the data fragment is not greater than the overall split-field maximum. After the database table has been filtered into at least two data fragments on the basis of the filter conditions, the scheduling step of the management node packages the data fragments into respective execution tasks and distributes those tasks to different computing nodes, so that the computing nodes read the same data table in parallel.
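One way to picture the per-fragment filter condition is as a range predicate appended to the query each computing node submits. The Python sketch below is an assumption about how such a statement might be built; the source does not give the exact SQL, and the names `fragment_query`, `orders` and `order_id` are hypothetical.

```python
def fragment_query(table_name, split_field, start, end):
    """Build a parameterised query that reads only one data fragment.

    The range bounds are passed as bind parameters; the identifiers are
    assumed to come from trusted configuration.
    """
    sql = (
        f"SELECT * FROM {table_name} "
        f"WHERE {split_field} >= ? AND {split_field} <= ?"
    )
    return sql, (start, end)


# Query and parameters for a fragment covering split-field values 1..25.
sql, params = fragment_query("orders", "order_id", 1, 25)
```

Because every fragment yields a statement of the same shape with different bounds, the database itself does the filtering and each node receives only its own rows.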

With the method provided in Embodiment 2 of the present invention, when the data query engine responds to a query request, it parses the syntax and generates a logical execution plan, sets segmentation information for the database table corresponding to the execution plan, and calculates from the segmentation information the total record count of the table that is relevant to the execution plan. It then calculates the data fragment offset used to split the database table, splits the table into at least two data fragments according to that offset, and finally dispatches the data fragments to different computing nodes for data reading. The method divides up the valid data in the database table and filters out redundant data, enables multi-threaded parallel reading by the data query engine, and shortens the data reading phase. This improves query performance and throughput, speeds up the overall query of the data query engine, and improves the user experience.

Embodiment 3

As shown in Fig. 4A, Embodiment 3 of the present invention provides a preferred embodiment of the method for reading a database table through a data query engine. On the basis of the Presto application scenario shown in Fig. 1, this embodiment adds a configuration module 14 that stores the pre-configured segmentation information of each data table in the database; the application scenario of this embodiment is shown in Fig. 4B. The query process of the application scenario in Fig. 4B is explained below with reference to Fig. 4A, which gives a schematic flowchart of parallel data reading implemented on the Presto data query engine by the method of this embodiment. The method comprises the following steps:

Step 401: the management node receives a data query request sent from the client;

Step 402: the management node parses the syntax of the data query request, generates an execution plan, and sends the execution plan to the scheduling node;

Syntax parsing is the first step performed in the management node of the Presto data query engine and mainly parses the SQL statement of the query request.

Step 403: the scheduling node obtains, from the configuration module and according to the execution task, the segmentation information of the database table corresponding to the execution plan;

Generating the logical execution plan is the second step performed in the management node of the Presto data query engine; after syntax parsing, it produces the corresponding executable logical execution plan. The execution plan corresponds to one database table in the relational database.

Step 404: split the corresponding database table into at least two data fragments according to the segmentation information.

In effect, the method of reading a database table provided by this embodiment obtains the segmentation information from the configuration module before the management node of the data query engine performs the task scheduling step, and then packages the segmentation information into the execution tasks.

Setting the segmentation information means setting the segmentation information of the database table corresponding to the execution plan. The segmentation information comprises the split field, the overall split-field minimum, the overall split-field maximum, and the number of computing nodes that read data. The split field is a particular attribute of the corresponding database table and is of integer type; the overall split-field maximum and minimum are obtained by communicating with the database through an SQL statement.
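The role of the configuration module can be illustrated with a simple lookup keyed by table name. The Python sketch below is purely hypothetical: the source does not describe the storage format, so the dictionary `SPLIT_CONFIG`, its keys, and the function `lookup_split_config` are assumptions made for the example.

```python
# Hypothetical pre-configured entries: one per database table.
SPLIT_CONFIG = {
    "orders": {"split_field": "order_id", "reader_node_count": 4},
    "payments": {"split_field": "payment_id", "reader_node_count": 2},
}


def lookup_split_config(table_name):
    """Return the pre-configured split settings for a table, or None.

    A None result would mean the table has no entry, in which case the
    caller could fall back to reading it on a single node (an assumption).
    """
    return SPLIT_CONFIG.get(table_name)


orders_config = lookup_split_config("orders")
missing_config = lookup_split_config("unknown_table")
```

The split-field minimum and maximum are not stored in this sketch, since the text obtains them from the database at query time.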

In this step, the Presto data query engine splits the corresponding database table into at least two data fragments according to the segmentation information. The splitting proceeds as follows: first, the difference between the overall split-field maximum and the overall split-field minimum is taken as the total record count; then the quotient of the total record count divided by the number of computing nodes that read data is taken as the data fragment offset; finally, starting from the overall split-field minimum, the data fragment offset is added repeatedly until a fragment contains the overall split-field maximum, which splits the database table into multiple data fragments. These data fragments are denoted the first data fragment, the second data fragment, ..., and the final data fragment.

Step 405: the scheduling node packages the at least two data fragments into respective execution tasks and dispatches them to different computing nodes for data reading;

In this step, the data fragments of the database table corresponding to the execution plan are packaged into respective execution tasks, which are then dispatched to different computing nodes to read data, so that the Presto data query engine reads the database table in parallel.

Step 406: each computing node reads the data of its fragment from the data table according to the segmentation information.

After the management node of the Presto data query engine has processed the database table, each computing node in the data reading phase is responsible for reading the data of one data fragment and stores that data in its own memory.

Embodiment 3 of the present invention provides a preferred embodiment of the method for reading a database table through a data query engine. In this preferred embodiment, after responding to a query request, the Presto data query engine splits the database table using the method provided by the invention. This gives the Presto data query engine multi-threaded parallel reading and shortens the read time of the data reading phase, which improves Presto query performance and thereby raises the overall query speed and throughput of the Presto data query engine as well as the user experience.

Embodiment 4

Fig. 5 is a structural block diagram of a device for reading a database table through a data query engine according to Embodiment 4 of the present invention. The device can be integrated in the data query engine as a part of it. As shown in Fig. 5, the device comprises:

a segmentation information setting module 501, configured to set, after a logical execution plan is generated, segmentation information for the database table corresponding to the execution plan;

a data fragment generation module 502, configured to split the corresponding database table into at least two data fragments according to the segmentation information; and

a data fragment scheduling module 503, configured to package the at least two data fragments into respective execution tasks and dispatch the execution tasks to different computing nodes for data reading.

The device provided in Embodiment 4 of the present invention is integrated in the data query engine. Before the scheduling module of the management node of the data query engine distributes execution tasks to the computing nodes, the segmentation information setting module sets the segmentation information of the database table corresponding to the logical execution plan; the data fragment generation module then splits the database table into at least two data fragments on the basis of the segmentation information; and finally the scheduling module packages the data fragments one by one into execution tasks, so that multiple computing nodes read the relevant database table in parallel. With this device, the data query engine can read database data in parallel across multiple threads after responding to a data query request. This speeds up and shortens the data reading phase, improves query performance and throughput, reduces the overall query time of the data query engine, and improves the user experience.

Further, the segmentation information setting module 501 is specifically configured to:

set, as the segmentation information, the split field of the corresponding database table, the overall split-field minimum, the overall split-field maximum, and the number of computing nodes that read data.

Further, the data fragment generation module 502 comprises:

a total record count calculation unit, configured to calculate the difference between the overall split-field maximum and the overall split-field minimum as the total record count of the corresponding database table;

a data fragment offset calculation unit, configured to divide the total record count by the number of computing nodes that read data to obtain the data fragment offset; and

a data fragment splitting unit, configured to split the database table according to the data fragment offset.

Further, the data fragment splitting unit is specifically configured to:

Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the invention is not limited to the specific embodiments described here, and that various obvious changes, readjustments and substitutions can be made without departing from the scope of protection of the invention. Therefore, although the invention has been described in some detail through the above embodiments, it is not limited to them and may include further equivalent embodiments without departing from the inventive concept; the scope of the invention is determined by the appended claims.

Claims

1. A method for reading a database table through a data query engine, characterized by comprising:

2. The method according to claim 1, characterized in that setting the segmentation information for the database table corresponding to the execution plan comprises:

3. The method according to claim 2, characterized in that splitting the corresponding database table into at least two data fragments comprises:

calculating the difference between the overall split-field maximum and the overall split-field minimum as the total record count of the corresponding database table;

dividing the total record count by the number of computing nodes that read data to obtain the data fragment offset; and

splitting the database table according to the data fragment offset.

4. The method according to claim 3, characterized in that splitting the database table according to the data fragment offset comprises:

5. The method according to claim 4, characterized in that the overall split-field maximum is less than or equal to the end split-field value of the final data fragment.

6. A device for reading a database table through a data query engine, integrated in the data query engine, characterized by comprising:

7. The device according to claim 6, characterized in that the segmentation information setting module is specifically configured to:

8. The device according to claim 7, characterized in that the data fragment generation module comprises:

9. The device according to claim 8, characterized in that the data fragment splitting unit is specifically configured to:

10. The device according to claim 9, characterized in that the overall split-field maximum is less than or equal to the end split-field value of the final data fragment.