CN109416683B - Data processing apparatus, database system, and communication operation method of database system

Info

Publication number: CN109416683B
Application number: CN201680084285.9A
Authority: CN
Inventors: 德米特里·谢尔盖耶维奇·科尔马科夫; 亚历山大·弗拉基米罗维奇·斯莱萨连科; 张学仓
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2016-04-05
Filing date: 2016-04-05
Publication date: 2022-04-05
Anticipated expiration: 2036-04-05
Also published as: WO2017176144A1; CN109416683A

Abstract

A data processing apparatus (40) for performing part of the operations of a distributed database system (404). The data processing apparatus (40) comprises: a logical planner (42) for generating a logical plan based on a database query; a physical planner (43) for generating a physical plan based on the logical plan; and a marking unit (44) for determining communication operators within the physical plan, a communication operator being an operator that involves communication, for determining a communication mode of each communication operator based on its operator type, and for tagging each determined communication operator with a data tag that includes the communication mode determined for that operator. The data processing apparatus (40) further comprises: a code generator (45) for generating executable code based on the physical plan and converting the data tags into communicator instructions; a code executor (46) for executing the executable code; and a communicator (47) for communicating with other data processing apparatuses (402 and 403) within the distributed database system (404) based on the communicator instructions.

Description

Data processing apparatus, database system, and communication operation method of database system

Technical Field

The present invention relates to the field of computer software engineering, and more particularly, to distributed database systems.

Background

A distributed database system has a plurality of different nodes, also referred to as data processing devices. When a database query is executed, communication between these different nodes occurs. Especially in database systems with multiple nodes, such communication may become a bottleneck for the database system.

As shown in FIG. 1, an SQL query execution pipeline can be broken up into many steps.

1. Planning: in a first step 12, the query plain text 11 is converted into a logical plan, an intermediate tree-like representation of the query. In a second step 13, the logical plan is optimized and converted into a physical plan, which may in turn be optimized taking data parameters into account. The physical plan consists of physical plan operators, each representing a basic operation performed on a data set through the low-level interface of the database.

2. Code generation: in a third step 14, executable code is generated based on the physical plan. This improves the performance of the database system. This approach is used by some exemplary frameworks for distributed SQL query execution, such as Spark SQL and Cassandra.

3. Executing: in a fourth step 15, the code prepared in the previous step 14 is executed. In the case of a distributed database, it is executed simultaneously on a cluster of workstations connected by a network.

The relationship between the data 20, consisting of the data blocks 21 to 23, and the execution plan 24, which uses a plurality of nodes 25 to 27, is depicted in FIG. 2. In particular, the different data blocks 21 to 23 are stored on different nodes 25 to 27, and interaction takes place only between these nodes 25 to 27. The execution plan 24 yields the results 28.

When executing a distributed query, the data to be processed is spread within the cluster so that each machine stores only a portion of the source data set. Nevertheless, some operations on a data set may require data exchange between cluster nodes, such as an aggregate operation that accumulates a single value based on the entire data set. Such network communication may significantly degrade the performance of the database system when it involves large data sets or is not performed in an optimal manner.

Different SQL physical layer operators may generate network traffic that matches different communication modes. These communication modes are shown in FIGS. 3a to 3c. FIG. 3a shows an end-to-end communication mode between two nodes 30. FIG. 3b shows a multicast communication mode between a plurality of nodes 31, in which the communication originates from one of the nodes 31 and terminates at a plurality of the nodes 31. FIG. 3c shows a many-to-many communication mode between a plurality of nodes 32, in which each node can communicate with every other node.

All of these modes are widely used for distributed query execution. Multicast is used for data replication; the many-to-many communication mode is used for reordering; and the end-to-end mode serves as the basis for all other types of communication. Exemplary solutions for distributed query execution do not distinguish between these modes. However, network performance depends largely on how a communication mode is implemented, and a dedicated transport protocol may perform better for certain specific communication modes.

The exemplary protocol TCP, which may be used at the transport layer, is well suited for end-to-end communication because communication within this protocol is performed over previously created end-to-end connections. Multicast communication modes implemented using TCP are inefficient because the same data should be transmitted multiple times, i.e., over respective connections to each destination, resulting in a large amount of repetitive traffic in the network.

The Spark framework enables broadcast communication by using the BitTorrent application layer protocol, which is still TCP based. This approach can speed up the broadcast, but has several disadvantages:

1. Although some nodes can receive broadcast data from neighboring nodes, which reduces the load on the sending node, this does not solve the problem of duplicated data packets in general. The network still has to transmit as many copies of each packet as there are nodes in the network.

2. The additional protocols acting on top of the transport layer result in extra overhead, which will seriously affect the broadcast performance of small messages.

In contrast, the TIPC transport layer protocol natively supports a multicast communication mode and exhibits significantly better performance than TCP in this mode. However, TIPC has poorer end-to-end performance than TCP, so there is no clear answer to the question of which protocol should be used.

We have found that the exemplary solution suffers from three main problems:

1. The exemplary solution is based on a single network transport protocol that is statically selected by certain parameters. This approach incurs overhead for data exchange in those communication modes that the selected protocol does not handle well.

2. The generic protocol is intended to be applicable to all possible scenarios. Using a generic protocol results in overhead due to protocol commonality even though it is only used for a few usage patterns.

3. Additional logic on top of the transport layer may result in additional overhead.

Disclosure of Invention

It is therefore an object of the present invention to provide an apparatus and method that allows efficient communication within a distributed database.

A first aspect of the invention provides a data processing apparatus for performing part of the operations of a distributed database system. The data processing apparatus includes: a logical planner to generate a logical plan based on a database query; a physical planner to generate a physical plan based on the logical plan; and a marking unit for determining communication operators in the physical plan, a communication operator being an operator that involves communication, for determining a communication mode of each communication operator based on its operator type, and for tagging each determined communication operator with a data tag that includes the communication mode determined for that operator. Further, the data processing apparatus comprises a code generator for generating executable code based on the physical plan and converting the data tags into communicator instructions, and a communicator for communicating with other data processing devices within the distributed database system based on the communicator instructions. The communication task can thus be separated from the conventional database operations, which allows both very efficient communication and very efficient database operation.

In a first implementation form of the first aspect, the database query is an SQL query. This allows the use of readily available database components.

In an implementation form of the first aspect, the marking unit is configured to determine operators for enabling communication within the distributed database system, in particular copy operators as communication operators, and/or map reduce operators, and/or sort operators, and/or reorder join operators, and/or hash join operators, and/or broadcast hash join operators, and/or merge join operators. This allows for very efficient database operation.

In one implementation of the first aspect, the marking unit is configured to distinguish a set of network communication modes based on the communication operators. Additionally or alternatively, the marking unit is configured to determine an end-to-end communication mode for a copy operator, and/or a map reduce operator, and/or a sort operator, and/or a reorder join operator, and/or a hash join operator, and/or a broadcast hash join operator, and/or a merge join operator. Additionally or alternatively, the marking unit is configured to determine a multicast or broadcast communication mode for a copy operator and/or a broadcast hash join operator. Additionally or alternatively, the marking unit is configured to determine a many-to-many communication mode for a reorder join operator, and/or a hash join operator, and/or a merge join operator. Thus, a highly flexible, operator-based selection of communication modes can be achieved.

In a further implementation of the first aspect or the illustrated implementation described above, the communicator is configured to dynamically determine a communication protocol to be used for the respective operator based at least on the communicator instruction. This allows for extremely efficient database operations.

In an implementation form of the preceding implementation form, the data tag further includes the total amount of data to be transmitted by the operator; the communicator is further configured to dynamically determine the communication protocol to be used for each operator based on the total amount of data to be communicated by the operator. This allows for better selection of the most appropriate communication protocol, thereby improving the efficiency of database operation.

In another implementation of the first two implementations, the communicator is configured to communicate based on a communication protocol determined for each operator. This allows for extremely efficient database operations.

In the first aspect or another implementation of the first aspect implementation shown above, the data processing apparatus further includes a storage unit configured to store at least a portion of data stored in the distributed database system. By dividing data between different data processing devices in a distributed database system, extremely efficient database operations are achieved.

In the first aspect or another implementation of the first aspect implementation shown above, the data processing apparatus further comprises a query receiver configured to receive a database query. This allows for processing standardized database queries.

In a further implementation form of the first aspect or of any implementation form above, the communicator is configured to transmit at least a part of the data to be processed to the other data processing devices. As a result, not all data has to be stored in every data processing device, which saves memory space.

A second aspect of the invention provides a database system comprising at least a first data processing device according to the first aspect or any implementation of the first aspect and a second data processing device according to the first aspect or any implementation of the first aspect. The communicator of the first data processing device is configured to communicate with at least the second data processing device based on the determined communicator instruction. Thereby an extremely efficient database system is achieved.

In one implementation form of the second aspect, the database system comprises at least a third data processing device. The communicator of the first data processing device is configured to perform communication with at least a second data processing device and a third data processing device based on the determined communicator instruction. Thereby an extremely efficient database system is achieved.

A third aspect of the invention provides a method for operating a database system comprising a plurality of data processing devices. The method comprises the following steps: generating a logical plan based on a database query; generating a physical plan based on the logical plan; determining communication operators within the physical plan; determining a communication mode of each communication operator based on its operator type; and tagging each determined communication operator with a data tag that includes the communication mode determined for that operator. Further, the method comprises: generating executable code based on the physical plan; converting the data tags into communicator instructions; executing the executable code; and communicating with other data processing devices within the distributed database system based on the communicator instructions. This enables extremely efficient operation of the distributed database system.

A fourth aspect of the invention provides a computer program having a program code for performing the method of the third aspect of the invention when the computer program runs on a computer, thereby enabling extremely efficient database operations.

In summary, to solve the problems identified above, a new method for implementing pipelined network communication for distributed database query execution is proposed. The following steps are advantageously carried out:

1. The planning phase is extended as follows:

Nodes (physical operators) of the query physical plan that involve communication are identified and tagged.

2. The code generation phase is extended as follows:

A data wrapper is added that creates an application-level message according to the data tag added in step 1, comprising: (a) a communication mode identifier; (b) additional service information; (c) the data to be exchanged.

3. The execution phase is extended as follows:

a. The SQL query execution system is extended with a communicator that encapsulates all transport layer protocols to be used at runtime.

b. The communication mode identifier and the additional service information are transmitted to the communicator together with the data.

4. An optimal communication protocol for data exchange within the specified communication mode is dynamically selected.
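The application-level message of step 2 can be pictured with the following minimal Python sketch. It is an illustration only: the names `CommMode`, `AppMessage` and `wrap`, the numeric mode IDs and the `total_bytes` field are assumptions made for this example and are not taken from the described system.

```python
from dataclasses import dataclass
from enum import Enum


class CommMode(Enum):
    """Communication modes named in the description; the numeric IDs are hypothetical."""
    END_TO_END = 1
    MULTICAST = 2
    MANY_TO_MANY = 3


@dataclass
class AppMessage:
    """Application-level message handed to the communicator."""
    mode: CommMode       # (a) communication mode identifier
    service_info: dict   # (b) additional service information
    payload: bytes       # (c) the data to be exchanged


def wrap(payload: bytes, tag: dict) -> AppMessage:
    """Build a message from the data tag attached to an operator during planning."""
    # Everything in the tag except the mode is treated as service information.
    info = {key: value for key, value in tag.items() if key != "mode"}
    # Assumption: if the tag carries no size estimate, use the payload length.
    info.setdefault("total_bytes", len(payload))
    return AppMessage(mode=tag["mode"], service_info=info, payload=payload)


msg = wrap(b"rows...", {"mode": CommMode.MULTICAST, "operator": "broadcast_hash_join"})
```

Keeping the mode identifier and the service information next to the payload means the communicator needs no knowledge of the physical plan when it later chooses a protocol.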

In general, it should be noted that all arrangements, devices, elements, units, means, etc. described in the present application may be implemented by software or hardware elements or any kind of combination thereof. Further, the apparatus may be or may comprise a processor, wherein the functions of the elements, units and devices described in this application may be implemented by one or more processors. All steps performed by the various entities described in this application, as well as the functions described as being performed by the various entities, are intended to mean that the respective entity is adapted or configured to perform the respective steps and functions. Even if, in the following description of specific embodiments, a specific function or step performed by a general entity is not reflected in the description of a specific detailed element of that entity which performs the specific step or function, it should be clear to a skilled person that these methods and functions can be implemented by software or hardware elements or any kind of combination thereof.

Drawings

The invention is explained in detail below with respect to embodiments of the invention and with reference to the drawings, in which:

FIG. 1 illustrates an exemplary database query execution pipeline;

FIG. 2 illustrates exemplary distributed data processing;

FIG. 3a illustrates an exemplary end-to-end communication scheme;

FIG. 3b illustrates an exemplary multicast communication mode;

FIG. 3c illustrates an exemplary many-to-many communication mode;

FIG. 4 shows a first embodiment of a data processing device according to a first aspect of the present invention;

FIG. 5 illustrates an extension of the planning phase employed by the second embodiment of the first aspect of the present invention;

FIG. 6 illustrates an exemplary physical plan generated in a third embodiment of a data processing apparatus according to the first aspect of the present invention;

FIG. 7 illustrates an exemplary physical plan of data tag extensions employed by a fourth embodiment of a data processing apparatus according to the first aspect of the present invention;

FIG. 8 shows details of a fifth embodiment of a data processing device according to the first aspect of the present invention;

FIG. 9 illustrates an exemplary database query execution process employed by a sixth embodiment of a data processing apparatus according to the first aspect of the present invention;

FIG. 10 shows communication over different networks having a plurality of transport layer protocols for use by a seventh embodiment of a data processing device according to the first aspect of the present invention;

FIG. 11 shows details of an eighth embodiment of a data processing device according to the first aspect of the present invention;

FIG. 12 illustrates an exemplary traffic multiplexing between different nodes in a distributed database system;

FIG. 13 illustrates a first embodiment of a method of operation of a database system according to the third aspect of the present invention;

FIG. 14 shows achievable results when using the data processing apparatus according to the first aspect of the invention or the method of operation of the distributed database system according to the third aspect of the invention for an entire test case;

FIG. 15 illustrates the achievable results when using the data processing apparatus according to the first aspect of the invention or the method of operation of the distributed database system according to the third aspect of the invention for a single broadcast hash join operation;

FIG. 16 shows achievable results when using the data processing apparatus according to the first aspect of the invention or the method of operation of the distributed database system according to the third aspect of the invention for processing time (% relative to baseline).

Detailed Description

First, fig. 1 and 2 illustrate the functionality of an exemplary distributed database system. Fig. 3a to 3c depict different communication modes. Different embodiments of the data processing device according to the first aspect of the present invention are described in connection with fig. 4 to 12. Fig. 13 details the functionality of the method according to the third aspect of the invention. Finally, fig. 14 to 16 show the achievable efficiency increase.

Similar entities and reference numerals have been partially omitted in different figures.

The terms and their meanings are as follows:

FIG. 4 shows a first embodiment of a database system 404 of the second aspect of the present invention, comprising a first embodiment of a data processing device 40 of the first aspect of the present invention. The data processing device 40 comprises a query receiver 41, which is connected to a logical planner 42, which is in turn connected to a physical planner 43. The physical planner 43 is connected to a marking unit 44, and the marking unit 44 is connected to a code generator 45. Furthermore, the code generator 45 is connected to a code executor 46, which is in turn connected to a communicator 47. All units 41 to 47 are connected to a control unit 48. Further, the communicator 47 is connected to a network 401, and the network 401 is connected to other data processing apparatuses 402 and 403. The network 401 and the data processing apparatuses 402 and 403 are not part of the data processing apparatus 40, but together with the data processing apparatus 40 constitute the distributed database system 404. The control unit 48 controls the functions of all other units of the data processing device 40.

In the distributed database system, the query receiver 41 receives database queries, in particular SQL queries. A query is processed and passed to the logical planner 42, which generates a logical plan based on the database query. The logical plan is passed to the physical planner 43, which generates a physical plan from the logical plan. The physical plan is passed to the marking unit 44, which determines the communication operators in the physical plan, a communication operator being an operator that involves communication. The marking unit 44 then determines a communication mode of each communication operator based on its operator type. In particular, the marking unit 44 determines, as communication operators, those operators that give rise to communication within the distributed database system, in particular copy operators, and/or map reduce operators, and/or sort operators, and/or reorder join operators, and/or hash join operators, and/or broadcast hash join operators, and/or merge join operators. Finally, the marking unit 44 tags each determined communication operator with a data tag that includes the communication mode determined for that operator.

In addition, the marking unit 44 distinguishes a set of network communication modes based on the communication operators. In particular, the marking unit 44 determines an end-to-end communication mode for a copy operator, and/or a map reduce operator, and/or a sort operator, and/or a reorder join operator, and/or a hash join operator, and/or a broadcast hash join operator, and/or a merge join operator. It determines a multicast or broadcast communication mode for a copy operator and/or a broadcast hash join operator. In addition, the marking unit 44 determines a many-to-many communication mode for a reorder join operator, and/or a hash join operator, and/or a merge join operator.
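The correspondence between operator types and communication modes described above can be written down as a small lookup structure. The following Python sketch is illustrative: the identifier spellings, the `PREFERENCE` order and the function `communication_mode` are assumptions; the text only states which modes may be determined for which operators, not how one of several candidates is chosen.

```python
END_TO_END = "end_to_end"
MULTICAST = "multicast"
MANY_TO_MANY = "many_to_many"

# Candidate modes per operator type, as enumerated in the paragraph above.
CANDIDATE_MODES = {
    "copy": {END_TO_END, MULTICAST},
    "map_reduce": {END_TO_END},
    "sort": {END_TO_END},
    "reorder_join": {END_TO_END, MANY_TO_MANY},
    "hash_join": {END_TO_END, MANY_TO_MANY},
    "broadcast_hash_join": {END_TO_END, MULTICAST},
    "merge_join": {END_TO_END, MANY_TO_MANY},
}

# Assumption: when several modes are possible, prefer the more specific one
# over plain end-to-end communication.
PREFERENCE = (MULTICAST, MANY_TO_MANY, END_TO_END)


def communication_mode(operator_type):
    """Return the mode for a communication operator, or None for other operators."""
    candidates = CANDIDATE_MODES.get(operator_type)
    if candidates is None:
        return None  # not a communication operator, so it is not tagged
    for mode in PREFERENCE:
        if mode in candidates:
            return mode
    return None
```

For example, `communication_mode("hash_join")` yields the many-to-many mode under this preference order, while an operator such as a hypothetical `"filter"` is not a communication operator and yields `None`.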

The tagged physical plan is then passed to the code generator 45, which generates executable code according to the physical plan and converts the data tags into communicator instructions. The executable code, together with the communicator instructions, is passed to the code executor 46, which executes the executable code. Further, the communicator 47 communicates with the other data processing apparatuses 402 and 403 based on the communicator instructions.

Further, the marking unit 44 marks the total amount of data to be transferred by the operator. The communicator 47 then determines a communication protocol for each operator based on the total amount of data to be transferred and based on the communication mode.

SQL is a de facto standard for database access. The following is an example, query Q3 of the exemplary TPC-H benchmark:

The SQL query text is processed by a database engine, which in a first step converts it into a tree-like representation called a logical plan. The database engine then performs logical optimization to generate an optimized logical plan, which is subsequently converted into a plan expressed in terms of the low-level database API. This plan is called a physical plan and can itself be optimized by taking physical parameters of the database into account.

The leaves of the physical plan tree represent data sources, and the nodes are physical plan operators representing different basic operations on the relational database.

An additional physical-plan processing step, shown in detail in FIG. 5, is appended at the end of this processing chain. The input 50 consists of a physical plan 51 and data-related information 52. This input 50 is passed to a function block 53, which constitutes the extended planning phase. The function block 53 includes detecting the communication-related physical plan operators 54. The correspondence between operators and communication modes stored in a knowledge base 56 is used as input for identifying the communication-related operators. In addition, the extended planning phase includes tagging the detected operators 55. Finally, the extended physical plan is passed on as output 57.
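A minimal Python sketch of such an extended planning phase is given below. The `PlanNode` class, the shape of the knowledge base and the per-operator size estimates in `data_info` are assumptions chosen for illustration; the description does not prescribe a concrete data structure.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PlanNode:
    """One physical plan operator; leaves stand for data sources."""
    op_type: str
    children: list = field(default_factory=list)
    tag: Optional[dict] = None  # data tag, set only for communication operators


# Hypothetical knowledge base: operator type -> communication mode.
KNOWLEDGE_BASE = {
    "sort": "end_to_end",
    "hash_join": "many_to_many",
    "broadcast_hash_join": "multicast",
}


def extend_plan(node, knowledge_base, data_info):
    """Detect communication operators and attach a data tag to each of them."""
    mode = knowledge_base.get(node.op_type)
    if mode is not None:
        # Assumption: data_info maps an operator type to an estimated byte count.
        node.tag = {"mode": mode, "total_bytes": data_info.get(node.op_type)}
    for child in node.children:
        extend_plan(child, knowledge_base, data_info)
    return node


def tagged(node):
    """List (operator type, mode) for every tagged operator, in pre-order."""
    found = [(node.op_type, node.tag["mode"])] if node.tag else []
    for child in node.children:
        found.extend(tagged(child))
    return found


plan = PlanNode("sort", [
    PlanNode("hash_join", [
        PlanNode("scan"),
        PlanNode("broadcast_hash_join", [PlanNode("scan"), PlanNode("scan")]),
    ]),
])
extend_plan(plan, KNOWLEDGE_BASE, {"hash_join": 4096})
```

The plan is modified in place, so the same tree that leaves the physical planner is handed on to code generation with the tags attached.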

FIG. 6 illustrates an exemplary physical plan 60. The physical plan 60 includes a plurality of operators 601 to 614. The communication operators 601, 603, 604 and 605, which are the operators that are detected, are marked with dotted lines.

FIG. 7 illustrates an exemplary extended physical plan 70. Here, additional data tags 71, 72, 73, 74, and 75 are integrated into the extended physical plan 70. Each of the data tags 71 to 75 contains information about the communication mode employed. Other information may also be stored within a data tag.

The data tags are firmly bound to their associated physical plan operators and convey information about the communication mode, such as the communication mode ID, for the subsequent data exchange.

The communication mode ID associated with a particular physical plan operator is selected using, for example, the following table. The table specifies a correspondence between physical plan operators and communication modes.

FIG. 8 shows details of the code generation phase. The extended physical plan 81 generated by the tagging unit 44 of fig. 4 serves as an input. The extended physical plan is passed to a code generator 82, the code generator 82 corresponding to the code generator 45 of FIG. 4. The code generator 82 uses information stored in a code generation library 84. In particular, code generator 82 uses a set of transformation rules for physical plan operator 85 and a set of library modifications 86 that include data markup transformations and modifications for existing methods. As output, executable code 83 is generated by code generator 82.

One possible exemplary method of code generation is to convert the physical plan into code written in a general-purpose language, such as C++. The advantage of this method is that the generated code can be compiled into executable code with the additional optimizations provided by the compiler. Code generation is performed by a special module, the code generator 82, which converts the tree representation of the physical plan into plain executable code. The extended physical plan generated in the previous step contains a new type of physical plan operator, the data tag, which is also translated into the executable code. The code generator is therefore extended by a translator for data tag physical plan operators.

In addition to the translator for data tags, the translators for the existing physical plan operators need to be modified to give the communication layer access to communication-related data.
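The idea of a code generator with one translator per operator type, extended by a translator for data tags, can be sketched as follows. This Python example emits C++-like text purely for illustration: the emitted calls (`scan`, `hash_join`, `comm.set_mode`), the dictionary layout of the plan and the placement of the communicator instruction directly before its operator are all assumptions and do not reproduce the actual generated code.

```python
def emit_scan(node):
    """Translator for a data source."""
    return f'auto {node["id"]} = scan("{node["table"]}");'


def emit_hash_join(node):
    """Translator for an existing operator."""
    left, right = (child["id"] for child in node["children"])
    return f'auto {node["id"]} = hash_join({left}, {right});'


def emit_data_tag(node):
    """Translator for the data tag: it becomes a communicator instruction."""
    return f'comm.set_mode({node["tag"]["mode_id"]});'


TRANSLATORS = {"scan": emit_scan, "hash_join": emit_hash_join}


def generate(node, lines=None):
    """Post-order walk, so that every input is emitted before its consumer."""
    lines = [] if lines is None else lines
    for child in node.get("children", []):
        generate(child, lines)
    if "tag" in node:
        # Assumed placement: the instruction precedes the tagged operator.
        lines.append(emit_data_tag(node))
    lines.append(TRANSLATORS[node["op"]](node))
    return lines


plan = {
    "op": "hash_join", "id": "j1", "tag": {"mode_id": 3},  # mode ID is hypothetical
    "children": [
        {"op": "scan", "id": "s1", "table": "orders"},
        {"op": "scan", "id": "s2", "table": "lineitem"},
    ],
}
code = generate(plan)
```

Because the data tag has its own translator, untagged operators are generated exactly as before and only tagged operators gain an additional communicator instruction.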

An exemplary prototype query execution system is referred to as Flint. Flint allows execution of a query physical plan represented in C++, which can be regarded as the output of the code generation phase. A detailed description of this method is given later.

In the execution phase, the generated program runs on the distributed cluster. All nodes in the cluster process their portion of the input data synchronously, meaning that they perform the same operation on the data at the same time. FIG. 9 illustrates SQL query execution on an individual node of the cluster; the figure depicts the execution 90 of a particular tagged physical operator.

The data 91 to be processed and a data tag 92 are used as input. The data tag 92 includes communication-related information 98, such as a communication mode ID and additional service information. The generated executable code 93 is then executed, which involves processing the local data 94, the data to be exchanged 95, and the result data 96. The data to be exchanged 95 is transmitted by a communicator 99, which corresponds to the communicator 47 of FIG. 4. After the executable code 93 has been executed, the next operator 97 is processed.

FIG. 10 shows the process performed by the communicator. Code execution 101 includes an executor application 102 that is extended with a communicator 103. The communicator 103 encapsulates all transport layer protocols 104, 105 and 106 to be used at runtime. Each of the transport layer protocols 104, 105 and 106 forms a network 1001 between all cluster nodes 107, 108 and 109, and its addressing is explicitly converted into the addressing used by the application. To this end, the communicator 103 maintains a conversion table that stores the correspondence between application layer addresses and transport layer addresses.

Application layer address	TCP address	…	TIPC address
0	192.168.1.0:5555	…	1.1.1:(100,1)
1	192.168.1.1:5555	…	1.1.2:(100,2)
…	…	…	…
N	192.168.1.N:5555	…	1.1.3:(100,2)
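A conversion table of this kind can be modeled as a nested mapping from application layer address to per-protocol transport address. The Python sketch below uses the first two rows of the table; the name `ADDRESS_TABLE`, the protocol keys `"tcp"` and `"tipc"` and the `resolve` function are assumptions made for illustration, and the addresses are kept as opaque strings.

```python
# Application layer address -> {protocol name -> transport layer address}.
ADDRESS_TABLE = {
    0: {"tcp": "192.168.1.0:5555", "tipc": "1.1.1:(100,1)"},
    1: {"tcp": "192.168.1.1:5555", "tipc": "1.1.2:(100,2)"},
}


def resolve(app_address, protocol, table=ADDRESS_TABLE):
    """Translate an application layer address into a transport layer address."""
    try:
        return table[app_address][protocol]
    except KeyError:
        # Either the node or the protocol is unknown to the communicator.
        raise LookupError(
            f"no {protocol} address known for node {app_address}"
        ) from None
```

With such a table the application can keep addressing peers by a single node number, and only the communicator needs to know the addressing scheme of each transport protocol.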

The communicator 103 may be based on any number of transport layer protocols 104, 105 and 106, and may even encapsulate other application layer protocols.

FIG. 11 shows a high-level diagram of a communicator 113, which corresponds to the communicator 103 of FIG. 10 and the communicator 47 of FIG. 4. The communication-related information 111 associated with the input data 110 is used for the communication protocol selection 114 described below. The communication protocol is selected based on information stored in the knowledge base 116. The selected protocol ID is passed to the transmitter 115 together with the data. The receiver 117 does not require any special operation to receive data; the received data is simply passed to the higher protocol layers as output data 112.

To select the communication protocol, all data to be exchanged is transferred from the application to the communicator 113. The data may or may not contain additional information related thereto. If the data has no service information, the data is transmitted using a default transmission protocol. The following table shows possible correspondences between the determined communication mode and the resulting communication protocol.

If the data is tagged with additional information, the data is transmitted using a transmission protocol that more closely matches the data exchange communication mode. Deciding what transport protocol to use involves the following:

static knowledge about the transport protocol obtained in advance;

dynamic data exchange parameters such as: communication mode ID, amount of data within the mode, level of optimization, etc.

Finally, the determined communication protocol is used for data transmission.
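The selection logic described above can be sketched in Python as a rule table that combines static knowledge with the dynamic parameters carried in the data tag. The rule format, the default protocol and in particular the size threshold in the third rule are assumptions invented for this illustration; only the pairing of TCP with end-to-end traffic and of TIPC with multicast traffic follows the background discussion.

```python
DEFAULT_PROTOCOL = "tcp"  # assumption: used for data without service information

# Static knowledge: (communication mode, minimum total bytes, protocol).
# The first matching rule wins. The third rule and its threshold are invented
# purely to show how the amount of data could enter the decision.
RULES = [
    ("multicast", 0, "tipc"),
    ("end_to_end", 0, "tcp"),
    ("many_to_many", 64 * 1024, "tipc"),
]


def select_protocol(tag, rules=RULES):
    """Choose a transport protocol ID from the data tag, if there is one."""
    if not tag:
        return DEFAULT_PROTOCOL  # untagged data goes over the default protocol
    for mode, min_bytes, protocol in rules:
        if tag.get("mode") == mode and tag.get("total_bytes", 0) >= min_bytes:
            return protocol
    return DEFAULT_PROTOCOL
```

Under these assumed rules, a multicast exchange is sent over TIPC regardless of size, whereas a small many-to-many exchange falls back to the default protocol.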

In particular, at the sending end, application generated traffic is multiplexed between supported transport layer protocols according to the protocol ID. At the receiving end, the data received under the different protocols are combined into one stream for transmission to the corresponding application without any transmission related information being transmitted to higher protocol layers, so that the data is transmitted as it is without any additional fields related thereto.

The transmitter of node X 1201 and the receiver of node Y 1208 are shown in Fig. 12. The transmitter of node X includes a multiplexer 1204 and a plurality of protocol stacks 1205, 1206, and 1207. The input data 1202 and the corresponding protocol ID 1203 are passed to the multiplexer 1204, which selects the corresponding protocol stack 1205 to 1207 based on the protocol ID 1203 and transmits the data 1202 to the receiver of node Y 1208 using that protocol stack. The receiver of node Y 1208 includes a plurality of protocol stacks 1209 to 1211 and a demultiplexer 1212. When data is sent using a specific protocol stack 1205 to 1207 of the transmitter of node X 1201, the corresponding protocol stack 1209 to 1211 of the receiver of node Y 1208 decodes the data, which is then passed through the demultiplexer 1212 and provided as output data 1213.
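The dispatch by protocol ID can be illustrated with a short sketch. It is not the transmitter or receiver of Fig. 12: the classes `Multiplexer` and `Demultiplexer`, the aliases `ProtocolId` and `Payload`, and the use of a callback as a stand-in for a real protocol stack are assumptions made for illustration only.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

using ProtocolId = int;       // hypothetical numeric protocol ID
using Payload = std::string;  // hypothetical payload representation

// Sender side: forwards each payload to the stack registered under its protocol ID.
class Multiplexer {
public:
    using Stack = std::function<void(const Payload&)>;

    void registerStack(ProtocolId id, Stack send) { stacks_[id] = std::move(send); }

    void send(ProtocolId id, const Payload& data) const {
        auto it = stacks_.find(id);
        assert(it != stacks_.end() && "no protocol stack registered for this ID");
        it->second(data);
    }

private:
    std::map<ProtocolId, Stack> stacks_;
};

// Receiver side: collects whatever the receiving stacks decode into one output
// stream. No protocol ID is stored, so nothing transport-related is passed upward.
class Demultiplexer {
public:
    void deliver(const Payload& decoded) { output_.push_back(decoded); }
    const std::vector<Payload>& output() const { return output_; }

private:
    std::vector<Payload> output_;
};
```

In a usage scenario, each registered callback would wrap one transport (for example a TCP socket or a TIPC socket) and the matching receiving stack would call `deliver()` after decoding.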

Fig. 13 shows a flow diagram of an embodiment of a method for operating a distributed database system. In a first step 130, a logical plan is generated from the received database query, and then in a second step 131, the generated logical plan is used to generate a physical plan. In a third step 132, communication operators within the physical plan are determined. In a fourth step 133, the communication mode of each communication operator within the physical plan is determined. In a fifth step 134, the determined communication operators are tagged within the physical plan, generating an extended physical plan. In a sixth step 135, executable code is generated from the extended physical plan. In a seventh step 136, the data tags within the extended physical plan are converted into communicator instructions. In an eighth step 137, the executable code is executed. In a final step 138, the communicator instructions generated in the seventh step 136 are used to perform communication between the different nodes of the distributed database system.
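Steps 132 to 134 can be sketched as a single pass over the physical plan. The sketch below is an assumption-laden illustration: the plan is flattened into a vector rather than a tree, the names `OpType`, `DataTag`, `Operator`, `modeFor`, and `tagPlan` are hypothetical, and the mapping from operator type to communication mode follows one possible reading of the modes named in the claims (multicast for copy and broadcast hash join, many-to-many for reorder, hash, and merge joins).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical operator types; Scan and Filter are assumed to be purely local.
enum class OpType { Scan, Filter, Copy, HashJoin, BroadcastHashJoin, MergeJoin, ReorderJoin };
enum class CommMode { EndToEnd, Multicast, ManyToMany };

// Data tag: communication mode plus the total amount of data to be communicated.
struct DataTag {
    CommMode mode;
    std::size_t totalBytes;
};

struct Operator {
    Operator(OpType t, std::size_t bytes) : type(t), estimatedBytes(bytes) {}
    OpType type;
    std::size_t estimatedBytes;  // size estimate supplied by the planner (assumed)
    bool tagged = false;
    DataTag tag{CommMode::EndToEnd, 0};
};

// Step 132: decide whether an operator involves communication.
bool isCommunicationOperator(OpType t) {
    return t != OpType::Scan && t != OpType::Filter;
}

// Step 133: derive the communication mode from the operator type.
CommMode modeFor(OpType t) {
    assert(isCommunicationOperator(t));
    switch (t) {
        case OpType::Copy:
        case OpType::BroadcastHashJoin:
            return CommMode::Multicast;
        case OpType::HashJoin:
        case OpType::MergeJoin:
        case OpType::ReorderJoin:
            return CommMode::ManyToMany;
        default:
            return CommMode::EndToEnd;
    }
}

// Step 134: tag every communication operator; returns the number of tagged operators.
std::size_t tagPlan(std::vector<Operator>& plan) {
    std::size_t count = 0;
    for (Operator& op : plan) {
        if (!isCommunicationOperator(op.type)) continue;
        op.tag = DataTag{modeFor(op.type), op.estimatedBytes};
        op.tagged = true;
        ++count;
    }
    return count;
}
```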

It should also be noted that the statements made in relation to the data processing device also apply to the method of operating a distributed database system.

Figs. 14 to 16 show the acceleration of database queries achieved with the aforementioned method, in particular based on the two protocols TCP and TIPC. As a baseline, a standard method is applied in which an end-to-end oriented protocol is used. The popular TPC-H decision support benchmark is used for the measurements. TPC-H consists of a suite of business-oriented ad hoc queries and concurrent data modifications. The queries and the data populating the database have been chosen to have broad industry-wide relevance. The benchmark illustrates decision support systems that examine large volumes of data, execute highly complex queries, and give answers to critical business questions. The results for query Q8 are shown here using a scale factor of 100, which produces approximately 100 GB of data. Fig. 14 shows the results of an exemplary execution of query Q8, in particular the results of Q8 without using a broadcast hash join, to illustrate the performance benefits of not using a broadcast hash join in both methods, the standard method and the method according to the present invention. The number of nodes is shown on the x-axis and the execution time of the query on the y-axis.

The method according to the invention clearly outperforms both exemplary solutions irrespective of the number of nodes, although the improvement is not as great as expected. This is because, in the test of Fig. 14, the broadcast hash join operation processes relatively small portions of data, so its duration has little impact on the overall result. To illustrate the benefits of the present invention more accurately, a speedup comparison for a single broadcast hash join operation is shown in Fig. 15. Here, the number of nodes is shown on the x-axis and the acceleration factor on the y-axis.

The question of scalability is also of interest. The dependency between the execution time reduction and the number of nodes in the cluster is therefore analyzed and shown in Fig. 16. In Fig. 16, the percentage of processing time (relative to the baseline, TCP without broadcast) is shown on the y-axis, while the number of nodes is shown on the x-axis.

The main conclusion is that using broadcast hash joins with 32 nodes may even reduce performance compared to the standard method. With the method of the invention, performance gains are still obtained.

As a further alternative, communication modes and corresponding transport layer protocols beyond those proposed above may be employed. Different transport layer solutions may target particular use cases or features. For example, a possible solution for the many-to-many communication mode may be a transport protocol that provides fair communication among all nodes.

The proposed method can also be applied to other distributed computations with a different set of physical operators, such as reduction and prefix scan. For example, an MPI transport layer may be used.

Some more details are given below regarding an implementation of embodiments of the computer program according to the third aspect of the invention. The software framework is named Flint. Flint is a distributed SQL query execution framework that allows execution of a physical query plan represented in C++. This query execution plan, written in C++, may be assumed to be the output of the code generation phase. The following is a code example for query Q3 of the TPC-H benchmark, showing how the proposed code generation method is implemented. The top level of the Flint physical plan for query Q3 has the following representation:

The data tags added in the previous step are converted into special operators in the aforementioned listing (lines 7, 12, 18 and 24), which append the communication mode ID and the total data size estimate to the data to be exchanged.

The basis for the query implementation of the foregoing listing is the Dataset base class:

The method marker() has the following implementation:

Thus, when the method marker() is called, an instance of the MarkedDataset class is created and a pointer to it is returned. The MarkedDataset class has the following definition:

The overridden method next() simply forwards the record pointer returned by the method next() of the input dataset. Most importantly, the getServiceInfo() method is overridden so that it returns a pointer to the service information provided to the marker() method.
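The relationship between Dataset and MarkedDataset described above resembles a decorator. The following sketch illustrates that idea only; it is not the Flint listing. The `Record` and `ServiceInfo` layouts, the `VectorDataset` helper, the use of `std::unique_ptr`, and the free function `marker()` (standing in for the method of the same name) are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

struct Record { int key; };  // hypothetical record layout

// Hypothetical service information: communication mode ID and size estimate.
struct ServiceInfo {
    int commModeId;
    std::size_t totalBytes;
};

class Dataset {
public:
    virtual ~Dataset() = default;
    virtual const Record* next() = 0;  // returns nullptr when exhausted
    // Untagged datasets carry no service information.
    virtual const ServiceInfo* getServiceInfo() const { return nullptr; }
};

// Hypothetical in-memory source used only to drive the example.
class VectorDataset : public Dataset {
public:
    explicit VectorDataset(std::vector<Record> rows) : rows_(std::move(rows)) {}
    const Record* next() override {
        return pos_ < rows_.size() ? &rows_[pos_++] : nullptr;
    }

private:
    std::vector<Record> rows_;
    std::size_t pos_ = 0;
};

// Wraps an input dataset and attaches service information to it.
class MarkedDataset : public Dataset {
public:
    MarkedDataset(std::unique_ptr<Dataset> input, ServiceInfo info)
        : input_(std::move(input)), info_(info) {
        assert(input_ != nullptr);
    }
    // Records are forwarded unchanged from the input dataset.
    const Record* next() override { return input_->next(); }
    // The tag supplied at construction is exposed to the communicator.
    const ServiceInfo* getServiceInfo() const override { return &info_; }

private:
    std::unique_ptr<Dataset> input_;
    ServiceInfo info_;
};

// Free-function stand-in for the marker() method.
std::unique_ptr<Dataset> marker(std::unique_ptr<Dataset> input, ServiceInfo info) {
    return std::unique_ptr<Dataset>(new MarkedDataset(std::move(input), info));
}
```

Because the wrapper changes only getServiceInfo(), downstream operators can consume a marked dataset exactly like an unmarked one.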

The newly created service information may be used in the replication phase of the broadcast hash join. The following is an implementation of the broadcastHashJoin() method of the Dataset class:

Like the marker() method, the broadcastHashJoin() method creates an instance of the BroadcastHashJoin class and returns a pointer to it.

When the BroadcastHashJoin object is constructed, a hash table (line 19 of the foregoing listing) is created from the inner table, which is copied to each node in the cluster.
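The build and probe phases of a broadcast hash join can be sketched as follows. This is an illustrative sketch rather than the Flint class: the `Row` layout, the `probe()` method, and the in-memory vectors are assumptions, and the network replication of the inner table is taken as already done.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct Row {  // hypothetical row layout: join key plus one payload column
    int key;
    std::string value;
};

class BroadcastHashJoin {
public:
    // innerReplica: the full copy of the inner table that this node received
    // during the replication phase. The hash table is built at construction.
    explicit BroadcastHashJoin(const std::vector<Row>& innerReplica) {
        for (const Row& r : innerReplica) {
            table_.emplace(r.key, r.value);
        }
        assert(table_.size() == innerReplica.size());
    }

    // Joins the node-local partition of the outer table against the hash table.
    // Returns (outer value, inner value) pairs for every matching key.
    std::vector<std::pair<std::string, std::string>> probe(
            const std::vector<Row>& localOuter) const {
        std::vector<std::pair<std::string, std::string>> out;
        for (const Row& o : localOuter) {
            auto range = table_.equal_range(o.key);
            for (auto it = range.first; it != range.second; ++it) {
                out.emplace_back(o.value, it->second);
            }
        }
        return out;
    }

private:
    std::unordered_multimap<int, std::string> table_;
};
```

Since every node holds the whole inner table, the probe needs no further communication, which is why only the replication phase benefits from a multicast-capable protocol.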

The following are implementations of the replicate() method and the ReplicatedDataset class:

Replication is performed by BroadcastJob (line 17 of the foregoing listing), which prepares and sends the specified dataset:

Finally, the service information associated with the MarkedDataset (line 12 of the aforesaid listing and line 20 of the preceding listing) is collected and passed to the communicator together with the data to be transmitted.

The other communication modes use the same method. For example, ScatterJob implements the reordering procedure used by the physical plan operators hash join and reorder join:

The invention is not limited to the examples shown above. The features of the exemplary embodiments can be used in any advantageous combination.

The invention is described herein in connection with various embodiments. Other variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed invention, from a study of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the indefinite article "a" or "an" does not exclude a plurality. A single processor or other unit may fulfill the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage. A computer program may be stored or distributed on a suitable medium, such as an optical storage medium or a solid-state medium supplied together with or as part of other hardware, but may also be distributed in other forms, such as via the internet or other wired or wireless telecommunication systems.

Claims

1. A data processing apparatus for performing portions of the operations of a distributed database system, comprising:

a logic planner to generate a logical plan based on the database query;

a physical planner to generate a physical plan based on the logical plan;

a marking unit for:

determining a communication operator within the physical plan, wherein the communication operator is an operator containing a communication;

determining a communication mode of the communication operator based on the operator type of the communication operator;

tagging the determined communication operators, each operator having a data tag comprising a communication mode of the determined communication operator, wherein the data tag of the communication operator of the physical plan is used to select the corresponding communication mode for communication within the distributed database system, the data tag further comprising a total amount of data to be communicated by the operator;

a code generator to:

generating executable code based on the physical plan;

converting the data tag into a communicator instruction;

a code executor to execute the executable code;

a communicator for dynamically determining a communication protocol to be used for each operator based on a total amount of data to be transmitted by the operator;

communicating with other data processing devices within the distributed database system, based on the communicator instructions, using the determined communication protocols to be used for the respective operators.

2. The data processing device of claim 1, wherein the database query is an SQL query.

3. The data processing apparatus of claim 2, wherein the marking unit is configured to determine a copy operator as a communication operator, and/or a mapreduce operator, and/or an order operator, and/or a reorder join operator, and/or a hash join operator, and/or a broadcast hash join operator, and/or a merge join operator.

4. The data processing device of claim 3,

the marking unit is configured to distinguish a set of network communication modes based on the communication operators; and/or

the marking unit is configured to determine an end-to-end communication mode for a copy operator, and/or a mapreduce operator, and/or an order operator, and/or a reorder join operator, and/or a hash join operator, and/or a broadcast hash join operator, and/or a merge join operator; and/or

the marking unit is configured to determine a multicast or broadcast communication mode for a copy operator and/or a broadcast hash join operator; and/or

the marking unit is configured to determine a many-to-many communication mode for a reorder join operator, and/or a hash join operator, and/or a merge join operator.

5. The data processing device of claim 4,

the communicator is operable to dynamically determine a communication protocol to be used for each operator based at least on communicator instructions.

6. The data processing device of claim 5,

the communicator is configured to communicate based on a communication protocol determined for each operator.

7. The data processing device of any one of claims 1 to 6,

the data processing device further comprises a storage unit for storing at least a part of the data stored in the distributed database system.

8. The data processing device of any one of claims 1 to 6,

the data processing apparatus further comprises a query receiver for receiving a database query.

9. The data processing device of any one of claims 1 to 6,

the communicator is for transmitting at least a portion of the data to be processed to other data processing devices.

10. A database system, characterized in that it comprises at least a first data processing device according to any of claims 1 to 9 and a second data processing device according to any of claims 1 to 9,

the communicator of the first data processing device is configured to communicate with at least the second data processing device based on the determined communicator instruction.

11. The database system of claim 10,

the database system comprises at least a third data processing device, wherein,

the communicator of the first data processing device is configured to perform communication with at least a second data processing device and a third data processing device based on the determined communicator instruction.

12. A method for operating a database system comprising a plurality of data processing devices, comprising:

generating a logical plan based on the database query;

generating a physical plan based on the logical plan;

determining communication operators within the physical plan;

tagging the determined communication operators, each operator having a data tag comprising a communication mode of the determined communication operator, wherein the data tag of the communication operator of the physical plan is used to select the corresponding communication mode for communication between the plurality of data processing devices of the database system, the data tag further comprising a total amount of data to be communicated by the operator;

generating executable code based on the physical plan;

converting the data tag into a communicator instruction;

dynamically determining a communication protocol to be used for each operator based on a total amount of data to be transmitted by the operator;

executing the executable code;

13. A computer storage medium, characterized in that the computer storage medium stores a computer program for performing the method of claim 12 when the computer program runs on a computer.