CN109062929B - Query task communication method and system

Info

Publication number: CN109062929B
Application number: CN201810596030.1A
Authority: CN
Inventors: 陈榕; 陈海波; 臧斌宇; 管海兵; 王思源
Original assignee: Shanghai Jiaotong University
Current assignee: Shanghai Jiaotong University
Priority date: 2018-06-11
Filing date: 2018-06-11
Publication date: 2020-11-06
Anticipated expiration: 2038-06-11
Also published as: CN109062929A

Abstract

The invention provides a query task communication method and system. The method comprises: parsing a query request at the server that receives it, and decomposing the query statements in the query request into a plurality of sub-steps, wherein information about the sub-steps is part of the metadata of the query task; processing the query request step by step, starting from the first of the sub-steps, to obtain a query intermediate result; and, if the data that the next sub-step depends on resides on a remote server, sending the query intermediate result to the remote server via GPUDirect RDMA and the metadata of the query task via RDMA, whereupon the remote server continues to process the query request according to the received query intermediate result and metadata. The invention reduces the overhead of the communication process, avoids contention for network resources, and improves the performance of the query system as a whole.

Description

Query task communication method and system

Technical Field

The invention relates to the technical field of communication, in particular to a query task communication method based on GPUDirect RDMA.

Background

In the big data era, data sets keep growing; for example, the Internet contains billions of web pages. Such large data sets are usually partitioned and stored across multiple machines. To find data of interest in a vast data set, the software that provides the query service therefore typically runs in a distributed environment consisting of multiple machines.

With the continuous development of hardware technology, servers equipped with high-performance Graphics Processing Units (GPUs) are gradually appearing in data centers. A GPU offers stronger computing performance and higher memory bandwidth than a CPU, so it is often used as an accelerator for computing tasks, supplementing the CPU. The discrete GPUs widely used in data centers have their own dedicated memory, which is separate from the system memory (CPU memory) used by the CPU. Therefore, before a computing task can be launched on the GPU, the data it requires must first be copied into GPU memory.

When a query task is processed in a distributed computing environment, its intermediate results often have to be sent between machines. For example, when server A sends the intermediate result of a query task to server B, the intermediate result must first be copied from GPU memory to CPU memory on server A; the data is then sent over the network to the CPU memory of server B, and server B copies it into its GPU memory before continuing to process the query task. Such frequent copying between GPU and CPU memory during communication significantly increases the time taken by the query task, and may cause a poor user experience for query tasks with low latency tolerance.

RDMA stands for Remote Direct Memory Access.

GPUDirect RDMA, a technology recently developed by NVIDIA, aims to reduce unnecessary memory copies during communication between GPU servers by sending data in the GPU memory of server A directly to the GPU memory of server B over a high-performance network. This opens a new possibility for inter-server communication in a distributed computing environment. However, how to use this new technology to reduce the processing latency of query tasks remains a technical problem to be solved.

Disclosure of Invention

Aiming at the defects in the prior art, the invention aims to provide a query task communication method and a query task communication system.

The communication method for the query task provided by the invention comprises the following steps:

a registration step: distributing and loading the data set on each server in the cluster, and registering a GPU memory and a CPU memory on the servers for GPUDirect RDMA and RDMA respectively;

a query request step: sending the query request to a server in the cluster;

a query parsing step: parsing the query request at the server that receives it, and decomposing the query statements in the query request into a plurality of sub-steps, wherein information about the sub-steps is part of the metadata of the query task;

a query processing step: processing the query request step by step, starting from the first of the plurality of sub-steps, to obtain a query intermediate result;

if the data that the next sub-step depends on resides on a remote server, sending the query intermediate result to the remote server via GPUDirect RDMA and the metadata of the query task via RDMA, whereupon the remote server continues to process the query request according to the received query intermediate result and the metadata of the query task.
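The control flow of the steps above can be sketched as a small simulation. This is only an illustration under stated assumptions: the `Server` class, the `run_query` function, and the use of plain byte strings in place of registered GPU and CPU memory are all hypothetical, and no real RDMA call is made. The two buffer assignments merely stand in for the two separate transfers (intermediate result and metadata) that the method describes.

```python
from dataclasses import dataclass, field

@dataclass
class Server:
    name: str
    local_keys: set                 # keys of the data-set shard stored on this server
    gpu_rdma_buffer: bytes = b""    # stand-in for the registered GPU memory
    cpu_rdma_buffer: bytes = b""    # stand-in for the registered CPU memory
    log: list = field(default_factory=list)

def run_query(start, cluster, sub_steps):
    """Process sub-steps in order, hopping to another server when data is remote."""
    server, result, i = start, b"", 0
    while i < len(sub_steps):
        key = sub_steps[i]
        if key in server.local_keys:
            result += key.encode() + b";"    # placeholder for the GPU matching operation
            server.log.append(key)
            i += 1
            continue
        # Data for the next sub-step is remote: find its owner (assumed to exist).
        remote = next(s for s in cluster if key in s.local_keys)
        remote.gpu_rdma_buffer = result                             # data channel
        remote.cpu_rdma_buffer = ",".join(sub_steps[i:]).encode()   # control channel
        server, result = remote, remote.gpu_rdma_buffer
    return server.name, result

a = Server("A", {"s1", "s2"})
b = Server("B", {"s3"})
finished_on, final_result = run_query(a, [a, b], ["s1", "s2", "s3"])
```

In this toy run, server A executes the first two sub-steps, hands the accumulated result and the remaining sub-step list to server B, and B finishes the query.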

Preferably, the registering step includes:

loading a data set onto the servers in the cluster, performing initialization, and registering a GPU memory and a CPU memory on each server.

Preferably, the query requesting step includes:

after receiving the query request, the server initializes the data related to the query task and empties the intermediate result table in preparation for processing the query task.

Preferably, the step of parsing the query includes:

the server parses the query request, which comprises a plurality of query statements, and decomposes it into a plurality of sub-steps according to the different query statements; before each sub-step is performed, the data that the sub-step depends on is copied from the CPU memory to the GPU memory, and the processing logic of the sub-step is then executed on the GPU.

Preferably, the query processing step includes:

the server processes the query request starting from the first sub-step, and performs a matching operation on the data set using the query condition in the sub-step; the control flow logic of the query request is executed on the CPU, the matching operation on the data set is executed on the GPU, and the query intermediate result obtained by the matching operation is stored in GPU memory; the data set is distributed across the whole cluster, so after the server that received the query request has executed some of the sub-steps locally, it determines whether the data that the next sub-step depends on is local; if so, the server continues to process the subsequent sub-steps; if not, it sends the intermediate result to the remote server, which executes the subsequent sub-steps based on the received intermediate result;

the server sends the query intermediate result to the remote server as follows: taking the start address of the GPU memory and the size of the query intermediate result as parameters, the server invokes a one-sided operation of the RDMA network card to write the query intermediate result into the GPU memory of the remote server, the query intermediate result being the data information of the query task;

after sending the query intermediate result, the server must also send the metadata of the query task; the metadata records the subsequent sub-steps of the query request, and the remote server executes those sub-steps according to the metadata; the server serializes the metadata, copies the serialized metadata into the CPU memory, and, taking the start address of the buffer and the size of the metadata as parameters, invokes a one-sided operation of the RDMA network card to write the metadata into the CPU memory of the remote server.

Preferably, after receiving the query intermediate result, the remote server copies the intermediate result from the GPU memory to another GPU memory, and records the start address of the other GPU memory;

the remote server then receives the metadata of the query task, copies the metadata from the CPU memory to another CPU memory, and obtains the metadata information of the query task after deserialization; the recorded start address of the GPU memory is stored in the metadata;

the remote server executes the control flow logic of the query task on the CPU according to the metadata and continues with the subsequent sub-steps: it copies the data that each sub-step depends on from the CPU memory to the GPU memory, and performs the matching operation on the data set on the GPU based on the previously obtained intermediate result.
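The receiver-side handling described above can be sketched as follows. This is a simulation under assumptions, not the patented implementation: `GpuHeap` is a hypothetical bump allocator standing in for GPU memory outside the registered buffer, byte strings stand in for the registered GPU and CPU memory, and JSON is assumed as the serialization format purely for illustration (the source does not name one).

```python
import json

class GpuHeap:
    """Toy stand-in for GPU memory outside the registered RDMA buffer."""
    def __init__(self, size):
        self.mem, self.top = bytearray(size), 0

    def store(self, data):
        if self.top + len(data) > len(self.mem):
            raise MemoryError("toy GPU heap exhausted")
        addr = self.top
        self.mem[addr:addr + len(data)] = data
        self.top += len(data)
        return addr                      # start address of the copied result

def receive_task(gpu_rdma_buffer, cpu_rdma_buffer, heap):
    # 1) Move the intermediate result out of the receive buffer and record its address.
    addr = heap.store(gpu_rdma_buffer)
    # 2) Move the metadata out of the CPU receive buffer, then deserialize it.
    private_copy = bytes(cpu_rdma_buffer)
    metadata = json.loads(private_copy.decode("utf-8"))
    # 3) Store the recorded GPU address in the metadata for the later sub-steps.
    metadata["gpu_addr"] = addr
    return metadata
```

Copying both payloads out of the registered buffers frees those buffers for the next incoming transfer, which is one plausible reason for the extra copy; the source does not state the motivation explicitly.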

The invention provides a query task communication system, which comprises:

a registration module: distributing and loading the data set on each server in the cluster, and registering a GPU memory and a CPU memory on the servers for GPUDirect RDMA and RDMA respectively;

the query request module: sending the query request to a server in the cluster;

a query parsing module: parsing the query request at the server that receives it, and decomposing the query statements in the query request into a plurality of sub-steps, wherein information about the sub-steps is part of the metadata of the query task;

the query processing module: processing the query request step by step from the first substep of the plurality of substeps to obtain an intermediate query result;

Preferably, the registration module includes: loading a data set on servers in a cluster, carrying out initialization work, and respectively registering a GPU memory and a CPU memory in each server; the query request module comprises: initializing relevant data of the query task after the server receives the query request, emptying an intermediate result table and preparing for processing the query task;

the query parsing module comprises: the server parses the query request, which comprises a plurality of query statements, and decomposes it into a plurality of sub-steps according to the different query statements; before each sub-step is performed, the data that the sub-step depends on is copied from the CPU memory to the GPU memory, and the processing logic of the sub-step is then executed on the GPU.

Preferably, the query processing module includes:

Compared with the prior art, the invention has the following beneficial effects:

1. according to the query task communication method based on GPUDirect RDMA, disclosed by the invention, the intermediate result generated when the query task is processed on the GPU can be directly sent to the GPU memory of the remote server from the local GPU memory, so that the copy times of data between the GPU memory and the CPU memory in the communication process are reduced, and further the expense of the whole communication process is reduced.

2. The invention decouples the sending of the data information (query intermediate result) and the control information (metadata) of the query task, the data information uses GPUDirect RDMA, the control information uses RDMA, and the data information and the control information are separately sent by using different communication channels, thereby avoiding the contention of network resources.

3. The query task communication method based on GPUDirect RDMA is suitable for a server cluster provided with a GPU supporting the GPUDirect RDMA technology and a network card, and avoids redundant data copy in the communication process, so that the processing delay of the query task can be reduced, and the performance of the whole query system can be improved.

Drawings

Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:

FIG. 1 is a flow chart of the present invention.

Detailed Description

The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that those skilled in the art can make various changes and modifications without departing from the spirit of the invention. All such changes and modifications fall within the scope of the present invention.

As shown in fig. 1, a query task communication method provided by the present invention includes:

a registration step: distributing and loading the data set onto each server in the cluster, performing initialization, and registering a GPU memory and a CPU memory on the servers for GPUDirect RDMA and RDMA respectively. These two memory regions are referred to as the "GPU RDMA buffer" and the "CPU RDMA buffer", respectively.

a query request step: sending the query request to a server in the cluster;

a query parsing step: parsing the query request at the server that receives it, and decomposing the query statements in the query request into a plurality of sub-steps, wherein information about the sub-steps is part of the metadata (control information) of the query task;
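To make the registration step above concrete, the following sketch models the two registered regions per server. It is a pure simulation: the `Region` type, the `register` helper, and the integer keys are hypothetical stand-ins, and the buffer sizes are arbitrary. A real deployment would presumably register memory through the RDMA verbs interface (for example `ibv_reg_mr` in libibverbs), which is outside the scope of this sketch.

```python
from dataclasses import dataclass
from itertools import count

_keys = count(1)   # hypothetical source of unique region keys

@dataclass(frozen=True)
class Region:
    device: str          # "gpu" or "cpu"
    storage: bytearray   # stand-in for the pinned memory itself
    rkey: int            # key a remote peer would present to write into the region

def register(device, size):
    """Simulate registering `size` bytes of memory on `device`."""
    return Region(device, bytearray(size), next(_keys))

def register_server(gpu_bytes, cpu_bytes):
    """Register one GPU RDMA buffer and one CPU RDMA buffer for a server."""
    return {
        "gpu_rdma_buffer": register("gpu", gpu_bytes),   # receives intermediate results
        "cpu_rdma_buffer": register("cpu", cpu_bytes),   # receives metadata
    }

regions = register_server(gpu_bytes=1024, cpu_bytes=64)
```

Keeping the two regions separate mirrors the later split between the data channel and the control channel.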

Specifically, in the query request step: after receiving the query request, the server initializes the data related to the query task and empties the intermediate result table in preparation for processing the query task.

In the query parsing step: the server parses the query request, which comprises a plurality of query statements, and decomposes it into a plurality of sub-steps according to the different query statements; before each sub-step is performed, the data that the sub-step depends on is copied from the CPU memory to the GPU memory, and the processing logic of the sub-step is then executed on the GPU.
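As a rough illustration of the decomposition just described, the sketch below turns a request into one sub-step per query statement. The source does not specify a query language, so the semicolon-separated statement syntax, the `SubStep` type, and the `decompose` function are assumptions made only for this example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SubStep:
    index: int        # position of the sub-step in the execution order
    statement: str    # the query statement this sub-step evaluates

def decompose(query_request):
    """Split a request into one sub-step per non-empty query statement."""
    statements = [s.strip() for s in query_request.split(";")]
    return [SubStep(i, s) for i, s in enumerate(s for s in statements if s)]

steps = decompose("?x type Student ; ?x takes ?c; ")
```

The resulting list is the kind of information that would be recorded in the metadata of the query task, so that a remote server knows which sub-steps remain.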

In the query processing step: the server processes the query request starting from the first sub-step, and performs a matching operation on the data set using the query condition in the sub-step; the control flow logic of the query request is executed on the CPU, the matching operation on the data set is executed on the GPU, and the query intermediate result obtained by the matching operation is stored in GPU memory; the data set is distributed across the whole cluster, so after the server that received the query request has executed some of the sub-steps locally, it determines whether the data that the next sub-step depends on is local; if so, the server continues to process the subsequent sub-steps; if not, it sends the intermediate result to the remote server, which executes the subsequent sub-steps based on the received intermediate result;

the server sends the query intermediate result to the remote server as follows: taking the start address of the GPU memory and the size of the query intermediate result as parameters, the server invokes a one-sided operation of the RDMA network card to write the query intermediate result into the GPU memory of the remote server, the query intermediate result being the data information of the query task;

after sending the query intermediate result, the server must also send the metadata of the query task; the metadata records the subsequent sub-steps of the query request, and the remote server executes those sub-steps according to the metadata; the server serializes the metadata, copies the serialized metadata into the CPU memory, and, taking the start address of the buffer, the size of the metadata and the like as parameters, invokes a one-sided operation of the RDMA network card to write the metadata into the CPU memory of the remote server. The metadata of the query task includes, but is not limited to, the following information: 1) the size of the obtained query intermediate result; 2) the query sub-steps parsed by the server; 3) a variable for storing a GPU memory address.
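One possible way to serialize the three metadata fields listed above is sketched here. The byte layout (a fixed little-endian header followed by length-prefixed sub-step strings) and the function names are assumptions chosen for illustration; the source does not describe the actual wire format.

```python
import struct

# Header: result size (u64), GPU address slot (u64), number of sub-steps (u32).
HEADER = struct.Struct("<QQI")

def serialize_metadata(result_size, sub_steps, gpu_addr=0):
    """Pack the metadata into bytes ready to be placed in the CPU RDMA buffer."""
    parts = [HEADER.pack(result_size, gpu_addr, len(sub_steps))]
    for step in sub_steps:
        raw = step.encode("utf-8")
        parts.append(struct.pack("<I", len(raw)) + raw)   # length-prefixed string
    return b"".join(parts)

def deserialize_metadata(buf):
    """Inverse of serialize_metadata."""
    result_size, gpu_addr, n = HEADER.unpack_from(buf, 0)
    offset, steps = HEADER.size, []
    for _ in range(n):
        (length,) = struct.unpack_from("<I", buf, offset)
        offset += 4
        steps.append(bytes(buf[offset:offset + length]).decode("utf-8"))
        offset += length
    return {"result_size": result_size, "gpu_addr": gpu_addr, "sub_steps": steps}

wire = serialize_metadata(4096, ["match ?y", "match ?z"])
```

The GPU address slot is left at zero by the sender in this sketch, on the assumption that the receiver fills it in once it knows where it has stored the intermediate result.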

After receiving the intermediate result of the query, the remote server copies the intermediate result from the GPU memory to another GPU memory and records the initial address of the other GPU memory;

On the basis of the query task communication method, the invention further provides a query task communication system, which comprises the following modules:

a registration module: distributing and loading the data set onto each server in the cluster, and registering a GPU memory and a CPU memory on the servers for GPUDirect RDMA and RDMA respectively. These two memory regions are referred to as the "GPU RDMA buffer" and the "CPU RDMA buffer", respectively.

The query request module: sending the query request to a server in the cluster;

Specifically, the query request module comprises: after receiving the query request, the server initializes the data related to the query task and empties the intermediate result table in preparation for processing the query task; the query parsing module comprises: the server parses the query request, which comprises a plurality of query statements, and decomposes it into a plurality of sub-steps according to the different query statements; before each sub-step is performed, the data that the sub-step depends on is copied from the CPU memory to the GPU memory, and the processing logic of the sub-step is then executed on the GPU.

The query processing module comprises: the server processes the query request starting from the first sub-step, and performs a matching operation on the data set using the query condition in the sub-step; the control flow logic of the query request is executed on the CPU, the matching operation on the data set is executed on the GPU, and the query intermediate result obtained by the matching operation is stored in GPU memory; the data set is distributed across the whole cluster, so after the server that received the query request has executed some of the sub-steps locally, it determines whether the data that the next sub-step depends on is local; if so, the server continues to process the subsequent sub-steps; if not, it sends the intermediate result to the remote server, which executes the subsequent sub-steps based on the received intermediate result;

after sending the query intermediate result, the server must also send the metadata of the query task; the metadata records the subsequent sub-steps of the query request, and the remote server executes those sub-steps according to the metadata; the server serializes the metadata, copies the serialized metadata into the CPU memory, and, taking the start address of the buffer, the size of the metadata and the like as parameters, invokes a one-sided operation of the RDMA network card to write the metadata into the CPU memory of the remote server.

More specifically, because the sending end sends the query intermediate result and the metadata separately, the receiving end, after receiving the intermediate result, must go on to receive the metadata of the query and check whether the size of the query intermediate result recorded in the metadata matches the size of the intermediate result actually received, so as to ensure that the integrity of the intermediate result and the metadata was not compromised during network transmission.
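The size comparison described above amounts to a simple check on the receiving end. The sketch below assumes the metadata has already been deserialized into a dictionary with a `result_size` field; both that field name and the `check_consistency` function are hypothetical.

```python
def check_consistency(received_result, metadata):
    """Compare the received intermediate result against the size recorded in the metadata."""
    expected = metadata["result_size"]
    actual = len(received_result)
    if expected != actual:
        raise ValueError(
            f"intermediate result is {actual} bytes, metadata says {expected}"
        )
    return True
```

A size match is a lightweight sanity check rather than a full integrity guarantee; it detects a truncated or mismatched transfer but not corrupted bytes of the right length.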

The query task communication method provided by the invention is built on a complete history record, which stores the intermediate results generated by every sub-step during query processing. The advantage of the complete history record is that it avoids the final result merging operation required by the traditional single-step pruning method. That operation is time-consuming: with single-step pruning, results that do not satisfy the query condition still remain after query processing finishes, so all results must finally be gathered on one server for a final merge, which may become the performance bottleneck of the whole system.
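One way to read the complete history record is that every intermediate row carries the bindings produced by all earlier sub-steps, so a later sub-step can prune against any of them and no final merge is needed. The sketch below illustrates that reading on a toy three-step cycle query over a small edge set. The data, the helper names `extend` and `close_cycle`, and the query itself are invented for the example and are not taken from the source.

```python
def extend(rows, edges):
    """Sub-step that binds a new variable by following an edge from the last binding."""
    return [row + (dst,) for row in rows for src, dst in edges if src == row[-1]]

def close_cycle(rows, edges):
    """Sub-step that keeps rows with an edge from the last binding back to the first."""
    return [row for row in rows if (row[-1], row[0]) in edges]

edges = {("a", "b"), ("b", "c"), ("c", "a"), ("b", "d")}

rows = sorted(edges)               # sub-step 1: bind (?x, ?y) for every edge
rows = extend(rows, edges)         # sub-step 2: bind ?z with an edge ?y -> ?z
rows = close_cycle(rows, edges)    # sub-step 3: require an edge ?z -> ?x
```

Because each row still contains the binding of `?x` from the first sub-step, the third sub-step can discard `("a", "b", "d")` immediately, and the surviving rows are already the final answer.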

The invention adopts a communication method based on GPUDirect RDMA instead of the traditional communication method, which has the following problems:

(1) without GPUDirect RDMA support, transferring data in GPU memory between servers requires multiple memory copy operations, which increases the response time of the query request;

(2) data information and control information are transmitted together, so the control flow and the data flow are coupled and contend for the same network resources, which reduces the performance of the sending end.

Compared with the traditional communication method, the GPUDirect RDMA-based communication method has the following advantages:

1. the query intermediate result in the GPU memory of the server can be directly sent to the GPU memory of the remote server from the local GPU memory through a high performance network (RDMA), so that the data is prevented from being copied between the GPU memory and the CPU memory in the communication process, and the cost of the whole communication process is reduced;

2. the sending of data information (query intermediate result) and control information (metadata) of the query task is decoupled, the data information uses GPUDirect RDMA, the control information uses RDMA, and the contention of network resources is avoided.

Those skilled in the art will appreciate that, in addition to implementing the system and its various devices, modules, units provided by the present invention as pure computer readable program code, the system and its various devices, modules, units provided by the present invention can be fully implemented by logically programming method steps in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system and various devices, modules and units thereof provided by the invention can be regarded as a hardware component, and the devices, modules and units included in the system for realizing various functions can also be regarded as structures in the hardware component; means, modules, units for performing the various functions may also be regarded as structures within both software modules and hardware components for performing the method.

The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims

1. A query task communication method, comprising:

a query request step: sending the query request to a server in the cluster;

if the data depended on in the next substep is in a remote server, respectively sending the query intermediate result and the metadata of the query task to the remote server in a GPUDirect RDMA and RDMA mode, and continuing to process the query request by the remote server according to the received query intermediate result and the metadata of the query task;

the query processing step includes:

2. The query task communication method according to claim 1, wherein the registering step includes:

3. The query task communication method according to claim 1, wherein the query request step includes:

4. The query task communication method according to claim 1, wherein the step of parsing the query comprises:

5. The query task communication method according to claim 1, wherein the remote server copies the intermediate result from the GPU memory to another GPU memory after receiving the intermediate result of the query, and records a start address of the another GPU memory;

6. A query task communication system, comprising:

the query request module: sending the query request to a server in the cluster;

the query processing module comprises:

7. The query task communication system of claim 6, wherein the registration module comprises: loading a data set on servers in a cluster, carrying out initialization work, and respectively registering a GPU memory and a CPU memory in each server; the query request module comprises: initializing relevant data of the query task after the server receives the query request, emptying an intermediate result table and preparing for processing the query task;

8. The query task communication system according to claim 6, wherein the remote server copies the intermediate result from the GPU memory to another GPU memory after receiving the intermediate result of the query, and records a start address of the another GPU memory;