CN112181618A - Data transmission method and device, computer equipment and storage medium

Info

Publication number: CN112181618A
Application number: CN202011001056.0A
Authority: CN
Inventors: 王森
Original assignee: Beijing Kingsoft Cloud Network Technology Co Ltd
Current assignee: Beijing Kingsoft Cloud Network Technology Co Ltd
Priority date: 2020-09-22
Filing date: 2020-09-22
Publication date: 2021-01-05

Abstract

The embodiments of the disclosure relate to a data transmission method, a data transmission device, computer equipment and a storage medium. A message queue is established; when a Map task is completed, the output data of the Map task is added to the message queue, and the output data is transmitted to the corresponding Reduce task according to its order in the message queue. This realizes an asynchronous data transmission mode based on the MapReduce framework, improves the data processing efficiency of MapReduce, and alleviates the network congestion that the related art tends to cause.

Description

Data transmission method and device, computer equipment and storage medium

Technical Field

The embodiment of the disclosure relates to the technical field of big data, in particular to a data transmission method and device, computer equipment and a storage medium.

Background

MapReduce is a distributed computing framework that performs large-scale data processing and computation mainly through "Map" tasks and "Reduce" tasks.

The data transmission process between the Map task and the Reduce task is referred to as the "Shuffle" process in the related art. In this process, the output data of all Map tasks is written to the local disk at one time only after all Map tasks are completed, and a Reduce task can read the data from the disk and execute its corresponding task only after all the data has been written to the local disk.

However, in the related art, the Reduce task must wait for all Map tasks to be completed before execution, which causes a problem of low data processing efficiency, and the network is congested by writing a large amount of data into a local disk at one time.

Disclosure of Invention

In order to solve the technical problem or at least partially solve the technical problem, embodiments of the present disclosure provide a data transmission method, an apparatus, a computer device, and a storage medium.

A first aspect of the embodiments of the present disclosure provides a data transmission method, where the method includes:

establishing a message queue; in response to detecting that a Map task is completed, adding output data of the Map task into the message queue; and transmitting the output data to the corresponding Reduce task according to the sequence of the output data of the Map task in the message queue.

In one embodiment, the establishing a message queue includes:

establishing corresponding number of partitions based on the number of Reduce tasks; and establishing a message queue in each partition, and establishing a one-to-one correspondence between the message queue and the Reduce task.

In one embodiment, the output data of the Map task includes a field corresponding to the Map task and an execution result of the Map task; adding the output data of the Map task into the message queue, including:

determining a target message queue corresponding to the Map task according to the field corresponding to the Map task, wherein the target message queue is a message queue corresponding to a Reduce task for processing the Map task; and adding the output data of the Map task into the target message queue.

In an embodiment, the determining, according to the field corresponding to the Map task, a target message queue corresponding to the Map task includes:

calculating the remainder obtained by dividing the hash value of the field by the number of Reduce tasks; and determining a target message queue corresponding to the Map task based on the corresponding relation between the remainder and the message queue.

In one embodiment, before the adding the output data of the Map task to the message queue in response to the monitoring that the Map task is completed, the method further includes:

and establishing a Map task list based on all Map tasks, wherein the Map task list comprises the completion state of each Map task and the number of uncompleted tasks.

In one embodiment, after the adding the output data of the Map task to the message queue, the method further comprises:

and updating the number of the uncompleted tasks in the Map task list.

In one embodiment, after the transmitting the output data to the corresponding Reduce task, the method further comprises:

and closing all message queues in response to monitoring that the number of uncompleted tasks in the Map task list is 0.

In one embodiment, before transmitting the output data to the corresponding Reduce task, the method further comprises:

detecting whether the Reduce task is in an idle state or finishes running; transmitting the output data to a corresponding Reduce task, comprising:

and if the Reduce task is in an idle state or finishes running, transmitting the output data to the corresponding Reduce task.

A second aspect of the embodiments of the present disclosure provides a data transmission apparatus, including:

the first establishing module is used for establishing a message queue.

And the data adding module is used for adding the output data of the Map task into the message queue when the completion of the Map task is monitored.

And the data transmission module is used for transmitting the output data to the corresponding Reduce task according to the sequence of the output data of the Map task in the message queue.

In one embodiment, the first establishing module is configured to:

In one embodiment, the output data of the Map task includes a field corresponding to the Map task and an execution result of the Map task;

the data adding module comprises:

a determining unit, configured to determine, according to the field corresponding to the Map task, a target message queue corresponding to the Map task, where the target message queue is a message queue corresponding to a Reduce task for processing the Map task;

and the adding unit is used for adding the output data of the Map task into the target message queue.

In one embodiment, the determining unit is configured to:

In one embodiment, the apparatus further comprises:

and the second establishing module is used for establishing a Map task list based on all Map tasks before detecting that the Map tasks are completed and adding the output data of the Map tasks into the message queue, wherein the Map task list comprises the completion state of each Map task and the number of uncompleted tasks.

In one embodiment, the apparatus further comprises:

and the updating module is used for updating the number of uncompleted tasks in the Map task list after the output data of the Map task is added into the message queue.

In one embodiment, the apparatus further comprises:

and the monitoring module is used for closing all the message queues when the number of the tasks which are not finished in the Map task list is monitored to be 0.

In one embodiment, the apparatus further comprises:

the detection module is used for detecting whether the Reduce task is in an idle state or finishes running before the output data is transmitted to the corresponding Reduce task;

and the data transmission module is used for transmitting the output data to the corresponding Reduce task when the Reduce task is in an idle state or finishes running.

A third aspect of an embodiment of the present disclosure provides a computer device, including: a memory and a processor; wherein the memory has stored therein a computer program which, when executed by the processor, implements the method of the first aspect.

A fourth aspect of embodiments of the present disclosure provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of the first aspect described above.

Compared with the prior art, the technical scheme provided by the embodiment of the disclosure has the following advantages:

By establishing the message queue, each time a Map task is completed, the output data of that Map task is added to the message queue and transmitted to the corresponding Reduce task according to its order in the message queue. This realizes an asynchronous data transmission mode based on the MapReduce framework: the output data of a single completed Map task can be delivered through the message queue to the corresponding Reduce task for processing without waiting for all Map tasks to be completed, which saves the waiting time of the Reduce task and improves the data processing efficiency of MapReduce. In addition, because the output data of the Map tasks in the embodiments of the present disclosure waits to be sent in the transmission pipeline of the message queue, it is not necessary to write the output data of all Map tasks to the disk at one time as in the related art, so the network congestion caused by one-time writing can be avoided.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.

In order to more clearly illustrate the embodiments or technical solutions in the prior art of the present disclosure, the drawings used in the description of the embodiments or prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.

Fig. 1 is a schematic diagram of a data transmission method based on a MapReduce framework provided in the related art;

fig. 2 is a flowchart of a data transmission method provided by an embodiment of the present disclosure;

fig. 3 is a schematic diagram of a data transmission method based on a MapReduce framework according to an embodiment of the present disclosure;

fig. 4 is a flowchart of a method for establishing a message queue according to an embodiment of the present disclosure;

fig. 5 is a schematic diagram of a data transmission method based on a MapReduce framework according to an embodiment of the present disclosure;

fig. 6 is a flowchart of a method for establishing a message queue according to an embodiment of the present disclosure;

fig. 7 is a schematic structural diagram of a data transmission device according to an embodiment of the present disclosure;

fig. 8 is a schematic structural diagram of a computer device according to an embodiment of the present disclosure.

Detailed Description

In order that the above objects, features and advantages of the present disclosure may be more clearly understood, aspects of the present disclosure will be further described below. It should be noted that the embodiments and features of the embodiments of the present disclosure may be combined with each other without conflict.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced in other ways than those described herein; it is to be understood that the embodiments disclosed in the specification are only a few embodiments of the present disclosure, and not all embodiments.

Fig. 1 is a schematic diagram of a data transmission method based on a MapReduce framework provided in the related art, and as shown in fig. 1, N Map tasks and M Reduce tasks may be preset in the MapReduce framework, where the number N of the Map tasks may be the same as or different from the number M of the Reduce tasks, and the values of N and M are positive integers.

In the related art, Map tasks and Reduce tasks are generally used for data statistics. For example, when a data file is input into the MapReduce framework, the data file is first divided into a plurality of data fragments, and each Map task performs statistical processing on the data in one data fragment, for example, counting the occurrences of a certain word in the data fragment, or counting the number of characters in the data fragment. While the Map tasks are executing, no Reduce task performs data processing. When all Map tasks are completed, the output data of all Map tasks is written to the disk at one time, which avoids the additional port overhead of frequent disk write operations. After a Reduce task in MapReduce detects that all Map tasks are completed, it reads the output data of its corresponding Map tasks from the disk and then aggregates the output data of the one or more Map tasks, for example, summing the occurrence counts of a specific word reported by each Map task to obtain the total number of occurrences of the word in the data file, or summing the character counts reported by each Map task to obtain the total number of characters in the data file.
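The word-count example above can be illustrated with a minimal Python sketch. It is not taken from the disclosure: the function names `map_task` and `reduce_task` and the sample fragments are hypothetical, and the disk write is omitted. It only shows the related-art ordering, in which every Map result exists before the Reduce step reads anything.

```python
from collections import Counter


def map_task(fragment: str, word: str) -> int:
    """Count the occurrences of `word` in one data fragment."""
    return Counter(fragment.split())[word]


def reduce_task(partial_counts: list) -> int:
    """Sum the per-fragment counts into a total for the whole file."""
    return sum(partial_counts)


# Hypothetical data fragments obtained by slicing one input file.
fragments = ["a b a", "b a c", "c c a"]

# Related-art ordering: all Map tasks finish first, then Reduce runs once.
partials = [map_task(fragment, "a") for fragment in fragments]
total = reduce_task(partials)
```

In this sketch the Reduce step cannot start until the list `partials` is complete, which is the waiting behaviour the disclosure sets out to remove.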

In the related art shown in fig. 1, a write-once approach is adopted in order to reduce the number of write operations to the disk and reduce the port overhead. However, this approach must wait for all Map tasks to be completed before writing the output data of all Map tasks to the disk at one time; before that, the data of no Map task can be written to the disk, and all Reduce tasks must wait for all Map tasks to be completed before obtaining the corresponding data from the disk. This greatly increases the data processing time and reduces the data processing efficiency. Meanwhile, the amount of data written at one time is usually large, so the write operation often causes network congestion.

In order to solve the problems in the related art, the embodiment of the disclosure provides a data transmission scheme, where a message queue is established, and output data of each Map task is asynchronously transmitted to a Reduce task in the form of the message queue, so that the Reduce task can execute corresponding processing without waiting for completion of all Map tasks, thereby improving data processing efficiency and avoiding network congestion caused by one-time data access.

In order to clearly understand the technical solutions of the embodiments of the present disclosure, the following describes the technical solutions of the embodiments of the present disclosure with reference to the exemplary embodiments.

Fig. 2 is a flowchart of a data transmission method provided by an embodiment of the present disclosure, where the method is applied to a MapReduce framework. As shown in fig. 2, the method includes:

step 201, establishing a message queue.

Step 202, in response to the fact that the Map task is monitored to be completed, adding the output data of the Map task into a message queue.

Step 203, transmitting the output data to the corresponding Reduce task according to the sequence of the output data of the Map task in the message queue.

The "message queue" in this embodiment is a container for holding messages during the transmission of the messages. And the data in the message queue are sequentially sent according to the sequence of the data in the message queue.
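Steps 201 to 203 can be sketched as follows. This is a single-process illustration under stated assumptions, not the implementation of the disclosure: Python's standard `queue.Queue` stands in for the message queue, and `transmit`, `map_fn` and `on_reduce` are hypothetical names.

```python
import queue


def transmit(fragments, map_fn, on_reduce):
    """Run Map tasks one by one and forward each result as soon as it exists."""
    mq = queue.Queue()                 # Step 201: establish the message queue
    for fragment in fragments:
        output = map_fn(fragment)      # one Map task runs to completion
        mq.put(output)                 # Step 202: add its output to the queue
        while not mq.empty():          # Step 203: deliver in queue (FIFO) order
            on_reduce(mq.get())


# Hypothetical usage: the Map task counts words, the Reduce side records them.
received = []
transmit(["a b", "c d e", "f"], lambda s: len(s.split()), received.append)
```

Each output reaches the Reduce side right after its own Map task finishes, in the order in which it entered the queue.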

For example, fig. 3 is a schematic diagram of a data transmission method based on a MapReduce framework provided by the embodiment of the present disclosure, where in fig. 3, the data transmission method includes N Map tasks and M Reduce tasks, and there is a correspondence between the Map tasks and the Reduce tasks, that is, output data of each Map task is processed by a pre-specified Reduce task, and one Reduce task may correspondingly process output data of one or more Map tasks, where N and M are positive integers.

As shown in fig. 3, in the MapReduce framework, a data file input to the MapReduce framework is first subjected to data slicing to obtain a plurality of data slices. Specifically, the number of data slices is determined by the data size of the data file itself. For example, if the data file is 1280M in total and the upper limit of the data size of each slice is set to 128M, then during data slicing every 128M of data is divided into one data slice, and 10 data slices are obtained in total. Of course, this is merely an example and is not the only limitation on the data slicing method in the embodiments of the present disclosure.
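The slicing arithmetic in this example can be checked with a short sketch. The helper name `slice_count` is hypothetical, and rounding up for a file that is not an exact multiple of the slice limit is an assumption, since the text only gives the evenly divisible case.

```python
import math


def slice_count(file_size_mb: int, slice_limit_mb: int) -> int:
    """Number of slices when each slice holds at most `slice_limit_mb`."""
    # Assumption: a final partial slice still counts as one slice.
    return math.ceil(file_size_mb / slice_limit_mb)


slices = slice_count(1280, 128)  # the 1280M / 128M case from the text
```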

In practical cases, the number of Map tasks is generally set to be greater than or equal to the number of data slices, and the number of data slices is configured to be equal to the number N of Map tasks as illustrated in fig. 3. In this embodiment one Map task processes one data slice at a time. The output data of the Map task includes a field key corresponding to the Map task and an execution result value of the Map task (for example, the number of certain words counted by the Map task from the data fragments, or the number of characters in the data fragments, etc.), where the field key corresponding to the Map task is a preset value used for indicating which Reduce task specifically processes the output data of the Map task.

And after the Map task is executed, adding output data of the task into the message queue, wherein the output data is sent in the message queue according to the sequence of the output data in the message queue. When the sending operation is executed, the output data can be sent to the Reduce task corresponding to the key value according to the specific value of the key in the output data. Therefore, the Reduce task can obtain data and execute corresponding processing operation without waiting for all Map tasks to be completed.

Alternatively, in an embodiment, before the output data is sent to the corresponding Reduce task, the running state of the Reduce task may be monitored to determine whether the Reduce task is in an idle state or has finished running. If so, the output data may be sent to the Reduce task directly; otherwise, the output data is sent once the Reduce task becomes idle or finishes running. In this way, the output data can be reliably received, and the Reduce task is prevented from missing the output data.
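The state check described above might look like the following sketch. The `State` enum, the `ReduceTask` class and the `try_send` helper are all hypothetical; the disclosure does not specify how the running state is represented or polled.

```python
from collections import deque
from enum import Enum


class State(Enum):
    IDLE = "idle"
    BUSY = "busy"
    FINISHED = "finished"


class ReduceTask:
    """Hypothetical Reduce task that exposes its running state."""

    def __init__(self):
        self.state = State.IDLE
        self.received = []


def try_send(pending: deque, task: ReduceTask) -> int:
    """Send queued outputs in order while the task can accept them.

    Returns the number of outputs sent; anything left stays queued.
    """
    sent = 0
    while pending and task.state in (State.IDLE, State.FINISHED):
        task.received.append(pending.popleft())
        sent += 1
    return sent
```

Outputs that cannot be sent stay in the queue in their original order, so a later call to `try_send` delivers them once the task is ready.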

This embodiment establishes the message queue so that, each time a Map task is completed, the output data of that Map task is added to the message queue and transmitted to the corresponding Reduce task according to its order in the message queue. This realizes an asynchronous data transmission mode based on the MapReduce framework: the output data of a single completed Map task can be delivered through the message queue to the corresponding Reduce task for processing without waiting for all Map tasks to be completed, which saves the waiting time of the Reduce task and improves the data processing efficiency of MapReduce. In addition, in this embodiment the output data of the Map tasks waits to be sent in the transmission pipeline of the message queue, and it is not necessary to write the output data of all Map tasks to the disk at one time as in the related art, so the network congestion caused by one-time writing can be avoided.

Fig. 4 is a flowchart of a method for establishing a message queue according to an embodiment of the present disclosure. As shown in fig. 4, the method includes the steps of:

step 401, based on the number of Reduce tasks, establishing a corresponding number of partitions.

Step 402, establishing a message queue in each partition, and establishing a one-to-one correspondence between the message queue and the Reduce task.

The partition referred to in this embodiment may be exemplarily understood as a Kafka partition obtained based on the Kafka technology, but is not limited to a Kafka partition. Kafka, as employed in this embodiment, is a high-throughput distributed publish-subscribe messaging system that can handle the action stream data of consumers.

The Kafka method adopted in this embodiment is only one of many methods that can be used for establishing a message queue, and in other implementation methods, other methods in the related art may be adopted to establish a message queue in the embodiment of the present disclosure, where Kafka is taken as an example for description, and a method for establishing a message queue in other methods may refer to the Kafka method, and details thereof are not repeated here.
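Steps 401 and 402 can be sketched without any messaging system. In the sketch below an in-memory `deque` stands in for a partition's message queue; this is an assumption made for illustration only, no Kafka API is used, and the helper name `establish_queues` is hypothetical.

```python
from collections import deque


def establish_queues(num_reduce_tasks: int) -> dict:
    """Create one partition per Reduce task, each holding one message queue."""
    if num_reduce_tasks <= 0:
        raise ValueError("at least one Reduce task is required")
    # Step 401: one partition per Reduce task.
    # Step 402: one queue per partition, keyed by the Reduce task index,
    # which gives the one-to-one correspondence.
    return {reduce_id: deque() for reduce_id in range(num_reduce_tasks)}


queues = establish_queues(3)
```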

Fig. 5 is a schematic diagram of a data transmission method based on a MapReduce framework according to an embodiment of the present disclosure. Fig. 5 includes N Map tasks and M Reduce tasks. A corresponding message queue is established for each Reduce task, so that the message queues and the Reduce tasks are in one-to-one correspondence. In this embodiment, each message queue has a unique identifier. For example, in an embodiment, the identifier of the message queue may be [I, R_position], where I is an index of the message queue and R_position is used to determine the correspondence between a Map task and the message queue.

For example, in an embodiment, R_position may be the hash value of the key. After the output data of the Map task is obtained, the hash value of the key in the output data is calculated, and the message queue whose identifier includes that hash value is determined as the target message queue corresponding to the Map task. For another example, in another embodiment, R_position may be the key itself; in this case, the key output by the Map task is compared directly with the R_position of each message queue, and the message queue whose R_position matches is determined as the target message queue corresponding to the Map task. For another example, in yet another embodiment, R_position may be the remainder obtained by dividing the hash value of the key by the number of Reduce tasks; in this case, after the output data of the Map task is obtained, that remainder is calculated for the key in the output data, and the target message queue corresponding to the Map task is determined according to the correspondence between the remainder and the message queue (the remainder corresponds to the message queue whose identifier includes the remainder).

That is to say, in this embodiment, the target message queue corresponding to the Map task may be determined according to the field key corresponding to the Map task, and the output data of the Map task is added to the target message queue for transmission, where the target message queue is the message queue corresponding to the Reduce task that processes the output data of the Map task.
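The remainder-based variant can be sketched as follows. The disclosure does not name a hash function, so `zlib.crc32` is used here purely as an assumption (Python's built-in `hash()` for strings varies between processes, which would make the routing unstable). The helper names `target_queue_index` and `enqueue` are hypothetical, and the list position of each queue plays the role of the remainder in its identifier.

```python
import zlib
from collections import deque


def target_queue_index(key: str, num_reduce_tasks: int) -> int:
    """Remainder of the key's hash divided by the number of Reduce tasks."""
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks


def enqueue(queues: list, key: str, value) -> int:
    """Add (key, value) to the queue selected by the key and return its index."""
    index = target_queue_index(key, len(queues))
    queues[index].append((key, value))
    return index


# Hypothetical setup: 4 Reduce tasks, hence 4 message queues.
queues = [deque() for _ in range(4)]
first = enqueue(queues, "apple", 3)
second = enqueue(queues, "apple", 5)
```

Because the index depends only on the key and the number of Reduce tasks, outputs that share a key always land in the same queue and therefore reach the same Reduce task.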

Of course, the above description of R_position in this embodiment is only an example, and is not the only limitation to the embodiments of the present disclosure.

In this embodiment, a corresponding message queue is established for each Reduce task, so that data can be processed in parallel between the Reduce tasks in the MapReduce framework, waiting is not required between the Reduce tasks, and the data processing efficiency of the MapReduce framework is further improved.

Fig. 6 is a flowchart of a method for establishing a message queue according to an embodiment of the present disclosure. As shown in fig. 6, the method provided by this embodiment includes the following steps:

step 601, establishing a message queue, and establishing a Map task list based on all Map tasks, wherein the Map task list comprises the completion state of each Map task and the number of uncompleted tasks.

Step 602, in response to the Map task completion being monitored, adding the output data of the Map task into the message queue.

Step 603, updating the number of uncompleted tasks in the Map task list.

Step 604, transmitting the output data to the corresponding Reduce task according to the sequence of the output data of the Map task in the message queue.

The sequence between step 603 and step 604 may be arbitrary.

Step 605, in response to monitoring that the number of uncompleted tasks in the Map task list is 0, closing all the message queues.

In the embodiment, the Map task list is established, and the completion conditions of all Map tasks are maintained through the Map task list, so that the completion conditions of all Map tasks can be accurately obtained, all message queues can be closed in time when all Map tasks are completed, and the influence on subsequent tasks is avoided.
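The bookkeeping in steps 601 to 605 can be sketched as follows. The classes `MessageQueue` and `MapTaskList` and the two helper functions are hypothetical, and transmission to the Reduce task (step 604) is left out so that the sketch stays focused on the task list.

```python
from collections import deque


class MessageQueue:
    """Hypothetical stand-in for one message queue."""

    def __init__(self):
        self.items = deque()
        self.closed = False


class MapTaskList:
    """Completion state of each Map task plus the uncompleted count."""

    def __init__(self, task_ids):
        self.completed = {task_id: False for task_id in task_ids}
        self.uncompleted = len(self.completed)

    def mark_completed(self, task_id) -> int:
        # Decrement only once per task, even if completion is reported twice.
        if not self.completed[task_id]:
            self.completed[task_id] = True
            self.uncompleted -= 1
        return self.uncompleted


def on_map_completed(task_id, output, mq, task_list) -> int:
    mq.items.append(output)                    # Step 602
    return task_list.mark_completed(task_id)   # Step 603


def close_if_all_done(task_list, queues) -> bool:
    if task_list.uncompleted != 0:
        return False
    for mq in queues:
        mq.closed = True                       # Step 605
    return True
```

In a fuller version the queues would be drained to their Reduce tasks before being closed; that ordering is an assumption about intent rather than something the sketch enforces.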

Fig. 7 is a schematic structural diagram of a data transmission device according to an embodiment of the present disclosure, and as shown in fig. 7, the data transmission device 70 includes:

a first establishing module 71, configured to establish a message queue.

And a data adding module 72, configured to add output data of the Map task to the message queue when it is monitored that the Map task is completed.

And the data transmission module 73 is used for transmitting the output data to the corresponding Reduce task according to the sequence of the output data in the message queue.

In a possible implementation, the first establishing module 71 may be configured to:

In a possible implementation manner, the output data of the Map task includes a field corresponding to the Map task and an execution result of the Map task; a data adding module 72 comprising:

and the determining unit is used for determining a target message queue corresponding to the Map task according to the field corresponding to the Map task, wherein the target message queue is a message queue corresponding to a Reduce task for processing the Map task.

In a possible implementation, the determining unit is configured to:

calculating the remainder obtained by dividing the hash value of the field by the number of Reduce tasks;

and determining a target message queue corresponding to the Map task based on the corresponding relation between the remainder and the message queue.

In a possible embodiment, the apparatus further comprises:

and the data transmission module is used for transmitting the output data to the corresponding Reduce task when the Reduce task is in an idle state or finishes running.

The apparatus provided in this embodiment can execute the method in any one of the embodiments in fig. 2 to fig. 6, and the execution manner and the beneficial effects are similar, and are not described herein again.

Fig. 8 is a schematic structural diagram of a computer device provided in an embodiment of the present disclosure, and as shown in fig. 8, the computer device 80 includes: a memory 81 and a processor 82;

the memory 81 stores a computer program, and when the computer program is executed by the processor 82, the method of any one of the embodiments in fig. 2 to 6 is implemented, and the execution mode and the beneficial effect are similar, which are not described herein again.

The embodiments of the present disclosure further provide a computer-readable storage medium, where a computer program is stored on the medium, and when the computer program is executed by a processor, the method in any of the embodiments in fig. 2 to fig. 6 is implemented, and the execution manner and the beneficial effect are similar, and are not described herein again.

It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The foregoing are merely exemplary embodiments of the present disclosure, which enable those skilled in the art to understand or practice the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method of data transmission, comprising:

establishing a message queue;

in response to the fact that the Map task is monitored to be completed, adding output data of the Map task into the message queue;

and transmitting the output data to a corresponding Reduce task according to the sequence of the output data in the message queue.

2. The method of claim 1, wherein the establishing a message queue comprises:

establishing corresponding number of partitions based on the number of Reduce tasks;

and establishing a message queue in each partition, and establishing a one-to-one correspondence between the message queue and the Reduce task.

3. The method according to claim 2, wherein the output data of the Map task comprises a field corresponding to the Map task and an execution result of the Map task;

adding the output data of the Map task into the message queue, including:

determining a target message queue corresponding to the Map task according to the field corresponding to the Map task, wherein the target message queue is a message queue corresponding to a Reduce task for processing the Map task;

and adding the output data of the Map task into the target message queue.

4. The method according to claim 3, wherein the determining a target message queue corresponding to the Map task according to the field corresponding to the Map task comprises:

calculating the remainder of the quotient of the hash value of the field and the Reduce task number;

5. The method of claim 1, wherein prior to adding output data of the Map task to the message queue in response to monitoring completion of the Map task, the method further comprises:

6. The method of claim 5, wherein after the adding the output data of the Map task to the message queue, the method further comprises:

and updating the number of the uncompleted tasks in the Map task list.

7. The method of claim 6, wherein after said transmitting said output data to a corresponding Reduce task, said method further comprises:

8. The method of any of claims 1-7, wherein prior to said transmitting said output data to a corresponding Reduce task, said method further comprises:

detecting whether the Reduce task is in an idle state or finishes running;

transmitting the output data to a corresponding Reduce task, comprising:

and if the Reduce task is in an idle state or finishes running, transmitting the output data to the corresponding Reduce task.

9. A data transmission apparatus, comprising:

the first establishing module is used for establishing a message queue;

the data adding module is used for adding the output data of the Map task into the message queue when the completion of the Map task is monitored;

and the data transmission module is used for transmitting the output data to the corresponding Reduce task according to the sequencing of the output data in the message queue.

10. The apparatus of claim 9, wherein the first establishing module is configured to:

11. The apparatus according to claim 10, wherein the output data of the Map task includes a field corresponding to the Map task and an execution result of the Map task;

the data adding module comprises:

12. The apparatus of claim 11, wherein the determining unit is configured to:

13. The apparatus of claim 9, further comprising:

14. The apparatus of claim 13, further comprising:

15. The apparatus of claim 14, further comprising:

16. The apparatus according to any one of claims 9-15, further comprising:

17. A computer device, comprising:

a memory and a processor;

wherein the memory has stored therein a computer program which, when executed by the processor, implements the method of any one of claims 1-8.

18. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-8.