CN113761548A

CN113761548A - Data transmission method and device for Shuffle process

Info

Publication number: CN113761548A
Application number: CN202010536544.5A
Authority: CN
Inventors: 王文生; 石磊; 吴雪扬
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Wodong Tianjun Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Wodong Tianjun Information Technology Co Ltd
Priority date: 2020-06-12
Filing date: 2020-06-12
Publication date: 2021-12-07
Anticipated expiration: 2040-06-12
Also published as: CN113761548B

Abstract

The embodiment of the disclosure discloses a data transmission method and device for a Shuffle process. One embodiment of the method comprises: acquiring the number of partitions included in the data set before the Shuffle process and the data set after the Shuffle process, wherein the data set before the Shuffle process includes at least one partition; acquiring the number of target service components; determining the corresponding relation between each target service component and the partition of the data set after the Shuffle process according to the number of the target service components; and respectively sending the data of each partition in the data set before the Shuffle process to a target service component corresponding to the data set after the Shuffle process. The implementation mode realizes the decoupling of the Spark calculation process and the storage process, effectively avoids the re-execution of the Shuffle process caused by the local fault of the node, and improves the stability and the performance of Spark.

Description

Data transmission method and device for Shuffle process

Technical Field

The embodiment of the disclosure relates to the technical field of computers, in particular to a data transmission method and device for a Shuffle process.

Background

With the rapid development of computer technology, general-purpose computing engines suitable for large-scale data processing have emerged. In the prior art, the Spark computing engine is usually adopted to convert RDDs (Resilient Distributed Datasets) into one another according to their dependency relationships. When the dependency between the RDDs before and after a conversion is a wide dependency, a Shuffle must be performed on the RDDs to complete the conversion.

Since the partitions of a Spark RDD are usually distributed across different computing nodes, a node failure (including a network failure, a hard disk failure, etc.) may trigger a very serious error in Spark (FetchFailed), which causes the Shuffle process to be re-executed and consumes additional computing resources and time. Moreover, the number of random reads in the Shuffle Read process depends on the respective partition counts of the two RDDs before and after the Shuffle process. If the target RDD (the RDD formed after the Shuffle) has too many partitions, the read time is often too long because of the large number of reads; if the target RDD has too few partitions, an OOM (out-of-memory) error easily occurs because each partition of the target RDD has to process too much data.

Disclosure of Invention

The embodiment of the disclosure provides a data transmission method and device for a Shuffle process.

In a first aspect, an embodiment of the present disclosure provides a data transmission method for a Shuffle process, where the method includes: acquiring the number of partitions included in a data set before a Shuffle process and a data set after the Shuffle process, wherein the data set before the Shuffle process includes at least one partition; acquiring the number of target service components; determining the corresponding relation between each target service component and the partition of the data set after the Shuffle process according to the number of the target service components; and respectively sending the data of each partition in the data set before the Shuffle process to the target service component corresponding to the data set after the Shuffle process.

In some embodiments, the determining, according to the number of the target service components, a correspondence between each target service component and a partition of the data set after the Shuffle process includes: acquiring the identifier of the partition of the data set after the Shuffle process; determining a hash value corresponding to the identifier of the partition of the data set after the Shuffle process according to a preset hash function; and determining the corresponding relation between each target service component and the partition of the data set after the Shuffle process according to the hash value.

In some embodiments, the preset hash function includes taking a modulo with respect to the number of target service components.

In some embodiments, the method further comprises: sending, to the corresponding target service component, the identifier of the partition of the data set after the Shuffle process to which the sent data belongs.

In a second aspect, an embodiment of the present disclosure provides a data transmission apparatus for a Shuffle process, including: a first acquisition unit configured to acquire the number of partitions included in a data set before a Shuffle process and a data set after the Shuffle process, wherein the data set before the Shuffle process comprises at least one partition; a second acquisition unit configured to acquire the number of target service components; a determining unit configured to determine the corresponding relation between each target service component and the partition of the data set after the Shuffle process according to the number of the target service components; and a first sending unit configured to send the data of each partition in the data set before the Shuffle process to the target service component corresponding to the data set after the Shuffle process respectively.

In some embodiments, the determining unit includes: an obtaining module configured to obtain the identifier of the partition of the data set after the Shuffle process; a first determining module configured to determine a hash value corresponding to the identifier of the partition of the data set after the Shuffle process according to a preset hash function; and a second determining module configured to determine the corresponding relation between each target service component and the partition of the data set after the Shuffle process according to the hash value.

In some embodiments, the preset hash function may include taking a modulo with respect to the number of target service components.

In some embodiments, the apparatus further comprises: a second sending unit configured to send, to the corresponding target service component, the identifier of the partition of the data set after the Shuffle process to which the sent data belongs.

In a third aspect, an embodiment of the present application provides a data transmission system for a Shuffle process, where the system includes: a data write end configured to perform the method as described in any implementation of the first aspect; and a target service component configured to, in response to receiving data sent by the data write end, acquire the identifier of the partition of the data set after the Shuffle process to which the received data belongs, and write the received data into a target storage system according to the acquired identifier of the partition, wherein the target storage system comprises a data file whose identifier is consistent with the identifier of the partition of the data set after the Shuffle process.

In some embodiments, the system further comprises: a data reading end configured to read data from the target storage system and generate the data set after the Shuffle process, wherein the identifier of the partition to which data in the data set after the Shuffle process belongs is consistent with the identifier of the file in which that data is stored in the target storage system.

In some embodiments, the target storage system comprises a Hadoop distributed file system.

In a fourth aspect, an embodiment of the present application provides a server, where the server includes: one or more processors; a storage device having one or more programs stored thereon; when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method as described in any implementation of the first aspect.

In a fifth aspect, the present application provides a computer-readable medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method as described in any implementation manner of the first aspect.

According to the data transmission method and device for the Shuffle process, the number of partitions included in the data set before the Shuffle process and in the data set after the Shuffle process is first obtained, wherein the data set before the Shuffle process comprises at least one partition. Then, the number of target service components is obtained. Next, the corresponding relation between each target service component and the partition of the data set after the Shuffle process is determined according to the number of the target service components. Finally, the data of each partition in the data set before the Shuffle process is sent to the target service component corresponding to the data set after the Shuffle process. In this way, the Spark calculation process is decoupled from the storage process, re-execution of the Shuffle process caused by a local fault of a node is effectively avoided, and the stability and performance of Spark are improved.

Drawings

Other features, objects and advantages of the disclosure will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:

FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present disclosure may be applied;

FIG. 2 is a flow diagram of one embodiment of a data transmission method for a Shuffle process according to the present disclosure;

FIG. 3 is a schematic diagram of an application scenario of a data transmission method for the Shuffle process according to an embodiment of the present disclosure;

FIG. 4 is a flow chart of yet another embodiment of a data transmission method for a Shuffle process according to the present disclosure;

FIG. 5 is a schematic structural diagram of one embodiment of a data transmission device for the Shuffle process according to the present disclosure;

FIG. 6 is a timing diagram of the interaction between various devices in one embodiment of a data transmission system for the Shuffle process in accordance with the present disclosure;

FIG. 7 is a schematic structural diagram of an electronic device suitable for use in implementing embodiments of the present disclosure.

Detailed Description

The present disclosure is described in further detail below with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.

It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.

Fig. 1 shows an exemplary architecture 100 to which the data transmission method for the Shuffle process or the data transmission apparatus for the Shuffle process of the present disclosure may be applied.

As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, a server cluster 105 including servers 1051, 1052, 1053, and a server 106. The network 104 serves to provide a medium of communication links between the terminal devices 101, 102, 103 and the server cluster 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables.

The terminal devices 101, 102, 103 may interact with the server cluster 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have installed thereon various communication client applications, such as a web browser application, a shopping-type application, a search-type application, an instant messaging tool, a mailbox client, and the like.

The terminal devices 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices having a display screen and supporting data transmission, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they can be installed in the electronic devices listed above. They may be implemented as multiple pieces of software or software modules (e.g., software or software modules used to provide distributed services) or as a single piece of software or software module. This is not particularly limited herein.

The server cluster 105 may comprise servers providing various services, for example Spark servers providing support for applications running on the terminal devices 101, 102, 103. The Spark task scheduling server 1051 may generate Map (mapping) tasks and Reduce (reduction) tasks from the received data sets. The Map task and Reduce task may be performed by the server 1052 and the server 1053, respectively. Server 1052 may process the received data set and send the processed data to server 106. The server 106 may typically run a target service component.

The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or may be implemented as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., software or software modules used to provide distributed services), or as a single piece of software or software module. This is not particularly limited herein.

It should be noted that the data transmission method for the Shuffle process provided by the embodiment of the present disclosure is generally executed by the server 1052, and accordingly, the data transmission device for the Shuffle process is generally disposed in the server 1052.

It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

With continued reference to fig. 2, a flow 200 of one embodiment of a data transmission method for the Shuffle process in accordance with the present disclosure is shown. The data transmission method for the Shuffle process comprises the following steps:

Step 201, acquiring the number of partitions included in the data set before the Shuffle process and the data set after the Shuffle process.

In this embodiment, an execution subject (e.g., the server 1052 shown in fig. 1) of the data transmission method for the Shuffle process may acquire the number of partitions included in the data set before the Shuffle process and the data set after the Shuffle process in a wired connection manner or a wireless connection manner. The data set before the Shuffle process may include a parent RDD. The data set after the Shuffle process described above may include a child RDD. The data set before the Shuffle process may include at least one partition.

Specifically, the execution subject may acquire a data set before the Shuffle process that is stored locally in advance, or may acquire the data set before the Shuffle process from an electronic device (for example, a terminal device) communicatively connected to the execution subject. The execution subject may obtain the number of partitions included in the data set after the Shuffle process from the task scheduler of Spark (e.g., the server 1051 shown in fig. 1).

Step 202, obtain the number of target service components.

In this embodiment, the execution subject may obtain the number of the target service components through a wired connection manner or a wireless connection manner. The target service component may be configured to store the received data in a manner consistent with a partition of the data set after the Shuffle process. The number of target service components may be preconfigured.

Step 203, determining the corresponding relation between each target service component and the partition of the data set after the Shuffle process according to the number of the target service components.

In this embodiment, according to the number of the target service components acquired in step 202, the execution subject may determine, in various ways, a corresponding relationship between each target service component and a partition of the data set after the Shuffle process. As an example, the execution subject may determine the correspondence between each target service component and a partition of the data set after the Shuffle process according to a pre-configured correspondence table. As another example, the execution subject may also randomly determine the correspondence between each target service component and a partition of the data set after the Shuffle process. Generally, each partition of the data set after the Shuffle process corresponds to only one target service component, that is, the partitions corresponding to different target service components do not overlap with each other.
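The random variant mentioned above can be illustrated with a minimal sketch. It assumes partitions are identified by integers and target service components by short strings; the function name `assign_randomly`, the component names, and the fixed seed are hypothetical and are not taken from the disclosure. Because the result is a dictionary keyed by partition identifier, each partition of the data set after the Shuffle process maps to exactly one component, so the partition sets of different components cannot overlap.

```python
import random


def assign_randomly(partition_ids, components, seed=None):
    """Map every post-Shuffle partition to exactly one target service component."""
    rng = random.Random(seed)  # the seed is used here only to make the illustration repeatable
    return {pid: rng.choice(components) for pid in partition_ids}


# Four post-Shuffle partitions, two hypothetical target service components.
mapping = assign_randomly([1, 2, 3, 4], ["A", "B"], seed=0)
# Each partition identifier appears once as a key, so no partition is shared.
```

A pre-configured correspondence table would simply be such a dictionary written out by hand instead of generated.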

In some optional implementation manners of this embodiment, the execution subject may further determine a correspondence relationship between each target service component and a partition of the data set after the Shuffle process by:

First, acquiring the identifier of the partition of the data set after the Shuffle process.

In these implementations, the execution subject may obtain the identifier of the partition of the data set after the Shuffle process in a wired connection manner or a wireless connection manner. The identifier of the partition may include various forms, such as numbers, letters, and the like.

Second, determining a hash value corresponding to the identifier of the partition of the data set after the Shuffle process according to a preset hash function.

In these implementations, the execution subject may determine, according to a preset hash function, a hash value corresponding to an identifier of a partition of the data set after the Shuffle process.

Optionally, when the identifier of the partition is in numeric form, the preset hash function may include taking the identifier modulo the number of target service components. As an example, the partition identifiers of the child RDD may be 1, 2, 3, and 4, and the number of the target service components may be 2. The preset hash function may then be modulo 2, and the hash values corresponding to the partition identifiers of the child RDD are 1, 0, 1, and 0, respectively.

Third, determining the corresponding relation between each target service component and the partition of the data set after the Shuffle process according to the hash value.

In these implementations, according to the hash value determined in the second step, the execution subject may determine, in various ways, a correspondence between each target service component and a partition of the data set after the Shuffle process. By way of example, partitions whose resulting hash values are consistent may be assigned to the same target service component.
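The three steps above can be sketched as follows, assuming numeric partition identifiers and a plain modulo as the preset hash function. The function names `partition_to_component` and `group_by_hash` are illustrative only; the disclosure does not prescribe them.

```python
from collections import defaultdict


def partition_to_component(partition_id: int, num_components: int) -> int:
    """Hash value of a numeric partition identifier: identifier modulo component count."""
    return partition_id % num_components


def group_by_hash(partition_ids, num_components):
    """Group partitions whose hash values coincide under the same component index."""
    groups = defaultdict(list)
    for pid in partition_ids:
        groups[partition_to_component(pid, num_components)].append(pid)
    return dict(groups)


# Partition identifiers 1..4 of the child RDD, two target service components.
groups = group_by_hash([1, 2, 3, 4], 2)
# groups == {1: [1, 3], 0: [2, 4]}
```

With these inputs, partitions 1 and 3 share hash value 1 and partitions 2 and 4 share hash value 0, matching the example in the text.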

Step 204, respectively sending the data of each partition in the data set before the Shuffle process to the target service component corresponding to the data set after the Shuffle process.

In this embodiment, the execution subject may send the data of each partition in the data set before the Shuffle process to the target service component corresponding to the data set after the Shuffle process in various manners. As an example, the data set before the Shuffle process has 6 partitions and the data set after the Shuffle process has 4 partitions; the 1st and 3rd partitions of the data set after the Shuffle process correspond to target service component A, and the 2nd and 4th partitions correspond to target service component B. Then, from each partition of the data set before the Shuffle process, the execution subject may send the data that corresponds to the 1st and 3rd partitions of the data set after the Shuffle process to target service component A, and send the data that corresponds to the 2nd and 4th partitions to target service component B.
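The 6-to-4 example can be sketched as below. The sketch assumes each partition of the data set before the Shuffle process has already been bucketed by destination partition and is represented as a dictionary; the function name `route`, the record strings, and the in-memory `outbox` standing in for network sends are all hypothetical.

```python
def route(pre_partitions, partition_to_component):
    """Collect, per target service component, the data destined for its post-Shuffle partitions.

    pre_partitions: one dict {post_partition_id: [records]} per pre-Shuffle partition.
    """
    outbox = {}
    for buckets in pre_partitions:
        for post_id, records in buckets.items():
            component = partition_to_component[post_id]
            outbox.setdefault(component, []).extend(records)
    return outbox


# Post-Shuffle partitions 1 and 3 belong to component A, 2 and 4 to component B.
correspondence = {1: "A", 3: "A", 2: "B", 4: "B"}
# Six pre-Shuffle partitions, each holding one made-up record per post-Shuffle partition.
pre = [{post: [f"p{i}-r{post}"] for post in (1, 2, 3, 4)} for i in range(1, 7)]
outbox = route(pre, correspondence)
```

Here every component receives data from all six source partitions, but only for the destination partitions assigned to it.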

With continued reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of a data transmission method for the Shuffle process according to an embodiment of the present disclosure. In the application scenario of fig. 3, in the Map phase of Spark, the server (administrator) may obtain the numbers of partitions (e.g., 3 and 2, respectively) included in RDD1 (i.e., the parent RDD, such as 301 shown in fig. 3) and RDD2 (i.e., the child RDD, such as 302 shown in fig. 3). RDD1 includes partition 1, partition 2, and partition 3. Partition 1 of RDD1 includes data set 3011 corresponding to partition 1 of RDD2 and data set 3012 corresponding to partition 2 of RDD2. Partition 2 of RDD1 includes data set 3013 corresponding to partition 1 of RDD2 and data set 3014 corresponding to partition 2 of RDD2. Partition 3 of RDD1 includes data set 3015 corresponding to partition 1 of RDD2 and data set 3016 corresponding to partition 2 of RDD2. The server may determine the number of target service components (e.g., 2) from target service component A (e.g., 3031 shown in fig. 3) and target service component B (e.g., 3032 shown in fig. 3). Thereafter, the server may determine that target service component A and target service component B correspond to partition 1 and partition 2 of RDD2, respectively. Then, the server may send the data sets 3011, 3013, and 3015, which correspond to partition 1 of RDD2 in the partitions of RDD1, to target service component A, and send the data sets 3012, 3014, and 3016, which correspond to partition 2 of RDD2 in the partitions of RDD1, to target service component B.

At present, the prior art generally writes the generated calculation result files to the local disk during the Shuffle Write process, so that a failure of the executing node triggers re-execution of the Shuffle. In the method provided by the embodiment of the disclosure, the data of each partition in the data set before the Shuffle process is respectively sent to the target service component corresponding to the data set after the Shuffle process, so that the Spark calculation process is decoupled from the storage process. Therefore, the method effectively avoids re-execution of the Shuffle process caused by a local fault of a node, and improves the stability and performance of Spark.

With further reference to fig. 4, a flow 400 of yet another embodiment of a data transmission method for the Shuffle process is shown. The flow 400 of the data transmission method for the Shuffle process includes the following steps:

Step 401, acquiring the number of partitions included in the data set before the Shuffle process and the data set after the Shuffle process.

Step 402, obtaining the number of target service components.

Step 403, determining the corresponding relationship between each target service component and the partition of the data set after the Shuffle process according to the number of the target service components.

Step 404, respectively sending the data of each partition in the data set before the Shuffle process to the target service component corresponding to the data set after the Shuffle process.

Step 401, step 402, step 403, and step 404 are respectively consistent with step 201, step 202, step 203, step 204, and optional implementations thereof in the foregoing embodiment, and the above description of step 201, step 202, step 203, step 204, and optional implementations thereof also applies to step 401, step 402, step 403, and step 404, which is not described herein again.

Step 405, sending the identifier of the partition of the data set after the Shuffle process to which the sent data belongs to the corresponding target service component.

In this embodiment, the execution subject (for example, the server 1052 shown in fig. 1) of the data transmission method for the Shuffle process may further send the identifier of the partition of the data set after the Shuffle process to which the sent data belongs to the corresponding target service component. For example, referring to step 204 in the foregoing embodiment, the execution subject may further send, to target service component A, information indicating that the sent data belongs to the 1st and 3rd partitions of the child RDD, and send, to target service component B, information indicating that the sent data belongs to the 2nd and 4th partitions of the child RDD.
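One way to picture step 405 is to attach the partition identifier to each chunk of data that is sent. The sketch below is only an assumption about how this might look: the `ShuffleMessage` class, its fields, and the `build_messages` helper are hypothetical, since the disclosure does not define a wire format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShuffleMessage:
    """Hypothetical message: payload plus the post-Shuffle partition it belongs to."""
    partition_id: int
    payload: bytes


def build_messages(buckets, partition_to_component):
    """Return {component: [ShuffleMessage, ...]} for one pre-Shuffle partition."""
    out = {}
    for post_id, payload in buckets.items():
        component = partition_to_component[post_id]
        out.setdefault(component, []).append(ShuffleMessage(post_id, payload))
    return out


# One pre-Shuffle partition, already bucketed by post-Shuffle partition identifier.
msgs = build_messages({1: b"a", 2: b"b", 3: b"c", 4: b"d"},
                      {1: "A", 3: "A", 2: "B", 4: "B"})
```

Because each message carries its own `partition_id`, the receiving component in this sketch would not need to recompute which partition the data belongs to.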

As can be seen from fig. 4, the flow 400 of the data transmission method for the Shuffle process in this embodiment embodies a step of sending the identifier of the partition of the data set after the Shuffle process to which the sent data belongs to the corresponding target service component. Therefore, the scheme described in this embodiment can directly send the determined correspondence between the data in each partition of the data set before the Shuffle process and each partition of the data set after the Shuffle process to the target service component, thereby simplifying the calculation logic of the target service component and improving the overall efficiency of the Shuffle process.

With further reference to fig. 5, as an implementation of the methods shown in the above-mentioned figures, the present disclosure provides an embodiment of a data transmission apparatus for a Shuffle process, where the apparatus embodiment corresponds to the method embodiment shown in fig. 2 or fig. 4, and the apparatus may be specifically applied to various electronic devices.

As shown in fig. 5, the data transmission apparatus 500 for Shuffle process provided in this embodiment includes a first obtaining unit 501, a second obtaining unit 502, a determining unit 503, and a first sending unit 504. The first obtaining unit 501 is configured to obtain the number of partitions included in a data set before a Shuffle process and a data set after the Shuffle process, where the data set before the Shuffle process includes at least one partition; a second obtaining unit 502 configured to obtain the number of target service components; a determining unit 503, configured to determine, according to the number of the target service components, a corresponding relationship between each target service component and a partition of the data set after the Shuffle process; a first sending unit 504, configured to send the data of each partition in the data set before the Shuffle process to the target service component corresponding to the data set after the Shuffle process, respectively.

In this embodiment, in the data transmission apparatus 500 for Shuffle process: the detailed processing and the technical effects of the first obtaining unit 501, the second obtaining unit 502, the determining unit 503 and the first sending unit 504 may refer to the related descriptions of step 201, step 202, step 203 and step 204 in the corresponding embodiment of fig. 2, and are not described herein again.

In some optional implementations of this embodiment, the determining unit 503 may include: an acquisition module (not shown), a first determination module (not shown), and a second determination module (not shown). The obtaining module may be configured to obtain an identifier of a partition of the data set after the Shuffle process. The first determining module may be configured to determine, according to a preset hash function, a hash value corresponding to an identifier of a partition of the data set after the Shuffle process. The second determining module may be configured to determine, according to the hash value, a correspondence between each target service component and a partition of the data set after the Shuffle process.

In some optional implementations of this embodiment, the preset hash function may include a modulo of the number of target service components.

In some optional implementations of this embodiment, the data transmission apparatus 500 for the Shuffle process may further include a second sending unit (not shown in the figure). The second sending unit may be configured to send, to the corresponding target service component, an identifier of a partition of the data set after the Shuffle process to which the sent data belongs.

In the apparatus provided by the foregoing embodiment of the present disclosure, the first obtaining unit 501 obtains the number of partitions included in the data set before the Shuffle process and in the data set after the Shuffle process, wherein the data set before the Shuffle process comprises at least one partition. Then, the second obtaining unit 502 obtains the number of target service components. Next, according to the number of the target service components, the determining unit 503 determines the corresponding relationship between each target service component and the partition of the data set after the Shuffle process. Finally, the first sending unit 504 sends the data of each partition in the data set before the Shuffle process to the target service component corresponding to the data set after the Shuffle process. In this way, the Spark calculation process is decoupled from the storage process, re-execution of the Shuffle process caused by a local fault of a node is effectively avoided, and the stability and performance of Spark are improved.

With further reference to fig. 6, a timing sequence 600 of interactions between various devices in one embodiment of a data transmission system for the Shuffle process is shown. The data transmission system for the Shuffle process may include: data write-in (e.g., server 1052, shown in FIG. 1), target service component (e.g., server 106, shown in FIG. 1). Wherein, the data write terminal may be configured to implement the data transmission method for the Shuffle process as described in the foregoing embodiments. The target service component may be configured to, in response to receiving data sent by the data write end, obtain an identifier of a partition of a post-Shuffle data set to which the received data belongs; and writing the received data into the target storage system according to the acquired identification of the partition. The target storage system may include a data file whose identifier is consistent with the identifier of the partition of the data set after the Shuffle process.

In some optional implementation manners of this embodiment, the data transmission system for the Shuffle process may further include: and the data reading end is configured to read data from the target storage system and generate a data set after the Shuffle process. The identifier of the partition to which the data in the data set after the Shuffle process belongs is usually consistent with the identifier of the file of the data in the target system.

In some optional implementations of this embodiment, the target storage system may include a Hadoop Distributed File System (HDFS).

As shown in fig. 6, in step 601, the data write end acquires the number of partitions included in the data set before the Shuffle process and the data set after the Shuffle process.

In step 602, the data write end obtains the number of target service components.

In step 603, according to the number of the target service components, the data write end determines the corresponding relationship between each target service component and the partition of the data set after the Shuffle process.

In step 604, the data write end sends the data of each partition in the data set before the Shuffle process to the target service component corresponding to the partition of the data set after the Shuffle process to which that data belongs.

Steps 601-604 are consistent with steps 201-204 and their optional implementations in the foregoing embodiment, respectively. The above description of steps 201-204 and their optional implementations also applies to steps 601-604 and is not repeated here.
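Steps 601-604 on the data write end can be illustrated with a minimal Python sketch. It assumes that the correspondence in step 603 is computed as the partition identifier modulo the number of target service components, and that records are (key, value) pairs assigned to partitions of the data set after the Shuffle process by a hash partitioner. The names `assign_partitions` and `route_records` are hypothetical and are introduced only for illustration.

```python
from collections import defaultdict


def assign_partitions(num_post_partitions, num_components):
    """Step 603 (assumed rule): partition id modulo the component count."""
    if num_components <= 0:
        raise ValueError("at least one target service component is required")
    return {pid: pid % num_components for pid in range(num_post_partitions)}


def route_records(pre_partitions, num_post_partitions, num_components,
                  partitioner=hash):
    """Steps 601-604: group outgoing records by target service component.

    pre_partitions is a list of pre-Shuffle partitions, each a list of
    (key, value) records. The result maps a component index to a dict of
    {post-Shuffle partition id: records to send}.
    """
    assignment = assign_partitions(num_post_partitions, num_components)
    outbox = defaultdict(lambda: defaultdict(list))
    for records in pre_partitions:
        for key, value in records:
            # Hypothetical partitioner: hash of the key modulo partition count.
            pid = partitioner(key) % num_post_partitions
            outbox[assignment[pid]][pid].append((key, value))
    return {component: dict(parts) for component, parts in outbox.items()}


# Two pre-Shuffle partitions, four post-Shuffle partitions, two components.
pre = [[(0, "a"), (1, "b")], [(2, "c"), (3, "d"), (4, "e")]]
routed = route_records(pre, num_post_partitions=4, num_components=2)
```

In this sketch every record of a given partition of the data set after the Shuffle process reaches the same target service component, regardless of which partition of the data set before the Shuffle process it came from.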

In step 605, in response to receiving the data sent by the data write end, the target service component obtains an identification of a partition of the post-Shuffle data set to which the received data belongs.

In this embodiment, in response to receiving data sent by the data write end, the target service component may obtain, in various ways, the identifier of the partition of the data set after the Shuffle process to which the received data belongs. As an example, the target service component may obtain that identifier from the data write end that sent the data. As another example, the target service component may determine the correspondence between the received data and the partitions of the data set after the Shuffle process in a manner consistent with step 203 in the foregoing embodiment, and then determine, according to that correspondence, the identifier of the partition of the data set after the Shuffle process to which the received data belongs.
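The two examples of step 605 can be sketched as follows. The message layout (a dict with an optional `partition_id` field and a `key` field) and the function name `partition_id_for` are assumptions made for illustration. The fallback branch further assumes that the target service component applies the same hash partitioner as the data write end.

```python
def partition_id_for(message, num_post_partitions, partitioner=hash):
    """Return the post-Shuffle partition id of a received message."""
    pid = message.get("partition_id")
    if pid is not None:
        # First example: the data write end sent the identifier with the data.
        return pid
    # Second example (assumed): recompute the identifier from the record key.
    return partitioner(message["key"]) % num_post_partitions


explicit = partition_id_for({"partition_id": 3, "key": 10, "value": "x"}, 4)
derived = partition_id_for({"key": 10, "value": "x"}, 4)
```

Sending the identifier with the data keeps the target service component independent of the partitioner, at the cost of a few extra bytes per message.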

In step 606, the target service component writes the received data to the target storage system based on the obtained identification of the partition.

In this embodiment, the target service component may write the received data to the target storage system in various ways according to the identifier of the partition acquired in step 605. The target storage system may include various remote storage systems that provide redundant backup, and may include a data file whose identifier is consistent with the identifier of the partition of the data set after the Shuffle process. As an example, the identifiers of the partitions of the data set after the Shuffle process may be "partition 1" and "partition 2". The target service component may then write the data belonging to each partition into the corresponding "partition 1.data" or "partition 2.data" file in the target storage system.
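Step 606 can be sketched as below. A local temporary directory stands in for the remote target storage system. The file naming scheme `partition_<id>.data`, the tab-separated record format, and the function name `write_partition` are assumptions for illustration only.

```python
import os
import tempfile


def write_partition(storage_root, partition_id, records):
    """Append records to the data file named after the partition id."""
    path = os.path.join(storage_root, f"partition_{partition_id}.data")
    with open(path, "a", encoding="utf-8") as f:
        for key, value in records:
            f.write(f"{key}\t{value}\n")
    return path


storage_root = tempfile.mkdtemp()  # stand-in for the remote storage system
write_partition(storage_root, 1, [("k1", "v1")])
write_partition(storage_root, 1, [("k2", "v2")])  # from another pre-Shuffle partition
write_partition(storage_root, 2, [("k3", "v3")])
```

Because the file is opened in append mode, data that arrives from different partitions of the data set before the Shuffle process accumulates in the single file of its partition of the data set after the Shuffle process.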

Optionally, the target storage system may include a Hadoop distributed file system.

In some optional implementations of this embodiment, the timing sequence 600 of interactions between the devices may further include step 607. In step 607, data is read from the target storage system to generate the data set after the Shuffle process.

In these implementations, a data reading end (e.g., the server 1053 shown in fig. 1) may read data from the target storage system and generate the data set after the Shuffle process. The identifier of the partition to which data in the data set after the Shuffle process belongs may be consistent with the identifier of the file that holds that data in the target storage system.
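Step 607 can be sketched as one sequential file read per partition. As before, a local directory stands in for the target storage system. The file naming scheme `partition_<id>.data`, the tab-separated record format, and the function name `read_post_shuffle` are hypothetical.

```python
import os
import tempfile


def read_post_shuffle(storage_root, num_post_partitions):
    """Build the post-Shuffle data set, one partition per data file."""
    partitions = []
    for pid in range(num_post_partitions):
        path = os.path.join(storage_root, f"partition_{pid}.data")
        records = []
        if os.path.exists(path):  # a partition may have received no data
            with open(path, encoding="utf-8") as f:
                for line in f:
                    key, value = line.rstrip("\n").split("\t", 1)
                    records.append((key, value))
        partitions.append(records)
    return partitions


# Prepare two partition files, then read three partitions back.
storage_root = tempfile.mkdtemp()
with open(os.path.join(storage_root, "partition_0.data"), "w", encoding="utf-8") as f:
    f.write("k1\tv1\nk2\tv2\n")
with open(os.path.join(storage_root, "partition_1.data"), "w", encoding="utf-8") as f:
    f.write("k3\tv3\n")
post_shuffle = read_post_shuffle(storage_root, num_post_partitions=3)
```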

In the data transmission system for the Shuffle process provided in the above embodiment of the application, the data write end first acquires the number of partitions included in the data set before the Shuffle process and in the data set after the Shuffle process, and then acquires the number of target service components. According to the number of target service components, the data write end determines the correspondence between each target service component and the partitions of the data set after the Shuffle process, and sends the data of each partition in the data set before the Shuffle process to the corresponding target service component. In response to receiving the data sent by the data write end, the target service component acquires the identifier of the partition of the data set after the Shuffle process to which the received data belongs and, according to that identifier, writes the received data into the target storage system. By decoupling the Spark computation process from the storage process, re-execution of the Shuffle process caused by a local node fault is effectively avoided. Moreover, the target service component directly aggregates, by partition, the data of the RDD to be generated in the Shuffle Read process, that is, each partition of the child RDD is stored as one file. This speeds up Shuffle Read and avoids random reads, thereby improving the stability and performance of Spark.
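The effect of storing each child RDD partition as one file can be illustrated with a rough count of read requests. The baseline below is an assumption, not a statement from this disclosure: it supposes a layout in which every partition of the data set after the Shuffle process must pull one segment from the output of every partition of the data set before the Shuffle process. The partition counts are illustrative only.

```python
def shuffle_read_requests(num_pre_partitions, num_post_partitions,
                          one_file_per_partition):
    """Count the read requests needed to build all post-Shuffle partitions."""
    if one_file_per_partition:
        # Layout of this embodiment: each post-Shuffle partition is one file.
        return num_post_partitions
    # Assumed baseline: one segment per (pre, post) partition pair.
    return num_pre_partitions * num_post_partitions


baseline = shuffle_read_requests(1000, 200, one_file_per_partition=False)
aggregated = shuffle_read_requests(1000, 200, one_file_per_partition=True)
```

Under these assumptions the number of read requests no longer grows with the number of partitions of the data set before the Shuffle process, and each request is a sequential read of a whole file.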

Referring now to fig. 7, a schematic diagram of an electronic device 700 (e.g., server 1052 in fig. 1) suitable for implementing embodiments of the present application is shown. The server shown in fig. 7 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present application.

As shown in fig. 7, the electronic device 700 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 701 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 702 or a program loaded from a storage device 708 into a Random Access Memory (RAM) 703. The RAM 703 also stores various programs and data necessary for the operation of the electronic device 700. The processing device 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.

Generally, the following devices may be connected to the I/O interface 705: an input device 706 including, for example, a touch screen, touch pad, keyboard, mouse, etc.; an output device 707 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; a storage device 708 including, for example, magnetic tape, hard disk, etc.; and a communication device 709. The communication device 709 may allow the electronic device 700 to communicate wirelessly or by wire with other devices to exchange data. While fig. 7 illustrates an electronic device 700 having various devices, it is to be understood that not all of the illustrated devices are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided. Each block shown in fig. 7 may represent one device or may represent multiple devices as desired.

In particular, according to embodiments of the application, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated by the flow chart. In such embodiments, the computer program may be downloaded and installed from a network via the communication device 709, or may be installed from the storage device 708, or may be installed from the ROM 702. The computer program, when executed by the processing device 701, performs the above-described functions defined in the methods of the embodiments of the present application.

It should be noted that the computer readable medium described in the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In embodiments of the present disclosure, however, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (Radio Frequency), etc., or any suitable combination of the foregoing.

The computer readable medium may be embodied in the server; or may exist separately and not be assembled into the server. The computer readable medium carries one or more programs which, when executed by the server, cause the server to: acquiring the number of partitions included in a data set before a Shuffle process and a data set after the Shuffle process, wherein the data set before the Shuffle process includes at least one partition; acquiring the number of target service components; determining the corresponding relation between each target service component and the partition of the data set after the Shuffle process according to the number of the target service components; and respectively sending the data of each partition in the data set before the Shuffle process to the target service component corresponding to the data set after the Shuffle process.

Computer program code for carrying out operations for embodiments of the present disclosure may be written in any combination of one or more programming languages, including object oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units described in the embodiments of the present disclosure may be implemented by software or hardware. The described units may also be provided in a processor, which may be described as: a processor comprising a first obtaining unit, a second obtaining unit, a determining unit, and a first sending unit. The names of these units do not in themselves limit the units. For example, the first obtaining unit may also be described as a unit for obtaining the number of partitions included in the data set before the Shuffle process and in the data set after the Shuffle process, where the data set before the Shuffle process includes at least one partition.

The foregoing description is only of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present disclosure is not limited to the specific combination of the above-mentioned features, but also encompasses other embodiments formed by any combination of the above-mentioned features or their equivalents without departing from the inventive concept defined above. One example is a technical solution formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the embodiments of the present disclosure.

Claims

1. A data transmission method for a Shuffle process, comprising:

acquiring the number of partitions included in the data set before the Shuffle process and the data set after the Shuffle process, wherein the data set before the Shuffle process includes at least one partition;

acquiring the number of target service components;

determining the corresponding relation between each target service component and the partition of the data set after the Shuffle process according to the number of the target service components;

and respectively sending the data of each partition in the data set before the Shuffle process to a target service component corresponding to the data set after the Shuffle process.

2. The method according to claim 1, wherein the determining, according to the number of the target service components, a correspondence between each target service component and a partition of the data set after the Shuffle process includes:

acquiring the identifier of the partition of the data set after the Shuffle process;

determining a hash value corresponding to the identifier of the partition of the data set after the Shuffle process according to a preset hash function;

and determining the corresponding relation between each target service component and the partition of the data set after the Shuffle process according to the hash value.

3. The method of claim 2, wherein the preset hash function comprises taking a modulo with respect to the number of target service components.

4. The method according to one of claims 1-3, wherein the method further comprises:

and sending the identifier of the partition of the data set after the Shuffle process, to which the sent data belongs, to the corresponding target service component.

5. A data transmission apparatus for a Shuffle process, comprising:

a first obtaining unit configured to obtain the number of partitions included in the data set before the Shuffle process and the data set after the Shuffle process, wherein the data set before the Shuffle process includes at least one partition;

a second obtaining unit configured to obtain the number of target service components;

a determining unit configured to determine the corresponding relation between each target service component and the partition of the data set after the Shuffle process according to the number of the target service components;

and a first sending unit configured to send the data of each partition in the data set before the Shuffle process to the target service component corresponding to the data set after the Shuffle process respectively.

6. A data transmission system for a Shuffle process, comprising:

a data write end configured to implement the method of any one of claims 1-4;

and a target service component configured to: in response to receiving data sent by the data write end, acquire an identifier of the partition of the data set after the Shuffle process to which the received data belongs; and write the received data into a target storage system according to the acquired identifier of the partition, wherein the target storage system comprises a data file whose identifier is consistent with the identifier of the partition of the data set after the Shuffle process.

7. The system of claim 6, wherein the system further comprises:

a data reading end configured to read data from the target storage system and generate the data set after the Shuffle process, wherein the identifier of the partition to which the data in the data set after the Shuffle process belongs is consistent with the identifier of the file of the data in the target storage system.

8. The system of claim 6 or 7, wherein the target storage system comprises a Hadoop distributed file system.

9. A server, comprising:

one or more processors;

a storage device having one or more programs stored thereon;

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-4.

10. A computer-readable medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1-4.