CN113761548B - Data transmission method and device for Shuffle process - Google Patents


Info

Publication number
CN113761548B
CN113761548B (application CN202010536544.5A)
Authority
CN
China
Prior art keywords
data set
partition
data
shuffle
shuffle process
Prior art date
Legal status
Active
Application number
CN202010536544.5A
Other languages
Chinese (zh)
Other versions
CN113761548A (en)
Inventor
王文生
石磊
吴雪扬
Current Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Wodong Tianjun Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN202010536544.5A
Publication of CN113761548A
Application granted
Publication of CN113761548B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/606Protecting data by securing the transmission between two devices or processes
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • H04L67/104Peer-to-peer [P2P] networks
    • H04L67/1044Group management mechanisms 

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Bioethics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present disclosure disclose a data transmission method and apparatus for a Shuffle process. One embodiment of the method comprises the following steps: acquiring the data set before the Shuffle process and the number of partitions included in the data set after the Shuffle process, wherein the data set before the Shuffle process includes at least one partition; acquiring the number of target service components; determining, according to the number of target service components, the correspondence between each target service component and the partitions of the data set after the Shuffle process; and respectively sending the data of each partition in the data set before the Shuffle process to the target service component corresponding to the data set after the Shuffle process. This embodiment decouples the Spark computing process from the storage process, effectively avoids re-execution of the Shuffle process caused by local node failures, and improves the stability and performance of Spark.

Description

Data transmission method and device for Shuffle process
Technical Field
The embodiment of the disclosure relates to the technical field of computers, in particular to a data transmission method and device for a Shuffle process.
Background
With the rapid development of computer technology, general-purpose computing engines suitable for large-scale data processing have emerged. In the prior art, the Spark computing engine is generally used to convert RDDs (Resilient Distributed Datasets) into one another according to their dependencies. When the dependency between the RDDs before and after a conversion is a wide dependency, the conversion needs to be completed through a Shuffle.
Since the partitions of a Spark RDD are typically distributed across different computing nodes, a node failure (including a network failure, hard disk failure, etc.) triggers a very serious error in Spark (FetchFailed), causing the Shuffle process to be re-executed and consuming additional computing resources and time. Moreover, since the number of random reads in the Shuffle Read process is related to the numbers of partitions of the two RDDs before and after the Shuffle, if the number of partitions of the target RDD (the RDD formed after the Shuffle) is too large, the read time is often too long because too many reads are issued; if the number of partitions of the target RDD is too small, the amount of data handled by each of its partitions is too large, which easily causes an OOM (out-of-memory) error.
Disclosure of Invention
The embodiment of the disclosure provides a data transmission method and a data transmission device for a Shuffle process.
In a first aspect, embodiments of the present disclosure provide a data transmission method for a Shuffle procedure, the method including: acquiring a data set before a Shuffle process and the number of partitions included in the data set after the Shuffle process, wherein the data set before the Shuffle process comprises at least one partition; acquiring the number of target service components; determining the corresponding relation between each target service component and the partition of the data set after the Shuffle process according to the number of the target service components; and respectively sending the data of each partition in the data set before the Shuffle process to a target service component corresponding to the data set after the Shuffle process.
In some embodiments, determining the correspondence between each target service component and the partition of the data set after the Shuffle procedure according to the number of target service components includes: acquiring the identification of the partition of the data set after the Shuffle process; determining a hash value corresponding to the identification of the partition of the data set after the Shuffle process according to a preset hash function; and determining the corresponding relation between each target service component and the partition of the data set after the Shuffle process according to the hash value.
In some embodiments, the preset hash function includes taking the identification of the partition modulo the number of target service components.
In some embodiments, the method further comprises: and sending the identification of the partition of the data set after the Shuffle process to which the sent data belongs to the corresponding target service component.
In a second aspect, embodiments of the present disclosure provide a data transmission apparatus for a Shuffle procedure, the apparatus comprising: a first acquisition unit configured to acquire a data set before a Shuffle process and the number of partitions included in the data set after the Shuffle process, wherein the data set before the Shuffle process includes at least one partition; a second acquisition unit configured to acquire the number of target service components; the determining unit is configured to determine the corresponding relation between each target service component and the partition of the data set after the Shuffle process according to the number of the target service components; and the first sending unit is configured to send the data of each partition in the data set before the Shuffle process to a target service component corresponding to the data set after the Shuffle process.
In some embodiments, the determining unit includes: the acquisition module is configured to acquire the identification of the partition of the data set after the Shuffle process; the first determining module is configured to determine hash values corresponding to the identifiers of the partitions of the data set after the Shuffle process according to a preset hash function; and the second determining module is configured to determine the corresponding relation between each target service component and the partition of the data set after the Shuffle process according to the hash value.
In some embodiments, the preset hash function may include modulo the number of target service components.
In some embodiments, the apparatus further comprises: and the second sending unit is configured to send the identification of the partition of the data set after the Shuffle process to which the sent data belongs to the corresponding target service component.
In a third aspect, embodiments of the present application provide a data transmission system for a Shuffle procedure, the system including: a data writing end configured to perform the method as described in any of the implementations of the first aspect; and a target service component configured to, in response to receiving data sent by the data writing end, acquire the identification of the partition of the data set after the Shuffle process to which the received data belongs, and write the received data into a target storage system according to the acquired identification of the partition, wherein the target storage system includes a data file whose identification is consistent with the identification of the partition of the data set after the Shuffle process.
In some embodiments, the system further comprises: a data reading end configured to read data from the target storage system and generate the data set after the Shuffle process, wherein the identification of the partition to which the data in the data set after the Shuffle process belongs is consistent with the identification of the file storing that data in the target storage system.
In some embodiments, the target storage system comprises a Hadoop distributed file system.
In a fourth aspect, embodiments of the present application provide a server, including: one or more processors; a storage device having one or more programs stored thereon; the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method as described in any of the implementations of the first aspect.
In a fifth aspect, embodiments of the present application provide a computer readable medium having stored thereon a computer program which, when executed by a processor, implements a method as described in any of the implementations of the first aspect.
The embodiment of the disclosure provides a data transmission method and device for a Shuffle process, which firstly obtains a data set before the Shuffle process and the number of partitions included in the data set after the Shuffle process. Wherein the data set before the Shuffle process includes at least one partition. Then, the number of target service components is obtained. And then, according to the number of the target service components, determining the corresponding relation between each target service component and the partition of the data set after the Shuffle process. And finally, respectively sending the data of each partition in the data set before the Shuffle process to a target service component corresponding to the data set after the Shuffle process. Therefore, decoupling of the Spark computing process and the storage process is realized, re-execution of the Shuffle process due to local faults of the nodes is effectively avoided, and stability and performance of the Spark are improved.
Drawings
Other features, objects and advantages of the present disclosure will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the following drawings:
FIG. 1 is an exemplary system architecture diagram in which an embodiment of the present disclosure may be applied;
FIG. 2 is a flow chart of one embodiment of a data transmission method for a Shuffle procedure in accordance with the present disclosure;
FIG. 3 is a schematic diagram of one application scenario of a data transmission method for a Shuffle procedure in accordance with an embodiment of the present disclosure;
FIG. 4 is a flow chart of yet another embodiment of a data transmission method for a Shuffle procedure in accordance with the present disclosure;
FIG. 5 is a schematic diagram of a structure of one embodiment of a data transmission apparatus for a Shuffle process in accordance with the present disclosure;
FIG. 6 is a timing diagram of interactions between various devices in one embodiment of a data transmission system for a Shuffle process in accordance with the present disclosure;
fig. 7 is a schematic structural diagram of an electronic device suitable for use in implementing embodiments of the present disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings.
It should be noted that, without conflict, the embodiments of the present disclosure and features of the embodiments may be combined with each other. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary architecture 100 to which the data transmission method for a Shuffle procedure or the data transmission apparatus for a Shuffle procedure of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, a server cluster 105 composed of servers 1051, 1052, and 1053, and a server 106. The network 104 is the medium used to provide communication links between the terminal devices 101, 102, 103 and the server cluster 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables.
Terminal devices 101, 102, 103 may interact with server cluster 105 through network 104 to receive or send messages, etc. Various communication client applications, such as a web browser application, a shopping class application, a search class application, an instant messaging tool, a mailbox client, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices having a display screen and supporting data transmission, including but not limited to smartphones, tablet computers, laptop and desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they can be installed in the above-listed electronic devices. Which may be implemented as multiple software or software modules (e.g., software or software modules for providing distributed services) or as a single software or software module. The present invention is not particularly limited herein.
The server cluster 105 may comprise servers providing various services, such as Spark servers providing support for applications running on the terminal devices 101, 102, 103. Spark task scheduling server 1051 may generate Map tasks and Reduce tasks from the received data set. The Map task and Reduce task described above may be performed by the server 1052 and the server 1053, respectively. Server 1052 may process the received data set and send the processed data to server 106. The server 106 may typically run target service components.
The server may be hardware or software. When the server is hardware, the server may be implemented as a distributed server cluster formed by a plurality of servers, or may be implemented as a single server. When the server is software, it may be implemented as a plurality of software or software modules (e.g., software or software modules for providing distributed services), or as a single software or software module. The present invention is not particularly limited herein.
It should be noted that, the data transmission method for the Shuffle procedure provided by the embodiments of the present disclosure is generally performed by the server 1052, and accordingly, the data transmission device for the Shuffle procedure is generally disposed in the server 1052.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to fig. 2, a flow 200 of one embodiment of a data transmission method for a Shuffle procedure in accordance with the present disclosure is shown. The data transmission method for the Shuffle procedure includes the steps of:
Step 201, the data set before the Shuffle process and the number of partitions included in the data set after the Shuffle process are acquired.
In this embodiment, the execution body of the data transmission method for the Shuffle procedure (such as the server 1052 shown in fig. 1) may acquire, through a wired or wireless connection, the data set before the Shuffle procedure and the number of partitions included in the data set after the Shuffle procedure. The data set before the Shuffle process may include a parent RDD, and the data set after the Shuffle process may include a child RDD. The data set before the Shuffle procedure may include at least one partition.
Specifically, the execution body may acquire a data set before the Shuffle process stored locally in advance, or may acquire the data set before the Shuffle process from an electronic device (for example, a terminal device) connected to the execution body in communication. The execution body may obtain the number of partitions included in the data set after the Shuffle procedure from the task scheduler (e.g., the server 1051 shown in fig. 1) of Spark.
Step 202, the number of target service components is obtained.
In this embodiment, the execution body may acquire the number of target service components through a wired connection manner or a wireless connection manner. The target service component may be configured to store the received data in a manner consistent with a partition of the data set after the Shuffle procedure. The number of target service components described above may be preconfigured.
Step 203, determining the corresponding relation between each target service component and the partition of the data set after the Shuffle process according to the number of target service components.
In this embodiment, according to the number of target service components acquired in step 202, the execution body may determine the correspondence between each target service component and the partitions of the data set after the Shuffle procedure in various manners. As an example, the execution body may determine the correspondence between each target service component and the partitions of the data set after the Shuffle procedure according to a pre-configured correspondence table. As yet another example, the execution body may also randomly determine the correspondence between each target service component and the partitions of the data set after the Shuffle procedure. Typically, each partition of the data set after the Shuffle procedure corresponds to only one target service component, that is, there is typically no overlap between the partitions corresponding to different target service components.
In some optional implementations of this embodiment, the executing entity may further determine a correspondence between each target service component and a partition of the data set after the Shuffle procedure by:
first, the identification of the partition of the data set after the Shuffle process is obtained.
In these implementations, the executing body may obtain the identification of the partition of the data set after the Shuffle procedure through a wired connection manner or a wireless connection manner. The identification of the above-mentioned partitions may include various forms, such as numerals, letters, and the like.
And secondly, determining hash values corresponding to the identifiers of the partitions of the data set after the Shuffle process according to a preset hash function.
In these implementations, the executing body may determine, according to a preset hash function, a hash value corresponding to an identifier of a partition of the data set after the Shuffle procedure.
Optionally, when the identification of the partition is in numeric form, the preset hash function may include taking the identification modulo the number of target service components. As an example, the partition identifications of the above child RDD may be 1, 2, 3, and 4, and the number of target service components may be 2. The preset hash function is then modulo 2, and the hash values corresponding to the partition identifications of the child RDD are 1, 0, 1, and 0, respectively.
And thirdly, determining the corresponding relation between each target service component and the partition of the data set after the Shuffle process according to the hash value.
In these implementations, the execution body may determine, according to the hash values determined in the second step, the correspondence between each target service component and the partitions of the data set after the Shuffle procedure in various manners. As an example, partitions whose hash values are identical may be associated with the same target service component.
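As a purely illustrative sketch of these implementations (not code from the patent), the following Scala snippet maps each partition identification of the data set after the Shuffle process to a target service component by taking the identification modulo the number of components; the names PartitionAssignment, assignPartitions, and numComponents are hypothetical:

    object PartitionAssignment {
      // Assumed sketch: map each child-RDD partition id to a target service component
      // using the preset hash function "partition id modulo number of components".
      def assignPartitions(childPartitionIds: Seq[Int], numComponents: Int): Map[Int, Int] =
        childPartitionIds.map(id => id -> (id % numComponents)).toMap

      def main(args: Array[String]): Unit = {
        // Example from the text: partition ids 1, 2, 3, 4 and 2 target service components
        // give hash values 1, 0, 1, 0, so partitions 1 and 3 share one component and
        // partitions 2 and 4 share the other.
        println(assignPartitions(Seq(1, 2, 3, 4), numComponents = 2))
        // Map(1 -> 1, 2 -> 0, 3 -> 1, 4 -> 0)
      }
    }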
Step 204, the data of each partition in the data set before the Shuffle process is respectively sent to the target service component corresponding to the data set after the Shuffle process.
In this embodiment, the execution body may send the data of each partition in the data set before the Shuffle process to the corresponding target service component in various manners. As an example, suppose the data set before the Shuffle process has 6 partitions and the data set after the Shuffle process has 4 partitions, where the 1st and 3rd partitions of the data set after the Shuffle process correspond to target service component A and the 2nd and 4th partitions correspond to target service component B. The execution body may send, from each partition of the data set before the Shuffle process, the data corresponding to the 1st and 3rd partitions of the data set after the Shuffle process to target service component A, and send the data corresponding to the 2nd and 4th partitions of the data set after the Shuffle process to target service component B.
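For illustration only, the following Scala sketch (assuming a hypothetical Record type tagged with the post-Shuffle partition it belongs to, and a hypothetical send function standing in for the network call) shows how the data of one pre-Shuffle partition could be routed to the target service components under the modulo-based correspondence:

    object ShuffleWriteRouting {
      // Hypothetical record: the id of the post-Shuffle partition it belongs to, plus a payload.
      final case class Record(childPartitionId: Int, payload: String)

      // Hypothetical stand-in for sending a batch of records to one target service component.
      def send(componentId: Int, records: Seq[Record]): Unit =
        println(s"component $componentId <- ${records.size} records")

      // Group the records of one pre-Shuffle partition by the post-Shuffle partition they
      // belong to, then send each group to the component assigned to that partition
      // (here via the same "id modulo number of components" correspondence).
      def routePartition(records: Seq[Record], numComponents: Int): Unit =
        records.groupBy(_.childPartitionId).foreach { case (childId, group) =>
          send(childId % numComponents, group)
        }
    }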
With continued reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of the data transmission method for a Shuffle procedure according to an embodiment of the present disclosure. In the application scenario of fig. 3, in the Map (i.e., Shuffle Write) phase of Spark, the server (executor) may acquire RDD1 (i.e., the parent RDD, 301 in fig. 3) and the number of partitions (e.g., 2) included in RDD2 (i.e., the child RDD, 302 in fig. 3). RDD1 includes partition 1, partition 2, and partition 3. Partition 1 of RDD1 includes data set 3011 corresponding to partition 1 of RDD2 and data set 3012 corresponding to partition 2 of RDD2. Partition 2 of RDD1 includes data set 3013 corresponding to partition 1 of RDD2 and data set 3014 corresponding to partition 2 of RDD2. Partition 3 of RDD1 includes data set 3015 corresponding to partition 1 of RDD2 and data set 3016 corresponding to partition 2 of RDD2. The server may determine the number of target service components (e.g., 2) based on target service component A (3031 in fig. 3) and target service component B (3032 in fig. 3). The server may then determine that target service component A and target service component B correspond to partition 1 and partition 2 of RDD2, respectively. The server may then send the data sets 3011, 3013, and 3015 in the partitions of RDD1, which correspond to partition 1 of RDD2, to target service component A, and send the data sets 3012, 3014, and 3016 in the partitions of RDD1, which correspond to partition 2 of RDD2, to target service component B.
Currently, in the prior art, the calculation result files generated during the Shuffle Write process are generally written to the local disk, which triggers re-execution of the Shuffle when the executing node fails. In the method provided by the embodiments of the present disclosure, the data of each partition in the data set before the Shuffle process is sent to the target service component corresponding to the data set after the Shuffle process, so that the Spark computing process is decoupled from the storage process. Therefore, re-execution of the Shuffle process caused by a local node failure is effectively avoided, and the stability and performance of Spark are improved.
With further reference to fig. 4, a flow 400 of yet another embodiment of a data transmission method for a Shuffle procedure is shown. The flow 400 of the data transmission method for the Shuffle procedure includes the following steps:
Step 401, the data set before the Shuffle process and the number of partitions included in the data set after the Shuffle process are acquired.
Step 402, the number of target service components is obtained.
Step 403, determining the corresponding relation between each target service component and the partition of the data set after the Shuffle process according to the number of target service components.
Step 404, the data of each partition in the data set before the Shuffle process is respectively sent to the target service component corresponding to the data set after the Shuffle process.
Steps 401, 402, 403, and 404 are respectively identical to steps 201, 202, 203, and 204 and their optional implementations in the foregoing embodiment, and the descriptions of steps 201-204 and their optional implementations also apply to steps 401, 402, 403, and 404, which are not repeated herein.
Step 405, the identification of the partition of the data set after the Shuffle process to which the transmitted data belongs is transmitted to the corresponding target service component.
In this embodiment, the execution body of the data transmission method for the Shuffle procedure (for example, the server 1052 shown in fig. 1) may also send the identification of the partition of the data set after the Shuffle procedure to which the sent data belongs to the corresponding target service component. As an example, referring to step 204 of the foregoing embodiment, the execution body may further send, to target service component A, information indicating that the sent data belongs to the 1st and 3rd partitions of the child RDD, and send, to target service component B, information indicating that the sent data belongs to the 2nd and 4th partitions of the child RDD.
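As a small, assumed illustration of this step (the message type ShuffleMessage and the function sendWithId are hypothetical names, not defined in the patent), the data sent to a target service component can simply be wrapped together with the identification of the post-Shuffle partition it belongs to:

    object ShuffleWriteWithPartitionId {
      // Hypothetical message: the payload plus the identification of the partition of the
      // data set after the Shuffle process to which the payload belongs (step 405).
      final case class ShuffleMessage(childPartitionId: Int, payload: Seq[String])

      // Hypothetical stand-in for the network call; the component can use the carried
      // partition id directly, without recomputing the correspondence.
      def sendWithId(componentId: Int, msg: ShuffleMessage): Unit =
        println(s"component $componentId <- partition ${msg.childPartitionId}: ${msg.payload.size} rows")
    }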
As can be seen from fig. 4, the flow 400 of the data transmission method for the Shuffle procedure in this embodiment highlights the step of sending, to the corresponding target service component, the identification of the partition of the data set after the Shuffle procedure to which the sent data belongs. Therefore, the scheme described in this embodiment can directly send the determined correspondence between the data in each partition of the data set before the Shuffle process and each partition of the data set after the Shuffle process to the target service components, thereby simplifying the computation logic of the target service components and improving the overall efficiency of the Shuffle process.
With further reference to fig. 5, as an implementation of the method shown in the foregoing figures, the present disclosure provides an embodiment of a data transmission apparatus for a Shuffle procedure, which corresponds to the method embodiment shown in fig. 2 or fig. 4, and which is particularly applicable to various electronic devices.
As shown in fig. 5, the data transmission apparatus 500 for a Shuffle procedure provided in the present embodiment includes a first acquisition unit 501, a second acquisition unit 502, a determination unit 503, and a first transmission unit 504. Wherein, the first obtaining unit 501 is configured to obtain a data set before the Shuffle process and the number of partitions included in the data set after the Shuffle process, where the data set before the Shuffle process includes at least one partition; a second acquisition unit 502 configured to acquire the number of target service components; a determining unit 503 configured to determine, according to the number of target service components, a correspondence between each target service component and a partition of the data set after the Shuffle process; the first sending unit 504 is configured to send the data of each partition in the data set before the Shuffle process to the target service component corresponding to the data set after the Shuffle process, respectively.
In the present embodiment, in the data transmission apparatus 500 for the Shuffle procedure: the specific processing of the first acquiring unit 501, the second acquiring unit 502, the determining unit 503 and the first transmitting unit 504 and the technical effects thereof may refer to the descriptions related to step 201, step 202, step 203 and step 204 in the corresponding embodiment of fig. 2, and are not repeated herein.
In some optional implementations of this embodiment, the determining unit 503 may include: an acquisition module (not shown), a first determination module (not shown), and a second determination module (not shown). The acquiring module may be configured to acquire the identification of the partition of the data set after the Shuffle procedure. The first determining module may be configured to determine, according to a preset hash function, a hash value corresponding to an identifier of a partition of the data set after the Shuffle process. The second determining module may be configured to determine, according to the hash value, a correspondence between each target service component and a partition of the data set after the Shuffle procedure.
In some optional implementations of this embodiment, the preset hash function may include modulo the number of target service components.
In some alternative implementations of the present embodiment, the data transmission apparatus 500 for a Shuffle procedure may further include a second transmitting unit (not shown in the drawing). The second sending unit may be configured to send, to the corresponding target service component, an identifier of a partition of the data set after the Shuffle procedure to which the sent data belongs.
The apparatus provided by the above embodiment of the present disclosure first acquires, through the first acquiring unit 501, the data set before the Shuffle process and the number of partitions included in the data set after the Shuffle process, wherein the data set before the Shuffle process includes at least one partition. Then, the second acquiring unit 502 acquires the number of target service components. Next, according to the number of target service components, the determining unit 503 determines the correspondence between each target service component and the partitions of the data set after the Shuffle process. Finally, the first sending unit 504 sends the data of each partition in the data set before the Shuffle process to the target service component corresponding to the data set after the Shuffle process. Thus, the Spark computing process is decoupled from the storage process, re-execution of the Shuffle process due to local node failures is effectively avoided, and the stability and performance of Spark are improved.
With further reference to fig. 6, illustrated is a timing 600 of interactions between the devices in one embodiment of a data transmission system for a Shuffle procedure. The data transmission system for the Shuffle procedure may include: a data writing end (e.g., the server 1052 shown in fig. 1) and a target service component (e.g., the server 106 shown in fig. 1). The data writing end may be configured to implement the data transmission method for the Shuffle procedure as described in the foregoing embodiments. The target service component may be configured to, in response to receiving data sent by the data writing end, obtain the identification of the partition of the data set after the Shuffle process to which the received data belongs, and write the received data into a target storage system according to the acquired identification of the partition. The target storage system may include a data file whose identification matches the identification of the partition of the data set after the Shuffle process.
In some optional implementations of this embodiment, the data transmission system for the Shuffle procedure may further include: a data reading end configured to read data from the target storage system and generate the data set after the Shuffle process. The identification of the partition to which the data in the data set after the Shuffle process belongs is generally consistent with the identification of the file storing that data in the target storage system.
In some alternative implementations of the present embodiment, the target storage system may include a Hadoop distributed file system (HDFS, hadoop Distributed File System).
As shown in fig. 6, in step 601, the data writing end acquires the data set before the Shuffle process and the number of partitions included in the data set after the Shuffle process.
In step 602, the data writing end obtains the number of target service components.
In step 603, according to the number of target service components, the data writing end determines the correspondence between each target service component and the partition of the data set after the Shuffle process.
In step 604, the data writing end sends the data of each partition in the data set before the Shuffle process to the target service component corresponding to the data set after the Shuffle process.
Steps 601 to 604 are identical to steps 201 to 204 and their alternative implementation in the foregoing embodiments, respectively, and the descriptions of steps 201 to 204 and their alternative implementation are also applicable to steps 601 to 604, which are not repeated here.
In step 605, in response to receiving the data sent by the data writing end, the target service component obtains the identification of the partition of the data set after the Shuffle process to which the received data belongs.
In this embodiment, in response to receiving data sent by the data writing end, the target service component may obtain, in various manners, the identification of the partition of the data set after the Shuffle process to which the received data belongs. As an example, the target service component may acquire, from the data writing end that sent the data, the identification of the partition of the data set after the Shuffle process to which the received data belongs. As yet another example, the target service component may determine the correspondence between the received data and the partitions of the data set after the Shuffle process in a manner consistent with step 203 in the foregoing embodiments, and then determine, according to that correspondence, the identification of the partition of the data set after the Shuffle process to which the received data belongs.
In step 606, the target service component writes the received data to the target storage system according to the acquired identification of the partition.
In this embodiment, the target service component may write the received data into the target storage system in various manners according to the identification of the partition obtained in step 605. The target storage system may include various remote storage systems that provide redundant backup, and may include data files whose identifications are consistent with the identifications of the partitions of the data set after the Shuffle process. As an example, the identifications of the partitions of the data set after the Shuffle process may be "partition 1" and "partition 2"; the target service component may then write the received data into the data files "partition 1.data" and "partition 2.data" in the target storage system, respectively.
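The write behaviour of the target service component can be sketched as follows. This is an assumption-laden illustration in Scala: it uses the local file system (java.nio) as a stand-in for the target storage system (a real deployment targeting HDFS would use the Hadoop client instead), the file naming follows the "partition 1.data" example above, and TargetServiceComponentWriter and write are hypothetical names:

    import java.nio.charset.StandardCharsets
    import java.nio.file.{Files, Paths, StandardOpenOption}

    object TargetServiceComponentWriter {
      // Append the received rows to the data file whose identification matches that of the
      // post-Shuffle partition they belong to, e.g. "partition 1.data".
      def write(storageRoot: String, partitionId: String, rows: Seq[String]): Unit = {
        val file  = Paths.get(storageRoot, s"$partitionId.data")
        val bytes = rows.mkString("", "\n", "\n").getBytes(StandardCharsets.UTF_8)
        Files.write(file, bytes, StandardOpenOption.CREATE, StandardOpenOption.APPEND)
      }
    }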
Alternatively, the target storage system may comprise a Hadoop distributed file system.
In some optional implementations of this embodiment, the timing of the interaction between the devices in one embodiment of the data transmission system for the Shuffle procedure may further include step 607. In step 607, data is read from the target storage system to generate the data set after the Shuffle process.
In these implementations, a data reading end (e.g., the server 1053 shown in fig. 1) may read data from the above target storage system and generate the data set after the Shuffle process. The identification of the partition to which the data in the data set after the Shuffle process belongs may be consistent with the identification of the file storing that data in the target storage system.
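Correspondingly, the read side can be sketched as below (again an assumption: the local file system stands in for the target storage system, and ShuffleReadEnd and readPartition are hypothetical names). Because each partition of the post-Shuffle data set corresponds to exactly one file, each read is a single sequential scan rather than many random reads:

    import java.nio.file.{Files, Paths}
    import scala.jdk.CollectionConverters._

    object ShuffleReadEnd {
      // Read the one data file that holds all data of a given post-Shuffle partition.
      def readPartition(storageRoot: String, partitionId: String): Seq[String] =
        Files.readAllLines(Paths.get(storageRoot, s"$partitionId.data")).asScala.toSeq
    }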
In the data transmission system for the Shuffle process provided by the above embodiment of the present application, the data writing end first acquires the data set before the Shuffle process and the number of partitions included in the data set after the Shuffle process. The data writing end then acquires the number of target service components. Next, according to the number of target service components, the data writing end determines the correspondence between each target service component and the partitions of the data set after the Shuffle process. The data writing end then sends the data of each partition in the data set before the Shuffle process to the target service component corresponding to the data set after the Shuffle process. In response to receiving the data sent by the data writing end, the target service component acquires the identification of the partition of the data set after the Shuffle process to which the received data belongs, and writes the received data into the target storage system according to the acquired identification of the partition. Thus, by decoupling the Spark computing process from the storage process, re-execution of the Shuffle process caused by local node failures is effectively avoided. In addition, the target service component directly aggregates, by partition, the data of the RDD to be generated in the Shuffle Read process, i.e., each partition of the child RDD is stored in one corresponding file, so random reads are avoided and the speed of Shuffle Read is improved, thereby improving the stability and performance of Spark.
Referring now to FIG. 7, a schematic diagram of an electronic device (e.g., server 1052 of FIG. 1) 700 suitable for use in implementing embodiments of the present application is shown. The server illustrated in fig. 7 is merely an example, and should not impose any limitation on the functionality and scope of use of embodiments of the present application.
As shown in fig. 7, the electronic device 700 may include a processing device (e.g., a central processing unit, a graphics processing unit, etc.) 701, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 702 or a program loaded from a storage device 708 into a random access memory (RAM) 703. Various programs and data required for the operation of the electronic device 700 are also stored in the RAM 703. The processing device 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
In general, the following devices may be connected to the I/O interface 705: an input device 706 including, for example, a touch screen, touchpad, keyboard, mouse, etc.; an output device 707 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; a storage device 708 including, for example, magnetic tape, hard disk, etc.; and a communication device 709. The communication device 709 may allow the electronic device 700 to communicate wirelessly or by wire with other devices to exchange data. While fig. 7 shows an electronic device 700 having various devices, it is to be understood that not all of the illustrated devices are required to be implemented or provided; more or fewer devices may be implemented or provided instead. Each block shown in fig. 7 may represent one device or a plurality of devices as needed.
In particular, according to embodiments of the present application, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication device 709, or installed from the storage device 708, or installed from the ROM 702. When executed by the processing device 701, the computer program performs the above-described functions defined in the methods of the embodiments of the present application.
It should be noted that, the computer readable medium according to the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In an embodiment of the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. Whereas in embodiments of the present disclosure, the computer-readable signal medium may comprise a data signal propagated in baseband or as part of a carrier wave, with computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (Radio Frequency), and the like, or any suitable combination thereof.
The computer readable medium may be contained in the server; or may exist alone without being assembled into the server. The computer readable medium carries one or more programs which, when executed by the server, cause the server to: acquiring a data set before a Shuffle process and the number of partitions included in the data set after the Shuffle process, wherein the data set before the Shuffle process comprises at least one partition; acquiring the number of target service components; determining the corresponding relation between each target service component and the partition of the data set after the Shuffle process according to the number of the target service components; and respectively sending the data of each partition in the data set before the Shuffle process to a target service component corresponding to the data set after the Shuffle process.
Computer program code for carrying out operations of embodiments of the present disclosure may be written in one or more programming languages or combinations thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments described in the present disclosure may be implemented by means of software or by means of hardware. The described units may also be provided in a processor, which may, for example, be described as: a processor including a first acquiring unit, a second acquiring unit, a determining unit, and a first sending unit. The names of these units do not, in some cases, limit the units themselves; for example, the first acquiring unit may also be described as "a unit that acquires the data set before the Shuffle process and the number of partitions included in the data set after the Shuffle process, where the data set before the Shuffle process includes at least one partition".
The foregoing description is only of the preferred embodiments of the present disclosure and an explanation of the technical principles employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features, but also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) features having similar functions disclosed in the embodiments of the present disclosure.

Claims (10)

1. A data transmission method for a Shuffle procedure, comprising:
acquiring the data set before the Shuffle process and the number of partitions included in the data set after the Shuffle process, wherein the data set before the Shuffle process comprises at least one partition;
acquiring the number of target service components;
determining the corresponding relation between each target service component and the partition of the data set after the Shuffle process according to the number of the target service components;
and respectively sending the data of each partition in the data set before the Shuffle process to a target service component corresponding to the data set after the Shuffle process.
2. The method of claim 1, wherein the determining, according to the number of target service components, a correspondence between each target service component and a partition of the data set after the Shuffle process comprises:
acquiring the identification of the partition of the data set after the Shuffle process;
determining a hash value corresponding to the identification of the partition of the data set after the Shuffle process according to a preset hash function;
and determining the corresponding relation between each target service component and the partition of the data set after the Shuffle process according to the hash value.
3. The method of claim 2, wherein the pre-set hash function comprises modulo a number of the target service components.
4. A method according to one of claims 1-3, wherein the method further comprises:
and sending the identification of the partition of the data set after the Shuffle process to which the sent data belongs to the corresponding target service component.
5. A data transmission apparatus for a Shuffle process, comprising:
a first obtaining unit configured to obtain the data set before the Shuffle process and the number of partitions included in the data set after the Shuffle process, wherein the data set before the Shuffle process includes at least one partition;
a second acquisition unit configured to acquire the number of target service components;
a determining unit configured to determine, according to the number of target service components, a correspondence between each target service component and a partition of the data set after the Shuffle procedure;
and the first sending unit is configured to send the data of each partition in the data set before the Shuffle process to a target service component corresponding to the data set after the Shuffle process.
6. A data transmission system for a Shuffle process, comprising:
a data writing end configured to perform the method of any one of claims 1-4;
a target service component configured to: in response to receiving data sent by the data writing end, acquire the identification of the partition of the data set after the Shuffle process to which the received data belongs; and write the received data into a target storage system according to the acquired identification of the partition, wherein the target storage system comprises a data file whose identification is consistent with the identification of the partition of the data set after the Shuffle process.
7. The system of claim 6, wherein the system further comprises:
a data reading end configured to read data from the target storage system and generate the data set after the Shuffle process, wherein the identification of the partition to which the data in the data set after the Shuffle process belongs is consistent with the identification of the file storing that data in the target storage system.
8. The system of claim 6 or 7, wherein the target storage system comprises a Hadoop distributed file system.
9. A server, comprising:
one or more processors;
a storage device having one or more programs stored thereon;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-4.
10. A computer readable medium having stored thereon a computer program, wherein the program when executed by a processor implements the method of any of claims 1-4.
CN202010536544.5A 2020-06-12 2020-06-12 Data transmission method and device for Shuffle process Active CN113761548B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010536544.5A CN113761548B (en) 2020-06-12 2020-06-12 Data transmission method and device for Shuffle process

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010536544.5A CN113761548B (en) 2020-06-12 2020-06-12 Data transmission method and device for Shuffle process

Publications (2)

Publication Number Publication Date
CN113761548A CN113761548A (en) 2021-12-07
CN113761548B (en) 2024-03-08

Family

ID=78785388

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010536544.5A Active CN113761548B (en) 2020-06-12 2020-06-12 Data transmission method and device for Shuffle process

Country Status (1)

Country Link
CN (1) CN113761548B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1741045A (en) * 2005-04-04 2006-03-01 重庆大学 Method and apparatus for counting personnel in and out of public places
CN109150927A (en) * 2017-06-15 2019-01-04 北京京东尚科信息技术有限公司 File delivery method and device for document storage system
CN109343833A (en) * 2018-09-20 2019-02-15 北京神州泰岳软件股份有限公司 Data processing platform (DPP) and data processing method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9600526B2 (en) * 2012-12-06 2017-03-21 At&T Intellectual Property I, L.P. Generating and using temporal data partition revisions
GB2509504A (en) * 2013-01-04 2014-07-09 Ibm Accessing de-duplicated data files stored across networked servers
CN113228000A (en) * 2018-10-26 2021-08-06 斯诺弗雷克公司 Incremental refresh of materialized views

Also Published As

Publication number Publication date
CN113761548A (en) 2021-12-07

Similar Documents

Publication Publication Date Title
CN110609872B (en) Method and apparatus for synchronizing node data
US20180341516A1 (en) Processing jobs using task dependencies
CN109873863B (en) Asynchronous calling method and device of service
US20210200806A1 (en) Method and apparatus for parallel processing of information
CN111338834B (en) Data storage method and device
US20130325960A1 (en) Client-side sharing of event information
CN111694639A (en) Method and device for updating address of process container and electronic equipment
CN111444148B (en) Data transmission method and device based on MapReduce
CN112825525B (en) Method and apparatus for processing transactions
CN110795143B (en) Method, apparatus, computing device, and medium for processing functional modules
CN110704099B (en) Alliance chain construction method and device and electronic equipment
CN113761548B (en) Data transmission method and device for Shuffle process
CN112506781B (en) Test monitoring method, device, electronic equipment, storage medium and program product
CN112860447B (en) Interaction method and system between different applications
CN113064704A (en) Task processing method and device, electronic equipment and computer readable medium
CN112988738A (en) Data slicing method and device for block chain
CN111314457B (en) Method and device for setting virtual private cloud
US20230050284A1 (en) Identity Graph Data Structure with Entity-Level Opt-Ins
CN114398098B (en) Application script execution method, device, electronic equipment and computer readable medium
CN115562892B (en) Redis-based simulation system time management method, system, device and equipment
CN113472565B (en) Method, apparatus, device and computer readable medium for expanding server function
US10884832B2 (en) Aggregation messaging within an integration environment
US20230185853A1 (en) Identity Graph Data Structure System and Method with Entity-Level Opt-Outs
CN112311833B (en) Data updating method and device
CN115705193A (en) Distributed compiling method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant