CN111444148A - Data transmission method and device based on MapReduce - Google Patents


Info

Publication number
CN111444148A
CN111444148A
Authority
CN
China
Prior art keywords
calculation result
data
result file
map
file
Prior art date
Legal status
Granted
Application number
CN202010273234.9A
Other languages
Chinese (zh)
Other versions
CN111444148B (en)
Inventor
耿筱喻
顾荣
郭俊
Current Assignee
Nanjing University
Beijing ByteDance Network Technology Co Ltd
Original Assignee
Nanjing University
Beijing ByteDance Network Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Nanjing University, Beijing ByteDance Network Technology Co Ltd filed Critical Nanjing University
Priority to CN202010273234.9A priority Critical patent/CN111444148B/en
Publication of CN111444148A publication Critical patent/CN111444148A/en
Application granted granted Critical
Publication of CN111444148B publication Critical patent/CN111444148B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10 - File systems; File servers
    • G06F 16/16 - File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10 - File systems; File servers
    • G06F 16/13 - File access structures, e.g. distributed indices
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10 - File systems; File servers
    • G06F 16/18 - File system types
    • G06F 16/182 - Distributed file systems
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 - Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 - Protocols
    • H04L 67/06 - Protocols specially adapted for file transfer, e.g. file transfer protocol [FTP]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present application disclose a MapReduce-based data transmission method and device. In one embodiment, the method comprises: executing a Map task to generate a calculation result file, where the file contains one partition per Reduce end together with the data corresponding to each partition; and uploading the calculation result file to a target file system that provides redundant storage, so that the corresponding Reduce end can obtain the data in the file through that file system. The target file system names the calculation result file according to a predetermined naming rule and stores it according to a predetermined directory structure. This embodiment avoids the computing-resource consumption and time cost of recomputation, improves the stability of the Shuffle process, and offers good generality.

Description

Data transmission method and device based on MapReduce
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a data transmission method and device based on MapReduce.
Background
With the rapid development of computer technology, the MapReduce distributed computing framework has seen increasingly wide adoption. Between the Map and Reduce phases, a Shuffle is required to transfer data from the output of a Map task to the input of a Reduce task. Because the Shuffle is an indispensable bridge between the Map and Reduce phases, and is typically accompanied by heavy network transmission and disk I/O, its performance often directly determines the performance and throughput of the entire MapReduce job.
In practice, node failures, network delays, high cluster load, and similar conditions cause data-transmission timeouts during the Shuffle, so a Reduce end (Reducer) cannot obtain the data it needs from a Map end (Mapper) and the Shuffle fails. The conventional remedy relies on MapReduce's built-in fault-tolerance mechanism to recompute the failed task, that is, to re-run the Map phase on part of the input data and then resume the interrupted Reduce phase.
Disclosure of Invention
The embodiment of the application provides a data transmission method and device based on MapReduce.
In a first aspect, an embodiment of the present application provides a MapReduce-based data transmission method applied at the Map end. The method includes: executing a Map task to generate a calculation result file, where the file contains one partition per Reduce end together with the data corresponding to each partition; and uploading the calculation result file to a target file system that provides redundant storage, so that the corresponding Reduce end obtains the data in the file through that file system, where the target file system names the calculation result file according to a predetermined naming rule and stores it according to a predetermined directory structure.
In some embodiments, the calculation result file includes an index file and a data file: the data file records the data, and the index file identifies the start and end positions of each partition's data within the data file. The method further includes sending metadata information of the calculation result file to a task scheduling end (the MapReduce task scheduler), where the metadata information includes the correspondence between the calculation result file and the Map end.
In some embodiments, the predetermined naming rule specifies that the name of the calculation result file contains the identifier of the corresponding Map end, and that the data file and index file are distinguished by their suffixes.
In some embodiments, the predetermined directory structure is a tree that contains, from top to bottom, the identifier of the application, the identifier of a MapReduce process belonging to that application, the identifier of a Map task belonging to that MapReduce process, and the identifier of a calculation result file belonging to that Map task.
In a second aspect, an embodiment of the present application provides a MapReduce-based data transmission method applied at the Reduce end. The method includes: in response to determining that fetching data from the Map end corresponding to the Reduce end has failed, obtaining the data of the partition corresponding to the Reduce end from a target file system that provides redundant storage, where the target file system stores the calculation result file according to a predetermined directory structure, and the file contains one partition per Reduce end together with the data corresponding to each partition; and executing the Reduce task with the obtained data to generate the final result of the MapReduce process.
In some embodiments, before obtaining the data of the partition corresponding to the Reduce end from the file system, the method further includes: obtaining, from the task scheduling end, the address of at least one Map end corresponding to the Reduce end, where the task scheduling end stores metadata information of the calculation result file, and the metadata information includes the correspondence between the calculation result file and the Map end; and sending, to the at least one Map end according to its address, a data acquisition request for the data of the partition corresponding to the Reduce end.
In a third aspect, an embodiment of the present application provides a MapReduce-based data transmission device applied at the Map end. The device includes: a first generation unit configured to execute a Map task to generate a calculation result file, where the file contains one partition per Reduce end together with the data corresponding to each partition; and an uploading unit configured to upload the calculation result file to a target file system that provides redundant storage, so that the corresponding Reduce end obtains the data in the file through that file system, where the target file system names the calculation result file according to a predetermined naming rule and stores it according to a predetermined directory structure.
In some embodiments, the calculation result file includes an index file and a data file: the data file records the data, and the index file identifies the start and end positions of each partition's data within the data file. The device further includes a sending unit configured to send metadata information of the calculation result file to the task scheduling end, where the metadata information includes the correspondence between the calculation result file and the Map end.
In some embodiments, the predetermined naming rule specifies that the name of the calculation result file contains the identifier of the corresponding Map end, and that the data file and index file are distinguished by their suffixes.
In some embodiments, the predetermined directory structure is a tree that contains, from top to bottom, the identifier of the application, the identifier of a MapReduce process belonging to that application, the identifier of a Map task belonging to that MapReduce process, and the identifier of a calculation result file belonging to that Map task.
In a fourth aspect, an embodiment of the present application provides a MapReduce-based data transmission device applied at the Reduce end. The device includes: a first obtaining unit configured to, in response to determining that fetching data from the Map end corresponding to the Reduce end has failed, obtain the data of the partition corresponding to the Reduce end from a target file system that provides redundant storage, where the target file system stores the calculation result file according to a predetermined directory structure, and the file contains one partition per Reduce end together with the data corresponding to each partition; and a second generation unit configured to execute the Reduce task with the obtained data to generate the final result of the MapReduce process.
In some embodiments, the device further includes: an address obtaining unit configured to obtain, from the task scheduling end, the address of at least one Map end corresponding to the Reduce end, where the task scheduling end stores metadata information of the calculation result file, and the metadata information includes the correspondence between the calculation result file and the Map end; and a second obtaining unit configured to send, to the at least one Map end according to its address, a data acquisition request for the data of the partition corresponding to the Reduce end.
In a fifth aspect, an embodiment of the present application provides a MapReduce-based data transmission system. The system includes: a Map end configured to perform the method described in any implementation of the first aspect; a Reduce end configured to perform the method described in any implementation of the second aspect; and a target file system configured to, in response to receiving a calculation result file, name the file according to a predetermined naming rule and store it according to a predetermined directory structure.
In some embodiments, the target file system is further configured to store multiple copies of the computed result file in data nodes of the target file system.
In a sixth aspect, an embodiment of the present application provides an electronic device, including: one or more processors; a storage device having one or more programs stored thereon; when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method as described in any implementation of the first aspect.
In a seventh aspect, the present application provides a computer-readable medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method described in any implementation manner of the first aspect.
According to the MapReduce-based data transmission method and device, a Map task is first executed to generate a calculation result file, which contains one partition per Reduce end together with the data corresponding to each partition. The calculation result file is then uploaded to a target file system that provides redundant storage, so the corresponding Reduce end can obtain the data in the file through that file system. The target file system names the calculation result file according to a predetermined naming rule and stores it according to a predetermined directory structure. The target file system thus backs up the calculation result file and provides a standby data source for the Reduce end when the Shuffle fails. This avoids the computing-resource consumption and time cost of recomputation and improves the stability of the Shuffle process. Moreover, because the MapReduce model itself is left unchanged internally, the method is minimally invasive to the computing framework and has good generality.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a MapReduce-based data transmission method according to the present application;
FIG. 3 is a diagram illustrating a directory structure of a MapReduce-based data transmission method according to an embodiment of the present application;
FIG. 4 is a flow diagram of yet another embodiment of a MapReduce-based data transmission method according to the present application;
FIG. 5 is a schematic diagram of an embodiment of a MapReduce-based data transmission device according to the application;
FIG. 6 is a schematic diagram of an embodiment of a MapReduce-based data transmission device according to the application;
FIG. 7a is a timing diagram of interactions between various devices in one embodiment of a MapReduce-based data transfer according to the present application;
FIG. 7b is a schematic diagram illustrating interaction between devices in one embodiment of MapReduce-based data transfer according to the present application;
FIG. 8 is a schematic structural diagram of an electronic device suitable for use in implementing embodiments of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein merely illustrate the relevant invention and do not restrict it. It should also be noted that, for convenience of description, only the portions related to the relevant invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 illustrates an exemplary architecture 100 to which the MapReduce-based data transmission method or the MapReduce-based data transmission apparatus of the present application may be applied.
As shown in FIG. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and servers 1051, 1052, 1053, 1054, 1055, 1056. The network 104 provides the medium for communication links between the terminal devices 101, 102, 103 and the server 1051. The network 104 may include various connection types, such as wired links, wireless links, or fiber-optic cables.
The terminal devices 101, 102, 103 interact with the server 1051 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.
The terminal apparatuses 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices having a display screen and supporting display of calculation results, including but not limited to smart phones, tablet computers, e-book readers, laptop portable computers, desktop computers, and the like. When the terminal apparatuses 101, 102, 103 are software, they can be installed in the electronic apparatuses listed above. It may be implemented as multiple pieces of software or software modules (e.g., software or software modules used to provide distributed services) or as a single piece of software or software module. And is not particularly limited herein.
The server 1051 may be a server providing various services, such as a background server providing support for applications running on the terminal devices 101, 102, 103. The background server can decompose the received computing tasks, send the decomposed computing tasks to the computing nodes for computing to generate computing results, and feed the computing results back to the terminal equipment. Specifically, the server 1051 (e.g., the MRAppMaster node) may assign the Map tasks to the Map nodes 1052, 1053 for execution. The Map node may upload intermediate result files generated by executing the Map task to the file server 1054. Reduce nodes 1055, 1056 may obtain intermediate result files from Map nodes 1052, 1053 or file servers 1054 to perform Reduce tasks and generate computation results.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or may be implemented as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., software or software modules used to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be noted that the MapReduce-based data transmission method provided in the embodiment of the present application is generally executed by the servers 1052, 1053 or 1055, 1056, and accordingly, the MapReduce-based data transmission apparatus is generally disposed in the servers 1052, 1053 or 1055, 1056.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to fig. 2, a flow 200 of one embodiment of a MapReduce-based data transmission method according to the present application is shown. The data transmission method based on MapReduce can be applied to a Map terminal and comprises the following steps:
step 201, executing Map task to generate calculation result file.
In the present embodiment, the execution body of the MapReduce-based data transmission method (such as the server 1052 or 1053 shown in FIG. 1) may execute a Map task on the received data, thereby generating a calculation result file. The calculation result file may include one partition per Reduce end together with the data corresponding to each partition.
In this embodiment, the execution body may obtain the Map task through a wired or wireless connection. For example, it may obtain the Map task from an electronic device communicatively connected to it (e.g., the server 1051 shown in FIG. 1). The Map task may include data-processing logic submitted by the user through a terminal device. The execution body can read an input data set from a designated location and process it according to that logic to obtain intermediate results, which it then writes to a memory buffer (e.g., a ring buffer). During writing, the records are sorted by partition (sort). In response to determining that the amount of written data reaches a preset threshold (e.g., 80% of the buffer), the execution body may start a spill thread to spill the buffered data to the local disk, generating a temporary file. Once all intermediate results of the Map task have been written, the execution body determines whether any spilled temporary files exist. If none exist, it writes the data remaining in the memory buffer directly to the local disk, generating the calculation result file. If they do exist, it merges the data in all temporary files and the memory buffer by partition, combining data that belongs to the same partition. It then re-sorts the merged data by partition and writes the sorted data to the local disk, generating the calculation result file.
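The partition/sort/spill/merge flow just described can be sketched as a small in-memory simulation. This is a hypothetical illustration, not code from the patent: the names (`partition_of`, `spill`, `merge_spills`) and the record-count threshold standing in for the 80% buffer threshold are all assumptions.

```python
# Hypothetical sketch of the Map-side spill-and-merge flow; all names and
# thresholds are illustrative, not taken from the patent.

NUM_REDUCERS = 3          # one partition per Reduce end
SPILL_THRESHOLD = 4       # records buffered before a spill (stand-in for 80% of a ring buffer)

def partition_of(key, num_partitions=NUM_REDUCERS):
    """Assign a record to a partition by the hash of its key."""
    return hash(key) % num_partitions

def spill(buffer):
    """Sort buffered records by (partition, key) and emit one 'temporary file'."""
    return sorted(buffer, key=lambda kv: (partition_of(kv[0]), kv[0]))

def merge_spills(spills):
    """Merge all temporary files into one partition-ordered result file."""
    merged = [kv for s in spills for kv in s]
    return sorted(merged, key=lambda kv: (partition_of(kv[0]), kv[0]))

def run_map_task(records):
    buffer, spills = [], []
    for kv in records:
        buffer.append(kv)
        if len(buffer) >= SPILL_THRESHOLD:   # buffer reached its threshold: spill to "disk"
            spills.append(spill(buffer))
            buffer = []
    if buffer:                               # flush whatever remains in the buffer
        spills.append(spill(buffer))
    return merge_spills(spills)
```

In a real Map end the buffer holds serialized bytes and the spill files live on the local disk, but the ordering invariant is the same: the final file groups all records of partition 0, then partition 1, and so on.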
Optionally, sorting by partition may specifically mean sorting by the partition to which the hash value of each intermediate result's key belongs. Optionally, the execution body may further combine key-value pairs that share the same key.
In the MapReduce framework, multiple Map ends can read the input data set in parallel.
In some optional implementations of this embodiment, the calculation result file may include an index file and a data file. The data file may be used for recording data. The index file may be used to identify a start position and an end position of the data of each partition in the data file.
In this optional implementation, the index file and the data file are generated on the execution body's local disk. Each Reduce end can therefore use the index file to locate the data of its partition more quickly, improving the efficiency of data lookup and retrieval.
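The data-file/index-file pair can be sketched as follows: partition payloads are written back-to-back into the data file, and the index file records each partition's start and end offsets. The binary layout here (two big-endian 64-bit offsets per partition) is an illustrative assumption, not a format specified by the patent.

```python
# Hypothetical sketch of writing a calculation result file as a data file
# plus an index file of (start, end) offsets; the binary layout is assumed.
import io
import struct

def write_result_files(partition_payloads):
    """Concatenate each partition's bytes into a data file and record each
    partition's (start, end) byte range in an index file."""
    data, index = io.BytesIO(), io.BytesIO()
    offset = 0
    for payload in partition_payloads:
        data.write(payload)
        index.write(struct.pack(">QQ", offset, offset + len(payload)))
        offset += len(payload)
    return data.getvalue(), index.getvalue()
```

With this layout, a Reduce end that wants partition i never scans the data file; it reads one fixed-size index record and seeks straight to the byte range.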
In some optional implementations of this embodiment, given the index file and data file that make up the calculation result file, the execution body may further send metadata information of the calculation result file to the task scheduling end. The metadata information may include the correspondence between the calculation result file and the Map end. In general, the task scheduling end maintains the metadata information uploaded by every Map end in the MapReduce job.
In this optional implementation, each Reduce end in the MapReduce job can use the metadata information to determine the Map node that holds each piece of data it requires.
Step 202, uploading the calculation result file to a target file system providing redundant storage, so that the corresponding Reduce end acquires data in the calculation result file through the target file system.
In this embodiment, the execution body may upload the calculation result file generated in step 201 to a target file system providing redundant storage, over a wired or wireless connection. That file system may name the calculation result file according to a predetermined naming rule and store it according to a predetermined directory structure; the naming rule and directory structure can be any rule and structure specified in advance. In general, the target file system organizes its files through this naming rule and directory structure.
In this embodiment, the target file system providing redundant storage may be any file system, specified in advance according to the actual application requirements, that is distinct from the Map end's local disk. It may also be a file system chosen by rule, for example one capable of keeping multiple replicas, such as HDFS (Hadoop Distributed File System).
In some optional implementation manners of this embodiment, the predetermined naming rule may include that the name of the calculation result file includes an identifier of a corresponding Map end, and the data file and the index file of the calculation result file are distinguished according to a suffix.
As an example, after the Map ends identified as "0" and "1" write their calculation result files to the local disk, they upload the files to the target file system, and the file names carry the unique identifier of the originating Map end. For example, the index file and data file uploaded by the Map end identified as "0" are stored in the target file system as "0.index" and "0.data", respectively, while those uploaded by the Map end identified as "1" are stored as "1.index" and "1.data".
This implementation avoids the extra overhead of maintaining file-name information and the mapping between Map ends and files, making it convenient for a Reduce end to access the required file and data directly.
In some optional implementations of this embodiment, the predetermined directory structure may include a tree structure. The tree structure may include, from top to bottom, an identifier of an application, an identifier of a MapReduce process belonging to the application, an identifier of a Map task belonging to the MapReduce process, and an identifier of a calculation result file belonging to the Map task.
As an example, as shown in FIG. 3, the first level of the tree structure 300 may be the unique identifier (application id) of an application, e.g., "application_0", which distinguishes data belonging to different applications. The second level may be the unique identifier (shuffle id) of a Shuffle within a MapReduce process. Since a single application may run multiple MapReduce processes, there may be multiple sets of Shuffle data; to distinguish them, the unique identifiers of the Shuffle processes of "application_0", such as "shuffle_0" and "shuffle_1", serve as second-level directories. The third level may be the unique identifier (map id) of a Map task, such as "map_0" and "map_1". The fourth level holds the identifiers of the stored calculation result files, such as "0.index" and "0.data". The data file and index file belonging to the same Map task are thus stored under that Map task's unique identifier.
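Because the path follows fixed rules, either side can derive it without a metadata lookup. A minimal sketch, assuming the prefixes and separator implied by the example identifiers above; the helper name `result_file_path` is hypothetical.

```python
# Hypothetical derivation of the four-level storage path described above;
# the "application_/shuffle_/map_" prefixes follow the example identifiers.

def result_file_path(app_id, shuffle_id, map_id, mapper_id, kind):
    """Build application/shuffle/map/<mapper-id>.<suffix> deterministically."""
    assert kind in ("index", "data")   # the two suffixes that distinguish the files
    return f"application_{app_id}/shuffle_{shuffle_id}/map_{map_id}/{mapper_id}.{kind}"
```

A Reduce end that knows the application, Shuffle, Map task, and Map-end identifiers can compute this path locally and go straight to the target file system.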
This implementation spares the execution body the extra overhead of maintaining metadata such as file storage paths, improving the efficiency of looking up calculation result files in the target file system. Moreover, a Reduce end can derive the storage path directly from the calculation result file it needs and access the target file system without obtaining the file path from any other system.
In some optional implementation manners of this embodiment, the data nodes of the target file system may further store multiple copies of each calculation result file in a distributed manner, so as to implement data backup.
In one prior-art approach, the Map end writes the calculation result file generated by a Map task only to its local disk, so when the Map node fails, the network is delayed, or similar conditions occur, the required data cannot be delivered to the corresponding Reduce end and the MapReduce model must recompute. In the method provided by this embodiment, after writing the calculation result file to the local disk, the Map end additionally uploads it to a target file system providing redundant storage; that file system backs up the calculation result file and provides a standby data source the Reduce end can use when the Shuffle fails. This avoids the computing-resource consumption and time cost of recomputation and improves the stability of the Shuffle process. Moreover, because the MapReduce model itself is left unchanged internally, the method is minimally invasive to the computing framework and has good generality.
With continued reference to fig. 4, fig. 4 is a flowchart 400 of an embodiment of a MapReduce-based data transmission method according to an embodiment of the present application. The data transmission method based on MapReduce can be applied to Reduce terminals and comprises the following steps:
step 401, in response to determining that the data acquisition from the Map end corresponding to the Reduce end fails, acquiring data of a partition corresponding to the Reduce end from a target file system providing redundant storage.
In this embodiment, an execution body (such as the server 1055 or 1056 shown in fig. 1) of the MapReduce-based data transmission method may first determine, in various ways, that data acquisition from the Map end corresponding to the Reduce end has failed. For example, the Reduce end may have timed out communicating with the Map end, or the content returned by the Map end may be abnormal. Then, the execution body may obtain the data of the partition corresponding to the Reduce end from the target file system providing redundant storage. The target file system providing redundant storage may store the calculation result file according to a predetermined directory structure. The calculation result file may include partitions whose number is consistent with the number of Reduce ends, and data corresponding to the partitions.
It should be noted that the target file system and the calculation result file providing the redundant storage may be consistent with the description in the foregoing embodiments, and are not described herein again.
Specifically, in response to determining that the data acquisition has failed, the execution body may derive the storage path of the calculation result file generated by the Map end according to the generation rule of the directory structure of the target file system providing redundant storage. Then, the execution body may obtain the data of the corresponding partition according to the storage path.
Optionally, based on the index file and the data file included in the calculation result file, the execution body may parse the index file in the calculation result file according to the storage path, so as to obtain the start position and end position of the data of the corresponding partition within the data file corresponding to the index file. Then, the execution body may read the data from the corresponding data file.
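The index lookup described above might look like the following sketch, assuming a hypothetical index layout of consecutive 8-byte big-endian offsets in which entries p and p+1 bound the byte range of partition p; the real file format is not specified here.

```python
import struct

def read_partition(index_path, data_path, partition):
    """Locate a partition's byte range via the index file, then read
    exactly that range from the data file.

    Assumed layout: the index file holds consecutive 8-byte big-endian
    offsets; offset p is the start and offset p+1 the end of partition p.
    """
    with open(index_path, "rb") as f:
        f.seek(partition * 8)
        start, end = struct.unpack(">2q", f.read(16))
    with open(data_path, "rb") as f:
        f.seek(start)
        return f.read(end - start)
```

Because only the needed byte range is read, a Reduce end fetching one partition does not have to download the whole data file.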
In some optional implementations of this embodiment, before step 401, the execution body may further perform the following steps:
First, the address of at least one Map end corresponding to the Reduce end is obtained from the task scheduling end.
In these implementations, the execution body may obtain, from the task scheduling end, the address of each Map end in the MapReduce. The task scheduling end may store metadata information of the calculation result file. The metadata information may include a correspondence between the calculation result file and the Map end.
It should be noted that the task scheduling end may be consistent with the description in the foregoing embodiments, and is not described herein again.
Second, a data acquisition request for obtaining the data of the partition corresponding to the Reduce end is sent to the at least one Map end according to the address of the at least one Map end.
In these implementations, the execution body may send, to each Map end in the MapReduce, a data acquisition request for obtaining the data of the partition corresponding to the Reduce end.
It should be noted that, in the MapReduce framework, multiple Reduce ends may read the data of their respective partitions from the Map ends or from the target file system in parallel.
And step 402, executing a Reduce task by using the acquired data to generate a final result of the MapReduce process.
In this embodiment, the execution body may execute the Reduce task using the data acquired in step 401 to generate the final result of the MapReduce process. Specifically, the execution body may first sort and merge the acquired data, and then compute according to the data processing logic indicated by the Reduce task, thereby obtaining the final result.
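The sort-merge-then-compute step can be sketched as follows; `run_reduce_task` and the word-count-style `reduce_fn` are hypothetical illustrations of the Reduce-side logic, not the application's concrete implementation.

```python
from itertools import groupby
from operator import itemgetter

def run_reduce_task(records, reduce_fn):
    """Sort the fetched (key, value) records, group them by key, and
    apply the Reduce logic to each group of values."""
    records = sorted(records, key=itemgetter(0))
    return {
        key: reduce_fn(key, [v for _, v in group])
        for key, group in groupby(records, key=itemgetter(0))
    }
```

For example, with a summing `reduce_fn`, records fetched from several Map ends collapse into one aggregate per key, which is the final result of the MapReduce process for this Reduce end's partition.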
At present, in one prior-art approach, when the Reduce end cannot obtain the required data from the Map end because of Map node failure, network delay, heavy load, or the like, the MapReduce model recomputes the Map stage, which wastes computing resources and time. In the method provided by this embodiment of the application, when the required data cannot be acquired from the Map end, the target file system providing redundant storage for the calculation result file is accessed instead, so that the consumption of computing resources and the time cost caused by recomputation are avoided, and the stability of the Shuffle process is improved. In addition, because no internal change is made to the MapReduce model itself, the method is minimally intrusive to the computation framework and has good generality.
With further reference to fig. 5, as an implementation of the methods shown in the above diagrams, the present application provides an embodiment of a data transmission apparatus based on MapReduce, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus may be specifically applied to various electronic devices.
As shown in fig. 5, the MapReduce-based data transmission apparatus 500 provided by the present embodiment includes a first generation unit 501 and an upload unit 502. The first generating unit 501 is configured to execute a Map task to generate a calculation result file, where the calculation result file includes partitions whose number is consistent with that of Reduce ends and data corresponding to the partitions; the uploading unit 502 is configured to upload the calculation result file to a target file system providing redundant storage, so that the corresponding Reduce end obtains data in the calculation result file through the target file system, where the target file system names the calculation result file according to a predetermined naming rule and stores the calculation result file according to a predetermined directory structure.
In the present embodiment, in the MapReduce-based data transmission apparatus 500: the detailed processing of the first generating unit 501 and the uploading unit 502 and the technical effects thereof can refer to the related descriptions of step 201 and step 202 in the corresponding embodiment of fig. 2, which are not repeated herein.
In some optional implementations of this embodiment, the calculation result file may include an index file and a data file. The data file may be used for recording data. The index file may be used to identify the start position and end position of the data of each partition in the data file. The MapReduce-based data transmission apparatus 500 may further include: a sending unit (not shown in the figure) configured to send the metadata information of the calculation result file to the task scheduling end. The metadata information may include a correspondence between the calculation result file and the Map end.
In some optional implementations of this embodiment, the predetermined naming rule may include that the name of the calculation result file contains the identifier of the corresponding Map end, and that the data file and the index file of the calculation result file are distinguished by suffix.
In some optional implementations of this embodiment, the predetermined directory structure may include a tree structure comprising, from top to bottom, the identifier of the application, the identifier of the MapReduce process belonging to the application, the identifier of the Map task belonging to the MapReduce process, and the identifier of the calculation result file belonging to the Map task.
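Under these naming and directory rules, a path builder might look like the following sketch; the path separator and the exact identifier strings are illustrative assumptions.

```python
import posixpath

def result_file_paths(app_id, shuffle_id, map_task_id, map_id):
    """Build (data, index) storage paths following the assumed tree
    structure -- application / MapReduce process / Map task -- and the
    naming rule: the Map end's identifier plus a .data or .index suffix."""
    base = posixpath.join(str(app_id), str(shuffle_id), str(map_task_id))
    return (posixpath.join(base, f"{map_id}.data"),
            posixpath.join(base, f"{map_id}.index"))
```

Because both the Map end (when writing) and the Reduce end (when falling back) can evaluate this rule locally, neither side needs a separate metadata service to resolve file locations.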
In the apparatus provided by the above embodiment of the present application, the first generating unit 501 executes the Map task to generate the calculation result file. The calculation result file includes partitions whose number is consistent with the number of Reduce ends, and data corresponding to the partitions. Then, the uploading unit 502 uploads the calculation result file to the target file system providing redundant storage, so that the corresponding Reduce end obtains the data in the calculation result file through the target file system. The target file system names the calculation result file according to a predetermined naming rule and stores it according to a predetermined directory structure. In this way, the calculation result file is backed up by the target file system, providing a backup data source for the Reduce end to fall back on when the Shuffle fails. Consumption of computing resources and the time cost caused by recomputation are also avoided, and the stability of the Shuffle process is improved. In addition, because no internal change is made to the MapReduce model itself, the apparatus is minimally intrusive to the computation framework and has good generality.
With further reference to fig. 6, as an implementation of the methods shown in the above diagrams, the present application provides an embodiment of a data transmission apparatus based on MapReduce, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 4, and the apparatus may be specifically applied to various electronic devices.
As shown in fig. 6, the MapReduce-based data transmission apparatus 600 provided in this embodiment includes a first obtaining unit 601 and a second generating unit 602. The first obtaining unit 601 is configured to, in response to determining that data obtaining from a Map end corresponding to a Reduce end fails, obtain data of partitions corresponding to the Reduce end from a target file system providing redundant storage, where the target file system stores a calculation result file according to a predetermined directory structure, and the calculation result file includes partitions whose number is consistent with that of the Reduce end and data corresponding to the partitions; a second generating unit 602 configured to execute the Reduce task using the acquired data to generate a final result of the MapReduce process.
In the present embodiment, in the MapReduce-based data transmission apparatus 600: the specific processing of the first obtaining unit 601 and the second generating unit 602 and the technical effects thereof can refer to the related descriptions of step 401 and step 402 in the corresponding embodiment of fig. 4, which are not repeated herein.
In some optional implementations of this embodiment, the MapReduce-based data transmission apparatus 600 may further include an address obtaining unit (not shown in the figure) and a second obtaining unit (not shown in the figure). The address obtaining unit may be configured to obtain, from the task scheduling end, the address of at least one Map end corresponding to the Reduce end. The task scheduling end may store metadata information of the calculation result file. The metadata information may include a correspondence between the calculation result file and the Map end. The second obtaining unit may be configured to send, to the at least one Map end, a data acquisition request for obtaining the data of the partition corresponding to the Reduce end according to the address of the at least one Map end.
In the apparatus provided in the foregoing embodiment of the present application, first, in response to determining that data acquisition from the Map end corresponding to the Reduce end has failed, the first obtaining unit 601 obtains the data of the partition corresponding to the Reduce end from the target file system providing redundant storage. The target file system stores the calculation result file according to a predetermined directory structure. The calculation result file includes partitions whose number is consistent with the number of Reduce ends, and data corresponding to the partitions. Then, the second generating unit 602 executes the Reduce task using the acquired data to generate the final result of the MapReduce process. Consumption of computing resources and the time cost caused by recomputation are therefore avoided, and the stability of the Shuffle process is improved. In addition, because no internal change is made to the MapReduce model itself, the apparatus is minimally intrusive to the computation framework and has good generality.
With further reference to fig. 7a, a timing diagram 700 of the interaction between the devices in one embodiment of the MapReduce-based data transmission method is shown. The MapReduce-based data transmission system may include: a Map end (e.g., the servers 1052, 1053 shown in fig. 1), a Reduce end (e.g., the servers 1055, 1056 shown in fig. 1), and a target file system (e.g., the server 1054 shown in fig. 1). The Map end may be configured to implement the MapReduce-based data transmission method described in the embodiment of fig. 2. The Reduce end may be configured to implement the MapReduce-based data transmission method described in the embodiment shown in fig. 4. The target file system may be configured to, in response to receiving the calculation result file, name the calculation result file according to a predetermined naming rule and store it according to a predetermined directory structure.
In some optional implementations of this embodiment, the target file system is further configured to store multiple copies of the calculation result file in a data node of the target file system.
As shown in fig. 7a, in step 701, the Map end executes a Map task to generate a calculation result file.
In step 702, the Map end uploads the calculation result file to the target file system providing redundant storage.
In some optional implementations of this embodiment, the Map end sends the metadata information of the calculation result file to the task scheduling end (not shown in fig. 1).
In step 703, in response to receiving the calculation result file, the target file system names the calculation result file according to a predetermined naming rule and stores the calculation result file according to a predetermined directory structure.
In this embodiment, the target file system may name the calculation result file according to a predetermined naming rule and store it according to a predetermined directory structure. In this way, the target file system can uniformly manage the calculation result files uploaded by the Map ends in the MapReduce.
It should be noted that the above-mentioned predetermined naming rules and predetermined directory structures may be consistent with the description of the foregoing embodiments, and are not described herein again.
In some optional implementations of this embodiment, the target file system providing redundant storage may further store multiple copies of the calculation result file in its data nodes in a distributed manner, thereby completing the backup of the calculation result files uploaded by the Map ends in the MapReduce.
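One illustrative way a file system could spread copies across data nodes is hash-based placement, sketched below; this is only a stand-in for the multi-copy behavior described, since real systems such as HDFS use rack-aware placement policies instead.

```python
import hashlib

def place_replicas(file_name, data_nodes, replication=3):
    """Pick `replication` distinct data nodes for a file's copies by
    hashing the file name to a starting node and taking the next nodes
    in ring order -- an illustrative placement policy only."""
    if replication > len(data_nodes):
        raise ValueError("not enough data nodes for requested replication")
    digest = int(hashlib.md5(file_name.encode()).hexdigest(), 16)
    start = digest % len(data_nodes)
    return [data_nodes[(start + i) % len(data_nodes)]
            for i in range(replication)]
```

Whatever the placement policy, the point for the Shuffle is that a single data-node failure leaves at least one readable copy of each calculation result file.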
In step 704, the Reduce end obtains an address of at least one Map end corresponding to the Reduce end from the task scheduling end.
In step 705, according to the address of the at least one Map end, the Reduce end sends a data acquisition request for acquiring data of the partition corresponding to the Reduce end to the at least one Map end.
In this embodiment, as an example shown in fig. 7b, the Reduce end _0 identified as "0" may send data acquisition requests to the Map end _0 identified as "0" and the Map end _1 identified as "1", respectively.
In step 706, the Reduce end receives response information of the data request sent by the corresponding Map end.
In this embodiment, after receiving a data acquisition request sent by a Reduce end, a Map end in the MapReduce usually returns, to the Reduce end that sent the request, the data of the partition belonging to that Reduce end in the Map end's local calculation result file as response information. The Reduce end can therefore receive the response information sent by the corresponding Map end.
For example, with continued reference to fig. 7b, the Map end _0 may respond to the data acquisition request of the Reduce end _0 and return the data of the 0th partition in its local data file "0.data" to the Reduce end _0 as response information. In this way, the Reduce end _0 obtains the data of its corresponding partition from the Map end _0.
In practice, the Reduce end usually sends data acquisition requests to all Map ends in the MapReduce. Because of factors such as Map node failure, heavy load, or network delay, the Reduce end often cannot receive response information from every Map end. For each Map end whose response information is not received, the Reduce end may determine that the data acquisition from that Map end has failed.
For example, with continued reference to fig. 7b, after the Reduce end _0 sends its data acquisition request, the Map end _1 may fail to respond in time because of node failure, network delay, node overload, or the like, so that the Reduce end _0 cannot obtain the required data of the 0th partition from the Map end _1, and the data acquisition fails.
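The failure handling in this example amounts to a try-then-fall-back pattern, sketched below with hypothetical fetch callables standing in for the Map-end request and the target-file-system read.

```python
def fetch_partition(fetch_from_map, fetch_from_target_fs):
    """Try the Map end first; on timeout, error, or an abnormal (empty)
    response, fall back to the redundant target file system instead of
    triggering recomputation of the Map stage."""
    try:
        data = fetch_from_map()
        if data is None:  # treat a missing/abnormal response as failure
            raise RuntimeError("abnormal response from Map end")
        return data
    except Exception:
        return fetch_from_target_fs()
```

The fallback path is only exercised for the Map ends that failed; partitions already received from healthy Map ends are kept as-is.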
In step 707, in response to determining that the data acquisition from the Map end corresponding to the Reduce end fails, the Reduce end acquires the data of the partition corresponding to the Reduce end from the target file system providing the redundant storage.
Optionally, with continued reference to fig. 7b, in response to determining that the Map end _1's response to the data acquisition request has timed out, the Reduce end _0 may access the target file system to obtain the data of the 0th partition in the calculation result file of the Map end _1. The Reduce end _0 may first obtain the index file "1.index" uploaded by the Map end _1 and stored in the target file system, so as to obtain the start position and end position of the data of the 0th partition in the data file "1.data" of the calculation result file uploaded by the Map end _1. The Reduce end _0 may then read the required data from the data file "1.data" according to the index file "1.index".
Based on this optional implementation, because the target file system provides a multi-copy capability and is responsible for the fault tolerance of the Shuffle data (i.e., the calculation result files uploaded by the Map ends), the situation that the required data cannot be accessed in the target file system does not arise.
In step 708, the Reduce end executes a Reduce task using the acquired data to generate a final result of the MapReduce process.
Steps 701 and 702 are consistent with steps 201 and 202 and their optional implementations in the foregoing embodiments, respectively, and steps 704, 705, 707, and 708 are consistent with steps 401 and 402 and their optional implementations, respectively. The above descriptions of steps 201, 202, 401, and 402 and their optional implementations also apply to steps 701, 702, 704, 705, 707, and 708, and are not repeated here.
In the data transmission system based on MapReduce provided by the above embodiments of the present application, first, a Map end executes a Map task to generate a calculation result file. And then, the Map end uploads the calculation result file to a target file system providing redundant storage. In response to receiving the calculation result file, the target file system names the calculation result file according to a preset naming rule and stores the calculation result file according to a preset directory structure. And then, the Reduce end acquires the address of at least one Map end corresponding to the Reduce end from the task scheduling end. And according to the address of at least one Map end, the Reduce end sends a data acquisition request for acquiring the data of the partition corresponding to the Reduce end to at least one Map end. Then, in response to determining that the data acquisition from the Map end corresponding to the Reduce end fails, the Reduce end acquires the data of the partition corresponding to the Reduce end from the target file system providing the redundant storage. And finally, the Reduce end executes the Reduce task by using the acquired data to generate a final result of the MapReduce process.
In this way, recomputation tasks caused by Shuffle failures due to Map node faults, excessive load, or network delay are avoided, the stability of the Shuffle process in the MapReduce model is improved, and the task running speed is further improved. Moreover, the method and apparatus can use an existing file system supporting multiple data copies to back up the Shuffle files; only a pluggable connection layer between the computation framework and the file system is added, the MapReduce model itself is not changed at all, and the intrusion into the computation framework is small. Therefore, this solution for the Shuffle process of the MapReduce model can be applied to any big-data computation framework based on the MapReduce model.
Referring now to FIG. 8, shown is a schematic diagram of an electronic device (e.g., server 1052 or 1055 of FIG. 1) 800 suitable for use in implementing embodiments of the present application. The server shown in fig. 8 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 8, an electronic device 800 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 801 that may perform various appropriate actions and processes in accordance with a program stored in a Read-Only Memory (ROM) 802 or a program loaded from a storage means 808 into a Random Access Memory (RAM) 803. The RAM 803 also stores various programs and data necessary for the operation of the electronic device 800. The processing means 801, the ROM 802, and the RAM 803 are connected to one another by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Generally, input devices 806 including, for example, a touch screen, a touch pad, a keyboard, a mouse, etc.; output devices 807 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, etc.; storage devices 808 including, for example, a magnetic tape, a hard disk, etc.; and communication devices 809 may be connected to the I/O interface 805. The communication devices 809 may allow the electronic device 800 to communicate wirelessly or by wire with other devices to exchange data.
In particular, according to embodiments of the application, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication means 809, or installed from the storage means 808, or installed from the ROM 802. The computer program, when executed by the processing apparatus 801, performs the above-described functions defined in the methods of the embodiments of the present application.
It should be noted that the computer readable medium described in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In embodiments of the present application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (Radio Frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: executing the Map task to generate a calculation result file, wherein the calculation result file comprises partitions with the number consistent with that of Reduce ends and data corresponding to the partitions; and uploading the calculation result file to a target file system providing redundant storage so that the corresponding Reduce end acquires data in the calculation result file through the target file system, wherein the target file system names the calculation result file according to a preset naming rule and stores the calculation result file according to a preset directory structure.
Computer program code for carrying out operations of embodiments of the present application may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor comprises a first generation unit and an uploading unit. For example, the first generating unit may also be described as a "unit that performs Map tasks to generate a calculation result file, where the calculation result file includes partitions and data corresponding to the partitions whose number is consistent with the number of Reduce ends".
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present application is not limited to the specific combination of the above-mentioned features, but also encompasses other embodiments in which any combination of the above-mentioned features or their equivalents is made without departing from the inventive concept as defined above. For example, the above features and (but not limited to) the features with similar functions disclosed in the embodiments of the present application are mutually replaced to form the technical solution.

Claims (12)

1. A data transmission method based on MapReduce is applied to a Map end and comprises the following steps:
executing a Map task to generate a calculation result file, wherein the calculation result file comprises partitions with the number consistent with that of Reduce ends and data corresponding to the partitions;
and uploading the calculation result file to a target file system providing redundant storage so that a corresponding Reduce end acquires data in the calculation result file through the target file system, wherein the target file system names the calculation result file according to a preset naming rule and stores the calculation result file according to a preset directory structure.
2. The method of claim 1, wherein the calculation result file comprises an index file and a data file, the data file is used for recording data, and the index file is used for identifying a starting position and an ending position of the data of each partition in the data file; and
the method further comprises the following steps:
and sending metadata information of the calculation result file to a task scheduling end, wherein the metadata information comprises the corresponding relation between the calculation result file and the Map end.
3. The method according to claim 2, wherein the predetermined naming rule comprises that the name of the calculation result file comprises an identifier of a corresponding Map terminal, and the data file and the index file of the calculation result file are distinguished according to a suffix.
4. The method according to one of claims 1 to 3, wherein the predetermined directory structure comprises a tree structure comprising, from top to bottom, an identification of an application, an identification of a MapReduce procedure belonging to an application, an identification of a Map task belonging to a MapReduce procedure, an identification of a computation result file belonging to a Map task.
5. A data transmission method based on MapReduce is applied to Reduce ends and comprises the following steps:
in response to determining that data acquisition from a Map end corresponding to the Reduce end fails, acquiring data of partitions corresponding to the Reduce end from a target file system providing redundant storage, wherein the target file system stores a calculation result file according to a preset directory structure, and the calculation result file comprises the partitions with the same number as the Reduce end and the data corresponding to the partitions;
and executing a Reduce task by using the acquired data to generate a final result of the MapReduce process.
6. The method according to claim 5, wherein before the acquiring of the data of the partition corresponding to the Reduce end from the target file system, the method further comprises:
acquiring an address of at least one Map end corresponding to the Reduce end from a task scheduling end, wherein the task scheduling end stores metadata information of the calculation result file, and the metadata information comprises a corresponding relation between the calculation result file and the Map end;
and sending a data acquisition request for acquiring the data of the partition corresponding to the Reduce end to the at least one Map end according to the address of the at least one Map end.
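The fallback behavior of claims 5 and 6, namely first requesting the partition from each Map end and only on failure reading the same partition from the redundantly stored calculation result file, might be sketched as follows. The `request`/`read` interfaces, the `file_id` attribute, and the use of `ConnectionError` as the failure signal are all assumptions of this sketch, not the patent's API:

```python
def fetch_partition(map_ends, partition_id, target_fs):
    """Reduce-side fetch: ask each Map end directly for this Reduce
    end's partition; if a direct request fails, fall back to the
    target file system, which redundantly stores every Map end's
    calculation result file under a known directory structure.

    `map_ends` is a list of objects with a `request(partition_id)`
    method and a `file_id` attribute; `target_fs` exposes
    `read(file_id, partition_id)`."""
    pieces = []
    for map_end in map_ends:
        try:
            pieces.append(map_end.request(partition_id))
        except ConnectionError:
            # Direct transfer failed: read the same partition from the
            # redundantly stored calculation result file instead.
            pieces.append(target_fs.read(map_end.file_id, partition_id))
    return pieces
```

The point of the fallback is that a failed or departed Map end no longer forces the Map task to be re-executed; the Reduce end can still assemble its input from the redundant copies.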
7. A MapReduce-based data transmission device, applied to a Map end, comprising:
a first generating unit configured to execute a Map task to generate a calculation result file, wherein the calculation result file comprises partitions with the number consistent with that of Reduce ends and data corresponding to the partitions; and
an uploading unit configured to upload the calculation result file to a target file system, so that a corresponding Reduce end acquires data in the calculation result file through the target file system, wherein the target file system names the calculation result file according to a preset naming rule and stores the calculation result file according to a preset directory structure.
8. A MapReduce-based data transmission device, applied to a Reduce end, comprising:
a first acquisition unit configured to, in response to determining that a data acquisition request to a Map end corresponding to the Reduce end fails, acquire data of a partition corresponding to the Reduce end from a target file system, wherein the target file system stores a calculation result file according to a preset directory structure, and the calculation result file comprises partitions with the number consistent with that of Reduce ends and data corresponding to the partitions; and
a second generating unit configured to execute a Reduce task using the acquired data to generate a final result of the MapReduce process.
9. A MapReduce-based data transmission system, comprising:
a Map end configured to implement the method according to any one of claims 1-4;
a Reduce end configured to implement the method according to any one of claims 5-6;
and a target file system configured to, in response to receiving the calculation result file, name the calculation result file according to a preset naming rule and store the calculation result file according to a preset directory structure.
10. The system according to claim 9, wherein the target file system is further configured to store multiple copies of the calculation result file in data nodes of the target file system.
11. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-6.
12. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of claims 1-6.
CN202010273234.9A 2020-04-09 2020-04-09 Data transmission method and device based on MapReduce Active CN111444148B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010273234.9A CN111444148B (en) 2020-04-09 2020-04-09 Data transmission method and device based on MapReduce

Publications (2)

Publication Number Publication Date
CN111444148A true CN111444148A (en) 2020-07-24
CN111444148B CN111444148B (en) 2023-09-05

Family

ID=71652931

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010273234.9A Active CN111444148B (en) 2020-04-09 2020-04-09 Data transmission method and device based on MapReduce

Country Status (1)

Country Link
CN (1) CN111444148B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102209087A (en) * 2010-03-31 2011-10-05 国际商业机器公司 Method and system for MapReduce data transmission in data center having SAN
CN102456031A (en) * 2010-10-26 2012-05-16 腾讯科技(深圳)有限公司 MapReduce system and method for processing data streams
CN103176843A (en) * 2013-03-20 2013-06-26 百度在线网络技术(北京)有限公司 File migration method and file migration equipment of Map Reduce distributed system
CN103617033A (en) * 2013-11-22 2014-03-05 北京掌阔移动传媒科技有限公司 Method, client and system for processing data on basis of MapReduce
CN105446896A (en) * 2014-08-29 2016-03-30 国际商业机器公司 MapReduce application cache management method and device
CN105955819A (en) * 2016-04-18 2016-09-21 中国科学院计算技术研究所 Data transmission method and system based on Hadoop
CN107220069A (en) * 2017-07-03 2017-09-29 中国科学院计算技术研究所 A kind of Shuffle methods for Nonvolatile memory
CN108595268A (en) * 2018-04-24 2018-09-28 咪咕文化科技有限公司 A kind of data distributing method, device and computer readable storage medium based on MapReduce

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113312316A (en) * 2021-07-28 2021-08-27 阿里云计算有限公司 Data processing method and device
CN113312316B (en) * 2021-07-28 2022-01-04 阿里云计算有限公司 Data processing method and device
WO2023005366A1 (en) * 2021-07-28 2023-02-02 华为云计算技术有限公司 Computing method and apparatus, device and storage medium

Also Published As

Publication number Publication date
CN111444148B (en) 2023-09-05

Similar Documents

Publication Publication Date Title
US11461270B2 (en) Shard splitting
US9747124B2 (en) Distributed virtual machine image management for cloud computing
US20160092493A1 (en) Executing map-reduce jobs with named data
US10127243B2 (en) Fast recovery using self-describing replica files in a distributed storage system
CN109508326B (en) Method, device and system for processing data
CN109144785B (en) Method and apparatus for backing up data
JP6774971B2 (en) Data access accelerator
CN111338834B (en) Data storage method and device
CN110673959A (en) System, method and apparatus for processing tasks
Merceedi et al. A comprehensive survey for hadoop distributed file system
CN111611622A (en) Block chain-based file storage method and electronic equipment
CN112597126A (en) Data migration method and device
CN111444148B (en) Data transmission method and device based on MapReduce
US11157456B2 (en) Replication of data in a distributed file system using an arbiter
US9684668B1 (en) Systems and methods for performing lookups on distributed deduplicated data systems
CN110798358B (en) Distributed service identification method and device, computer readable medium and electronic equipment
CN111767169A (en) Data processing method and device, electronic equipment and storage medium
CN105653566B (en) A kind of method and device for realizing database write access
CN111984686A (en) Data processing method and device
CN112148705A (en) Data migration method and device
CN113535673B (en) Method and device for generating configuration file and data processing
CN110543520B (en) Data migration method and device
CN112148461A (en) Application scheduling method and device
US10162626B2 (en) Ordered cache tiering for program build files
US11379147B2 (en) Method, device, and computer program product for managing storage system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 210023 163 Xianlin Road, Qixia District, Nanjing, Jiangsu

Applicant after: NANJING University

Applicant after: Douyin Vision Co.,Ltd.

Address before: 210023 163 Xianlin Road, Qixia District, Nanjing, Jiangsu

Applicant before: NANJING University

Applicant before: Tiktok vision (Beijing) Co.,Ltd.

Address after: 210023 163 Xianlin Road, Qixia District, Nanjing, Jiangsu

Applicant after: NANJING University

Applicant after: Tiktok vision (Beijing) Co.,Ltd.

Address before: 210023 163 Xianlin Road, Qixia District, Nanjing, Jiangsu

Applicant before: NANJING University

Applicant before: BEIJING BYTEDANCE NETWORK TECHNOLOGY Co.,Ltd.

GR01 Patent grant