CN111444148A - Data transmission method and device based on MapReduce - Google Patents
- Publication number
- CN111444148A (application CN202010273234.9A)
- Authority
- CN
- China
- Prior art keywords
- calculation result
- data
- file
- map
- result file
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/16—File or folder operations, e.g. details of user interfaces specifically adapted to file systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/13—File access structures, e.g. distributed indices
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/06—Protocols specially adapted for file transfer, e.g. file transfer protocol [FTP]
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Human Computer Interaction (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Embodiments of the present application disclose a MapReduce-based data transmission method and apparatus. In one specific implementation, the method includes: executing a Map task to generate a calculation result file, where the calculation result file contains as many partitions as there are Reduce sides, together with the data for each partition; and uploading the calculation result file to a target file system that provides redundant storage, so that the corresponding Reduce side can obtain the data in the calculation result file through the target file system, where the target file system names the calculation result file according to a predetermined naming rule and stores it according to a predetermined directory structure. This implementation avoids the computing resources and time consumed by recomputation, improves the stability of the Shuffle process, and is broadly applicable.
Description
Technical Field
Embodiments of the present application relate to the field of computer technology, and in particular to a MapReduce-based data transmission method and apparatus.
Background
With the rapid development of computer technology, the MapReduce distributed computing framework has become increasingly widely used. Between the Map and Reduce phases, a Shuffle is needed to transfer data from the output of Map tasks to the input of Reduce tasks. Because the Shuffle operation is the indispensable bridge between the Map phase and the Reduce phase, and usually involves heavy network transfer and disk reads and writes, Shuffle performance often directly affects the performance and throughput of the whole MapReduce process.
In practice, node failures, network latency, high cluster load, and similar causes can make data transfers time out during the Shuffle, so that the Reduce side (Reducer) cannot obtain the data it needs from the Map side (Mapper) and the Shuffle fails. The usual approach relies on the fault-tolerance mechanism that MapReduce itself provides: the task whose Shuffle failed is recomputed, that is, part of the input data is selected, the Map phase is re-executed on it, and the previously interrupted Reduce phase is then resumed.
Summary of the Invention
Embodiments of the present application propose a MapReduce-based data transmission method and apparatus.
In a first aspect, an embodiment of the present application provides a MapReduce-based data transmission method applied to a Map side. The method includes: executing a Map task to generate a calculation result file, where the calculation result file contains as many partitions as there are Reduce sides, together with the data for each partition; and uploading the calculation result file to a target file system that provides redundant storage, so that the corresponding Reduce side obtains the data in the calculation result file through the target file system, where the target file system names the calculation result file according to a predetermined naming rule and stores it according to a predetermined directory structure.
In some embodiments, the calculation result file includes an index file and a data file; the data file records the data, and the index file identifies the start position and end position of each partition's data within the data file. The method further includes: sending metadata information of the calculation result file to a task scheduling side (the MapReduce task scheduler), where the metadata information includes the correspondence between the calculation result file and the Map side.
In some embodiments, the predetermined naming rule specifies that the name of the calculation result file contains the identifier of the corresponding Map side, and that the data file and the index file of the calculation result file are distinguished by suffix.
In some embodiments, the predetermined directory structure includes a tree structure which, from top to bottom, comprises the identifier of an application, the identifier of a MapReduce process belonging to the application, the identifier of a Map task belonging to the MapReduce process, and the identifier of a calculation result file belonging to the Map task.
In a second aspect, an embodiment of the present application provides a MapReduce-based data transmission method applied to a Reduce side. The method includes: in response to determining that obtaining data from the Map side corresponding to the Reduce side has failed, obtaining the data of the partition corresponding to the Reduce side from a target file system that provides redundant storage, where the target file system stores calculation result files according to a predetermined directory structure, and each calculation result file contains as many partitions as there are Reduce sides, together with the data for each partition; and executing a Reduce task on the obtained data to generate the final result of the MapReduce process.
In some embodiments, before obtaining the data of the partition corresponding to the Reduce side from the file system, the method further includes: obtaining, from the task scheduling side, the address of at least one Map side corresponding to the Reduce side, where the task scheduling side stores metadata information of the calculation result files, and the metadata information includes the correspondence between calculation result files and Map sides; and sending, according to the address of the at least one Map side, a data acquisition request to the at least one Map side for the data of the partition corresponding to the Reduce side.
In a third aspect, an embodiment of the present application provides a MapReduce-based data transmission apparatus applied to a Map side. The apparatus includes: a first generating unit configured to execute a Map task to generate a calculation result file, where the calculation result file contains as many partitions as there are Reduce sides, together with the data for each partition; and an uploading unit configured to upload the calculation result file to a target file system that provides redundant storage, so that the corresponding Reduce side obtains the data in the calculation result file through the target file system, where the target file system names the calculation result file according to a predetermined naming rule and stores it according to a predetermined directory structure.
In some embodiments, the calculation result file includes an index file and a data file; the data file records the data, and the index file identifies the start position and end position of each partition's data within the data file. The apparatus further includes: a sending unit configured to send metadata information of the calculation result file to the task scheduling side, where the metadata information includes the correspondence between the calculation result file and the Map side.
In some embodiments, the predetermined naming rule specifies that the name of the calculation result file contains the identifier of the corresponding Map side, and that the data file and the index file of the calculation result file are distinguished by suffix.
In some embodiments, the predetermined directory structure includes a tree structure which, from top to bottom, comprises the identifier of an application, the identifier of a MapReduce process belonging to the application, the identifier of a Map task belonging to the MapReduce process, and the identifier of a calculation result file belonging to the Map task.
In a fourth aspect, an embodiment of the present application provides a MapReduce-based data transmission apparatus applied to a Reduce side. The apparatus includes: a first acquisition unit configured to, in response to determining that obtaining data from the Map side corresponding to the Reduce side has failed, obtain the data of the partition corresponding to the Reduce side from a target file system that provides redundant storage, where the target file system stores calculation result files according to a predetermined directory structure, and each calculation result file contains as many partitions as there are Reduce sides, together with the data for each partition; and a second generating unit configured to execute a Reduce task on the obtained data to generate the final result of the MapReduce process.
In some embodiments, the apparatus further includes: an address acquisition unit configured to obtain, from the task scheduling side, the address of at least one Map side corresponding to the Reduce side, where the task scheduling side stores metadata information of the calculation result files, and the metadata information includes the correspondence between calculation result files and Map sides; and a second acquisition unit configured to send, according to the address of the at least one Map side, a data acquisition request to the at least one Map side for the data of the partition corresponding to the Reduce side.
In a fifth aspect, an embodiment of the present application provides a MapReduce-based data transmission system. The system includes: a Map side configured to perform the method described in any implementation of the first aspect; a Reduce side configured to perform the method described in any implementation of the second aspect; and a target file system configured to, in response to receiving a calculation result file, name the calculation result file according to a predetermined naming rule and store it according to a predetermined directory structure.
In some embodiments, the target file system is further configured to store multiple replicas of the calculation result file on the data nodes of the target file system.
In a sixth aspect, an embodiment of the present application provides an electronic device including: one or more processors; and a storage apparatus storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method described in any implementation of the first aspect.
In a seventh aspect, an embodiment of the present application provides a computer-readable medium storing a computer program which, when executed by a processor, implements the method described in any implementation of the first aspect.
In the MapReduce-based data transmission method and apparatus provided by the embodiments of the present application, a Map task is first executed to generate a calculation result file, which contains as many partitions as there are Reduce sides, together with the data for each partition. The calculation result file is then uploaded to a target file system that provides redundant storage, so that the corresponding Reduce side can obtain the data in the calculation result file through the target file system. The target file system names the calculation result file according to a predetermined naming rule and stores it according to a predetermined directory structure. The target file system thus holds a backup of the calculation result file and serves as an alternate data source for the Reduce side when the Shuffle fails. This avoids the computing resources and time consumed by recomputation and improves the stability of the Shuffle process. Moreover, because the internals of the MapReduce model are left unchanged, the approach is minimally intrusive to the computing framework and broadly applicable.
Brief Description of the Drawings
Other features, objects, and advantages of the present application will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the accompanying drawings:
FIG. 1 is an exemplary system architecture diagram to which an embodiment of the present application may be applied;
FIG. 2 is a flowchart of one embodiment of the MapReduce-based data transmission method according to the present application;
FIG. 3 is a schematic diagram of a directory structure used by the MapReduce-based data transmission method according to an embodiment of the present application;
FIG. 4 is a flowchart of another embodiment of the MapReduce-based data transmission method according to the present application;
FIG. 5 is a schematic structural diagram of one embodiment of the MapReduce-based data transmission apparatus according to the present application;
FIG. 6 is a schematic structural diagram of one embodiment of the MapReduce-based data transmission apparatus according to the present application;
FIG. 7a is a sequence diagram of the interaction between devices in one embodiment of MapReduce-based data transmission according to the present application;
FIG. 7b is a schematic diagram of the interaction between devices in one embodiment of MapReduce-based data transmission according to the present application;
FIG. 8 is a schematic structural diagram of an electronic device suitable for implementing the embodiments of the present application.
Detailed Description
The present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the related invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts relevant to the invention.
It should be noted that the embodiments of the present application, and the features of those embodiments, may be combined with one another where no conflict arises. The present application is described in detail below with reference to the drawings and in conjunction with the embodiments.
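FIG. 1 shows an exemplary architecture 100 to which the MapReduce-based data transmission method or the MapReduce-based data transmission apparatus of the present application can be applied.
As shown in FIG. 1, the system architecture 100 may include terminal devices 101, 102, and 103, a network 104, and servers 1051, 1052, 1053, 1054, 1055, and 1056. The network 104 is the medium that provides communication links between the terminal devices 101, 102, and 103 and the server 1051. The network 104 may include various connection types, such as wired links, wireless communication links, or fiber-optic cables.
The terminal devices 101, 102, and 103 interact with the server 1051 through the network 104 to receive or send messages and the like. Various communication client applications may be installed on the terminal devices 101, 102, and 103, such as web browsers, shopping applications, search applications, instant messaging tools, email clients, and social platform software.
The terminal devices 101, 102, and 103 may be hardware or software. As hardware, they may be any of various electronic devices that have a display and support displaying calculation results, including but not limited to smartphones, tablets, e-book readers, laptop computers, and desktop computers. As software, they may be installed on the electronic devices listed above, and may be implemented either as multiple pieces of software or software modules (for example, to provide distributed services) or as a single piece of software or software module. No specific limitation is imposed here.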
The server 1051 may be a server that provides various services, for example a backend server supporting the applications running on the terminal devices 101, 102, and 103. The backend server may decompose a received computing task, send the parts to computing nodes to compute results, and return the results to the terminal device. Specifically, the server 1051 (for example, an MRAppMaster node) may assign Map tasks to the Map nodes 1052 and 1053 for execution. A Map node may upload the intermediate result files produced by its Map task to the file server 1054. The Reduce nodes 1055 and 1056 may obtain the intermediate result files from the Map nodes 1052 and 1053 or from the file server 1054 in order to execute Reduce tasks and generate the calculation results.
It should be noted that a server may be hardware or software. As hardware, it may be implemented as a distributed cluster of multiple servers or as a single server. As software, it may be implemented as multiple pieces of software or software modules (for example, to provide distributed services) or as a single piece of software or software module. No specific limitation is imposed here.
It should be noted that the MapReduce-based data transmission method provided by the embodiments of the present application is generally executed by the servers 1052 and 1053 or by the servers 1055 and 1056; accordingly, the MapReduce-based data transmission apparatus is generally deployed in the servers 1052 and 1053 or in the servers 1055 and 1056.
It should be understood that the numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative. Any number of terminal devices, networks, and servers may be used, as the implementation requires.
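Continuing with FIG. 2, a flow 200 of one embodiment of the MapReduce-based data transmission method according to the present application is shown. The method may be applied to a Map side and may include the following steps:
Step 201: execute a Map task to generate a calculation result file.
In this embodiment, the executing entity of the MapReduce-based data transmission method (for example, the server 1052 or 1053 shown in FIG. 1) may execute a Map task on the data it receives, thereby generating a calculation result file. The calculation result file may contain as many partitions as there are Reduce sides, together with the data for each partition.
In this embodiment, the executing entity may obtain the Map task over a wired or wireless connection. For example, it may obtain the Map task from an electronic device with which it communicates (such as the server 1051 shown in FIG. 1). The Map task may include data processing logic sent by a user through a terminal device. The executing entity may read the input dataset from a specified location and process it according to that logic to obtain intermediate results. It may then write the intermediate results into an in-memory buffer (for example, a circular memory buffer), partitioning and sorting the data as it is written. In response to determining that the amount of data written reaches a preset threshold (for example, 80%), the executing entity may start a spill thread that spills the data in the memory buffer to the local disk, producing a temporary file. Once all intermediate results of the Map task have been written, the executing entity may determine whether any spilled temporary files exist. If none exist, it may write the data in the memory buffer directly to the local disk, producing the calculation result file. If some exist, it may merge all of the temporary files and the data remaining in the memory buffer by partition, that is, combine the data belonging to the same partition. It may then sort the merged data by partition once more and write the sorted data to the local disk, producing the calculation result file.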
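The buffer, spill, and merge sequence described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the class name `SpillingBuffer` is hypothetical, the threshold is counted in records rather than bytes, and in-memory lists stand in for the temporary files on local disk.

```python
import heapq


class SpillingBuffer:
    """Toy model of a Map-side buffer that spills sorted runs (hypothetical)."""

    def __init__(self, capacity, threshold=0.8):
        self.capacity = capacity      # assumed capacity, counted in records
        self.threshold = threshold    # spill once this fraction is filled
        self.records = []             # stands in for the in-memory buffer
        self.spills = []              # stands in for temporary files on disk

    @staticmethod
    def _sort_key(record):
        partition, key, _value = record
        return (partition, key)

    def write(self, partition, key, value):
        self.records.append((partition, key, value))
        if len(self.records) >= self.capacity * self.threshold:
            self._spill()

    def _spill(self):
        # Each spill is one run sorted by (partition, key).
        self.spills.append(sorted(self.records, key=self._sort_key))
        self.records = []

    def finish(self):
        # Merge every spilled run with what is left in the buffer.
        runs = self.spills + [sorted(self.records, key=self._sort_key)]
        return list(heapq.merge(*runs, key=self._sort_key))
```

When no spill has occurred, `finish` simply returns the sorted buffer contents, which corresponds to the branch in which the buffer is written straight to disk.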
Optionally, sorting by partition may specifically include sorting according to the partition to which the hash value of each intermediate result's key belongs. Optionally, the executing entity may also combine the values of key-value pairs that share the same key.
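As a rough illustration of hash-based partitioning followed by a combine step, the sketch below assigns each key to a partition and merges values that share a key. The choice of CRC32 as the hash, summation as the combine function, and both function names are assumptions made for the example; the source does not specify them.

```python
import zlib
from collections import defaultdict


def partition_of(key, num_reducers):
    """Map a string key to a partition; CRC32 is an assumed, stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def partition_sort_combine(pairs, num_reducers):
    """Combine values of equal keys, then order records by (partition, key)."""
    combined = defaultdict(int)
    for key, value in pairs:
        combined[key] += value  # combine step: summation is an assumption
    records = [(partition_of(k, num_reducers), k, v) for k, v in combined.items()]
    # Sorting by partition first keeps each Reduce side's data contiguous.
    return sorted(records, key=lambda rec: (rec[0], rec[1]))
```

A stable hash is used here instead of Python's built-in `hash`, whose value for strings varies between interpreter runs and would therefore assign keys to different partitions on different nodes.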
It should be noted that, in the MapReduce framework, reading the input dataset on the Map side is usually performed by multiple Map sides in parallel.
In some optional implementations of this embodiment, the calculation result file may include an index file and a data file. The data file records the data, and the index file identifies the start position and end position of each partition's data within the data file.
In these implementations, the index file and the data file are generated on the local disk of the executing entity. Each Reduce side in MapReduce can therefore use the index file to locate the data of its own partition more quickly, which improves the efficiency of data lookup and retrieval.
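One possible layout for the index file and data file is sketched below. The binary format (two big-endian 64-bit offsets per partition) and the function names are assumptions chosen for illustration; the source only states that the index records each partition's start and end position in the data file.

```python
import struct

# Assumed index entry: start and end offsets as big-endian unsigned 64-bit ints.
INDEX_ENTRY = struct.Struct(">QQ")


def build_result_files(partitions):
    """Concatenate per-partition payloads; return (data_bytes, index_bytes)."""
    data = bytearray()
    index = bytearray()
    for payload in partitions:      # one payload per Reduce side, in order
        start = len(data)
        data += payload
        index += INDEX_ENTRY.pack(start, len(data))
    return bytes(data), bytes(index)


def read_partition(data, index, partition_id):
    """Look up one partition's byte range in the index and slice it out."""
    start, end = INDEX_ENTRY.unpack_from(index, partition_id * INDEX_ENTRY.size)
    return data[start:end]
```

Because every index entry has a fixed size, a Reduce side can jump directly to its own entry without scanning the entries of other partitions.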
In some optional implementations of this embodiment, where the calculation result file includes an index file and a data file, the executing entity may also send metadata information of the calculation result file to the task scheduling side. The metadata information may include the correspondence between the calculation result file and the Map side. Typically, the task scheduling side is responsible for maintaining the metadata information uploaded by each Map side in MapReduce.
In these implementations, each Reduce side in MapReduce can use the metadata information to determine the Map node that holds the data it needs.
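The metadata kept by the task scheduling side might look like the sketch below. The class and field names are hypothetical, and storing the Map side's network address alongside the file-to-Map correspondence is an assumption made so that a Reduce side can look up where to send its request.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass(frozen=True)
class ResultFileMetadata:
    map_id: str                   # identifier of the Map side
    map_address: str              # assumed: address Reduce sides can contact
    file_names: Tuple[str, ...]   # calculation result files it produced


@dataclass
class SchedulerMetadataStore:
    """In-memory stand-in for the scheduler's metadata table (hypothetical)."""
    _by_map: Dict[str, ResultFileMetadata] = field(default_factory=dict)

    def register(self, meta: ResultFileMetadata) -> None:
        self._by_map[meta.map_id] = meta

    def map_addresses(self) -> Dict[str, str]:
        return {m.map_id: m.map_address for m in self._by_map.values()}

    def owner_of(self, file_name: str) -> str:
        # Resolve the file-to-Map correspondence described in the text.
        for meta in self._by_map.values():
            if file_name in meta.file_names:
                return meta.map_id
        raise KeyError(file_name)
```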
Step 202: upload the calculation result file to a target file system that provides redundant storage, so that the corresponding Reduce side can obtain the data in the calculation result file through the target file system.
In this embodiment, the executing entity may upload the calculation result file generated in step 201 to a target file system that provides redundant storage, over a wired or wireless connection. The target file system may name the calculation result file according to a predetermined naming rule and store it according to a predetermined directory structure. The naming rule and directory structure may be any rule and directory structure specified in advance. Typically, the target file system uses the naming rule and directory structure to organize its files.
In this embodiment, the target file system that provides redundant storage may be any file system, specified in advance according to the needs of the application, that is distinct from the local disk of the Map side. It may also be a file system determined according to a rule, for example a file system capable of keeping multiple replicas, such as HDFS (Hadoop Distributed File System).
In some optional implementations of this embodiment, the predetermined naming rule may specify that the name of the calculation result file contains the identifier of the corresponding Map side, and that the data file and the index file of the calculation result file are distinguished by suffix.
As an example, the Map sides with identifiers "0" and "1" each upload their calculation result files to the target file system after writing them to local disk, and each file name corresponds to the unique identifier of its Map side. The index file and data file uploaded by the Map side with identifier "0" are stored in the target file system as "0.index" and "0.data", respectively. The index file and data file uploaded by the Map side with identifier "1" are stored as "1.index" and "1.data", respectively.
In these implementations, the extra overhead of maintaining file name information and the mapping between Map sides and files is avoided, so that a Reduce side can directly access the files and data it needs.
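The naming rule from the example can be expressed as a pair of small helpers. The function names are hypothetical; the "index" and "data" suffixes come from the example file names in the text.

```python
def result_file_names(map_id):
    """Return the (index file, data file) names for one Map side."""
    return f"{map_id}.index", f"{map_id}.data"


def parse_result_file_name(name):
    """Split a result file name back into (map_id, suffix)."""
    map_id, dot, suffix = name.rpartition(".")
    if not dot or not map_id or suffix not in ("index", "data"):
        raise ValueError(f"not a calculation result file name: {name!r}")
    return map_id, suffix
```

Since both directions are pure functions of the Map side identifier, no lookup table is needed to go from a Map side to its files, which is the overhead the naming rule is meant to avoid.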
In some optional implementations of this embodiment, the predetermined directory structure may include a tree structure. From top to bottom, the tree structure may comprise the identifier of an application, the identifier of a MapReduce process belonging to the application, the identifier of a Map task belonging to the MapReduce process, and the identifier of a calculation result file belonging to the Map task.
As an example, as shown in FIG. 3, the first-level directory of the tree structure 300 may be the unique identifier of an application (applicationID), for example "application_0", which separates data belonging to different applications. The second-level directory may be the unique identifier of the Shuffle in a particular MapReduce process (shuffleID). A single application may run several MapReduce processes and therefore hold data from several Shuffles. To separate data belonging to different Shuffles, the unique identifiers of the two Shuffle processes of "application_0", for example "shuffle_0" and "shuffle_1", may be used as second-level directories. The third-level directory may be the unique identifier of a Map task (MapID), for example "map_0" and "map_1". The fourth level holds the identifiers of the stored calculation result files, for example "0.index" and "0.data". The data file and index file belonging to the same Map task are thus stored under that Map task's unique identifier.
In these implementations, the executing entity avoids the extra overhead of maintaining metadata such as file storage paths, which improves the efficiency of looking up calculation result files in the target file system. Moreover, a Reduce side can derive the storage path directly from the calculation result file it needs and then access the target file system directly, without first obtaining the file path or similar information from another system.
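Because the directory layout is fixed, a storage path can be computed from identifiers alone. The sketch below shows one way to do so; the `root` argument, the function name, and the exact directory name prefixes (`application_`, `shuffle_`, `map_`) are assumptions modelled on the example above.

```python
import posixpath


def result_file_path(root, app_id, shuffle_id, map_id, suffix):
    """Build the four-level path: application / shuffle / map task / file."""
    if suffix not in ("index", "data"):
        raise ValueError(f"unknown suffix: {suffix!r}")
    return posixpath.join(
        root,
        f"application_{app_id}",   # level 1: application identifier
        f"shuffle_{shuffle_id}",   # level 2: Shuffle of one MapReduce process
        f"map_{map_id}",           # level 3: Map task identifier
        f"{map_id}.{suffix}",      # level 4: calculation result file
    )
```

`posixpath` is used rather than `os.path` so that the separator is always "/", as is usual for distributed file system paths, regardless of the operating system running the code.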
In some optional implementations of this embodiment, multiple replicas of each calculation result file may also be stored in a distributed manner on the data nodes of the target file system, thereby backing up the data.
At present, one existing approach has the Map side merely write the calculation result file generated by the Map task to its local disk. If the Map node fails, or under network latency and similar conditions, the required data cannot be sent to the corresponding Reduce side, which triggers recomputation in the MapReduce model. In the method provided by the above embodiment, by contrast, the Map side writes the calculation result file generated by the Map task to its local disk and then uploads it to a target file system that provides redundant storage. The target file system thus holds a backup of the calculation result file and serves as an alternate data source for the Reduce side when the Shuffle fails. This avoids the computing resources and time consumed by recomputation and improves the stability of the Shuffle process. Moreover, because the internals of the MapReduce model are left unchanged, the approach is minimally intrusive to the computing framework and broadly applicable.
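With continued reference to FIG. 4, FIG. 4 shows a flow 400 of an embodiment of the MapReduce-based data transmission method according to an embodiment of the present application. The MapReduce-based data transmission method may be applied to the Reduce side and may include the following steps: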
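Step 401: in response to determining that acquiring data from the Map side corresponding to the Reduce side has failed, acquire the data of the partition corresponding to the Reduce side from a target file system that provides redundant storage.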
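In this embodiment, the execution body of the MapReduce-based data transmission method (for example, the server 1055 or 1056 shown in FIG. 1) may first determine, in various ways, that acquiring data from the Map side corresponding to the Reduce side has failed. For example, communication between the Reduce side and the Map side may time out, or the content returned by the Map side may be abnormal. The execution body may then acquire the data of the partition corresponding to the Reduce side from the target file system that provides redundant storage. The target file system may store calculation result files according to a predetermined directory structure. A calculation result file may include as many partitions as there are Reduce sides, together with the data of each partition.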
It should be noted that the target file system providing redundant storage and the calculation result file may be the same as described in the foregoing embodiments, and details are not repeated here.
Specifically, in response to determining that acquiring the data has failed, the execution body may obtain the storage path of the calculation result file generated by the Map side according to the rule by which the directory structure of the target file system is generated. The execution body may then acquire the data of the corresponding partition from that storage path.
Optionally, where the calculation result file includes an index file and a data file, the execution body may parse the index file of the calculation result file at the storage path to obtain the start position and end position of the corresponding partition's data in the data file that corresponds to the index file. The execution body may then read the data from that data file.
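The index lookup described above can be sketched as follows. This is a minimal illustration, not the format defined by the application: it assumes that the index file stores one big-endian unsigned 64-bit offset per partition boundary, so that partition i spans the i-th and (i+1)-th offsets. The function name `read_partition` and the constant `OFFSET_SIZE` are hypothetical.

```python
import struct

OFFSET_SIZE = 8  # assumption: each index entry is a big-endian unsigned 64-bit offset


def read_partition(index_path, data_path, partition_id):
    """Return the bytes of one partition, located through the index file."""
    with open(index_path, "rb") as index_file:
        # Entry i is the start of partition i; entry i + 1 is its end.
        index_file.seek(partition_id * OFFSET_SIZE)
        raw = index_file.read(2 * OFFSET_SIZE)
    if len(raw) != 2 * OFFSET_SIZE:
        raise ValueError(f"partition {partition_id} is not in the index")
    start, end = struct.unpack(">2Q", raw)
    with open(data_path, "rb") as data_file:
        data_file.seek(start)
        return data_file.read(end - start)
```

Under this assumed layout, only two index entries and one contiguous byte range of the data file are read, so a Reduce side does not need to scan the data of other partitions.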
In some optional implementations of this embodiment, before step 401, the execution body may further perform the following steps:
First, obtain from the task scheduling side the address of at least one Map side corresponding to the Reduce side.
In these implementations, the execution body may obtain the address of each Map side in the MapReduce from the task scheduling side. The task scheduling side may store metadata information of the calculation result files. The metadata information may include the correspondence between calculation result files and Map sides.
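As a rough illustration of the metadata kept by the task scheduling side, the sketch below models it as an in-memory registry. The class `TaskScheduler`, its methods, and the address strings are hypothetical and only show the kind of correspondence (Map side to address and calculation result file) that the text describes.

```python
from dataclasses import dataclass, field


@dataclass
class TaskScheduler:
    """Toy stand-in for the task scheduling side (all names are hypothetical)."""

    # Maps a Map-side identifier to (address, calculation result file name).
    registry: dict = field(default_factory=dict)

    def register(self, map_id, address, result_file):
        """Record the metadata a Map side reports after its Map task finishes."""
        self.registry[map_id] = (address, result_file)

    def map_addresses(self):
        """Return the address of every registered Map side, keyed by identifier."""
        return {map_id: address for map_id, (address, _) in self.registry.items()}
```

A Reduce side would call something like `map_addresses()` before sending its data acquisition requests.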
It should be noted that the task scheduling side may be the same as described in the foregoing embodiments, and details are not repeated here.
Second, according to the address of the at least one Map side, send to the at least one Map side a data acquisition request for acquiring the data of the partition corresponding to the Reduce side.
In these implementations, the execution body may send each Map side in the MapReduce a data acquisition request for the data of the partition corresponding to the Reduce side.
It should be noted that, in the MapReduce framework, multiple Reduce sides can usually read the data of their respective partitions from the Map sides or from the target file system in parallel.
Step 402: execute the Reduce task using the acquired data to generate the final result of the MapReduce process.
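In this embodiment, the execution body may use the data acquired in step 401 to execute the Reduce task and generate the final result of the MapReduce process. Specifically, the execution body may first sort and combine the acquired data, and then perform the computation according to the data processing logic indicated by the Reduce task to obtain the final result.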
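The sort, combine, and compute sequence of step 402 can be sketched as below. The function `run_reduce` and the `(key, value)` record shape are assumptions made for illustration; the application does not prescribe a record format or a particular Reduce function.

```python
from itertools import groupby
from operator import itemgetter


def run_reduce(records, reduce_fn):
    """Sort (key, value) records, group them by key, and apply reduce_fn per key."""
    # groupby only merges adjacent items, so the records are sorted by key first.
    ordered = sorted(records, key=itemgetter(0))
    return {
        key: reduce_fn(key, [value for _, value in group])
        for key, group in groupby(ordered, key=itemgetter(0))
    }
```

For example, `run_reduce(records, lambda key, values: sum(values))` would total the values of each key, regardless of whether the records came from a Map side or from the target file system.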
At present, in one existing approach, when the Reduce side cannot obtain the required data from the Map side because of a Map node failure, network delay, heavy load, or similar causes, the MapReduce model recomputes the Map process, which wastes computing resources and time. In the method provided by the above embodiment of the present application, by contrast, when the required data cannot be obtained from the Map side, the Reduce side instead accesses the target file system that provides redundant storage for the calculation result files. This avoids the computing resources and time consumed by recomputation and improves the stability of the Shuffle process. Moreover, because the internals of the MapReduce model are not modified, the method is minimally intrusive to the computing framework and is widely applicable.
With further reference to FIG. 5, as an implementation of the methods shown in the above figures, the present application provides an embodiment of a MapReduce-based data transmission apparatus. This apparatus embodiment corresponds to the method embodiment shown in FIG. 2, and the apparatus may be applied to various electronic devices.
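As shown in FIG. 5, the MapReduce-based data transmission apparatus 500 provided in this embodiment includes a first generating unit 501 and an uploading unit 502. The first generating unit 501 is configured to execute a Map task to generate a calculation result file, where the calculation result file includes as many partitions as there are Reduce sides, together with the data of each partition. The uploading unit 502 is configured to upload the calculation result file to a target file system that provides redundant storage, so that the corresponding Reduce side can acquire the data in the calculation result file through the target file system, where the target file system names the calculation result file according to a predetermined naming rule and stores it according to a predetermined directory structure.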
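In this embodiment, for the specific processing of the first generating unit 501 and the uploading unit 502 of the MapReduce-based data transmission apparatus 500, and the technical effects thereof, reference may be made to the descriptions of step 201 and step 202 in the embodiment corresponding to FIG. 2, respectively. Details are not repeated here.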
In some optional implementations of this embodiment, the calculation result file may include an index file and a data file. The data file may be used to record data. The index file may be used to identify the start position and end position of each partition's data in the data file. The MapReduce-based data transmission apparatus 500 may further include a sending unit (not shown in the figure) configured to send metadata information of the calculation result file to the task scheduling side. The metadata information may include the correspondence between the calculation result file and the Map side.
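To make the relationship between the two files concrete, the following sketch writes one data file and one index file for a single Map task. The on-disk layout (a run of big-endian unsigned 64-bit offsets, one per partition boundary) and the name `write_result_files` are assumptions for illustration; the application only states that the index file identifies each partition's start and end positions.

```python
import struct


def write_result_files(partitions, data_path, index_path):
    """Write one data file and one index file for a single Map task.

    `partitions` is a list of bytes objects, one per Reduce side.
    Returns the list of boundary offsets that was written to the index.
    """
    offsets = [0]
    with open(data_path, "wb") as data_file:
        for payload in partitions:
            data_file.write(payload)
            # The end of this partition is the start of the next one.
            offsets.append(offsets[-1] + len(payload))
    with open(index_path, "wb") as index_file:
        # Assumed layout: len(partitions) + 1 big-endian unsigned 64-bit offsets.
        index_file.write(struct.pack(f">{len(offsets)}Q", *offsets))
    return offsets
```

Storing boundaries rather than (start, length) pairs keeps the index compact, and an empty partition simply shows up as two equal consecutive offsets.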
In some optional implementations of this embodiment, the predetermined naming rule may include that the name of a calculation result file contains the identifier of the corresponding Map side, and that the data file and the index file of a calculation result file are distinguished by suffix.
In some optional implementations of this embodiment, the predetermined directory structure may include a tree structure. From top to bottom, the tree structure includes the identifier of an application, the identifier of a MapReduce process belonging to the application, the identifier of a Map task belonging to the MapReduce process, and the identifier of a calculation result file belonging to the Map task.
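Because the naming rule and the tree structure are both deterministic, a storage path can be computed from identifiers alone. The sketch below shows one way this might look. The root directory, the identifier formats, and the function name `result_file_paths` are hypothetical; the `.data` and `.index` suffixes follow the file names used in the examples of this description.

```python
from pathlib import PurePosixPath


def result_file_paths(root, app_id, job_id, map_task_id):
    """Derive the storage paths of one Map task's data file and index file.

    Tree levels, top down: application, MapReduce process, Map task, file.
    """
    task_dir = PurePosixPath(root) / app_id / job_id / str(map_task_id)
    return {
        # The file name carries the Map identifier; the suffix tells the two apart.
        "data": task_dir / f"{map_task_id}.data",
        "index": task_dir / f"{map_task_id}.index",
    }
```

With such a rule, a Reduce side that knows which Map task it needs can compute the path itself instead of asking another system for it.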
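In the apparatus provided by the above embodiment of the present application, the first generating unit 501 executes a Map task to generate a calculation result file, where the calculation result file includes as many partitions as there are Reduce sides, together with the data of each partition. The uploading unit 502 then uploads the calculation result file to a target file system that provides redundant storage, so that the corresponding Reduce side can acquire the data in the calculation result file through the target file system. The target file system names the calculation result file according to a predetermined naming rule and stores it according to a predetermined directory structure. The calculation result file is thereby backed up in the target file system, which gives the Reduce side an alternate data source when Shuffle fails. This also avoids the computing resources and time consumed by recomputation and improves the stability of the Shuffle process. Moreover, because the internals of the MapReduce model are not modified, the apparatus is minimally intrusive to the computing framework and is widely applicable.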
With further reference to FIG. 6, as an implementation of the methods shown in the above figures, the present application provides an embodiment of a MapReduce-based data transmission apparatus. This apparatus embodiment corresponds to the method embodiment shown in FIG. 4, and the apparatus may be applied to various electronic devices.
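As shown in FIG. 6, the MapReduce-based data transmission apparatus 600 provided in this embodiment includes a first acquiring unit 601 and a second generating unit 602. The first acquiring unit 601 is configured to, in response to determining that acquiring data from the Map side corresponding to the Reduce side has failed, acquire the data of the partition corresponding to the Reduce side from a target file system that provides redundant storage, where the target file system stores calculation result files according to a predetermined directory structure, and a calculation result file includes as many partitions as there are Reduce sides, together with the data of each partition. The second generating unit 602 is configured to execute the Reduce task using the acquired data to generate the final result of the MapReduce process.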
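In this embodiment, for the specific processing of the first acquiring unit 601 and the second generating unit 602 of the MapReduce-based data transmission apparatus 600, and the technical effects thereof, reference may be made to the descriptions of step 401 and step 402 in the embodiment corresponding to FIG. 4, respectively. Details are not repeated here.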
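In some optional implementations of this embodiment, the MapReduce-based data transmission apparatus 600 may further include an address acquiring unit (not shown in the figure) and a second acquiring unit (not shown in the figure). The address acquiring unit may be configured to obtain from the task scheduling side the address of at least one Map side corresponding to the Reduce side. The task scheduling side may store metadata information of the calculation result files. The metadata information may include the correspondence between calculation result files and Map sides. The second acquiring unit may be configured to send, according to the address of the at least one Map side, a data acquisition request to the at least one Map side for acquiring the data of the partition corresponding to the Reduce side.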
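In the apparatus provided by the above embodiment of the present application, the first acquiring unit 601 first, in response to determining that acquiring data from the Map side corresponding to the Reduce side has failed, acquires the data of the partition corresponding to the Reduce side from a target file system that provides redundant storage. The target file system stores calculation result files according to a predetermined directory structure, and a calculation result file includes as many partitions as there are Reduce sides, together with the data of each partition. The second generating unit 602 then executes the Reduce task using the acquired data to generate the final result of the MapReduce process. This avoids the computing resources and time consumed by recomputation and improves the stability of the Shuffle process. Moreover, because the internals of the MapReduce model are not modified, the apparatus is minimally intrusive to the computing framework and is widely applicable.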
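With further reference to FIG. 7a, it shows a sequence 700 of interactions among the devices in an embodiment of the MapReduce-based data transmission method. The MapReduce-based data transmission system may include a Map side (for example, the servers 1052 and 1053 shown in FIG. 1), a Reduce side (for example, the servers 1055 and 1056 shown in FIG. 1), and a target file system (for example, the server 1054 shown in FIG. 1). The Map side may be configured to implement the MapReduce-based data transmission method described in the embodiment shown in FIG. 2. The Reduce side may be configured to implement the MapReduce-based data transmission method described in the embodiment shown in FIG. 4. The target file system may be configured to, in response to receiving a calculation result file, name the calculation result file according to a predetermined naming rule and store it according to a predetermined directory structure.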
In some optional implementations of this embodiment, the target file system is further configured to store multiple copies of the calculation result file in the data nodes of the target file system.
As shown in FIG. 7a, in step 701, the Map side executes a Map task to generate a calculation result file.
In step 702, the Map side uploads the calculation result file to the target file system that provides redundant storage.
In some optional implementations of this embodiment, the Map side (not shown in FIG. 1) sends metadata information of the calculation result file to the task scheduling side.
In step 703, in response to receiving the calculation result file, the target file system names the calculation result file according to a predetermined naming rule and stores it according to a predetermined directory structure.
In this embodiment, the target file system may name the calculation result file according to a predetermined naming rule and store it according to a predetermined directory structure. The target file system can thus manage the calculation result files uploaded by all Map sides in the MapReduce in a unified manner.
It should be noted that the predetermined naming rule and the predetermined directory structure may be the same as described in the foregoing embodiments, and details are not repeated here.
In some optional implementations of this embodiment, the target file system providing redundant storage may also store multiple copies of the calculation result file in a distributed manner in the data nodes of the target file system, thereby backing up the calculation result files uploaded by the Map sides in the MapReduce.
In step 704, the Reduce side obtains from the task scheduling side the address of at least one Map side corresponding to the Reduce side.
In step 705, according to the address of the at least one Map side, the Reduce side sends to the at least one Map side a data acquisition request for acquiring the data of the partition corresponding to the Reduce side.
In this embodiment, as an example, as shown in FIG. 7b, Reduce side_0, whose identifier is "0", may send data acquisition requests to Map side_0, whose identifier is "0", and to Map side_1, whose identifier is "1".
In step 706, the Reduce side receives the response information that the corresponding Map side sends for the data request.
In this embodiment, after a Map side in the MapReduce receives a data acquisition request sent by a Reduce side, it usually takes, from its local calculation result file, the data of the partition belonging to the requesting Reduce side and returns that data as response information to the Reduce side that sent the request. The Reduce side can thus receive the response information that the corresponding Map side sends for the data request.
As an example, with continued reference to FIG. 7b, Map side_0 may respond to the data acquisition request of Reduce side_0 by returning the data of partition 0 in its local data file "0.data" to Reduce side_0 as response information. Reduce side_0 can thus obtain from Map side_0 the data of the partition corresponding to Reduce side_0.
In practice, the Reduce side usually sends data acquisition requests to all Map sides in the MapReduce. Because of possible Map node failures, high load, network delay, or similar factors, in some implementations the Reduce side often cannot receive the data response information from all Map sides. For a Map side whose data response information has not been received, the Reduce side may determine that acquiring data from that Map side has failed.
As an example, with continued reference to FIG. 7b, after Reduce side_0 sends its data acquisition request, Map side_1 may be unable to respond in time because of a node failure, network delay, excessive node load, or similar factors. Reduce side_0 then cannot obtain the required data of partition 0 from Map side_1, and the data acquisition fails.
In step 707, in response to determining that acquiring data from the Map side corresponding to the Reduce side has failed, the Reduce side acquires the data of the partition corresponding to the Reduce side from the target file system that provides redundant storage.
Optionally, with continued reference to FIG. 7b, in response to determining that the response of Map side_1 to the data acquisition request has timed out, Reduce side_0 may access the target file system to obtain the data of partition 0 in the calculation result file of Map side_1. Reduce side_0 may first obtain the index file "1.index", uploaded by Map side_1 and stored in the target file system, to determine the start position and end position, within the data file "1.data", of the data of partition 0 in the calculation result file uploaded by Map side_1. Reduce side_0 may then read the required data from the data file "1.data" as indicated by the index file "1.index".
Based on the above optional implementation, because the target file system can provide a multi-replica function, the target file system is responsible for the fault tolerance of the Shuffle data (that is, the calculation result files uploaded by the Map sides), so the problem of required data being inaccessible in the target file system does not arise.
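The fallback behaviour of steps 705 to 707 can be sketched as follows. The function `fetch_partition` and its two callable parameters are hypothetical placeholders for the real network request and the real file system read. Treating a `None` reply as an abnormal response is an assumption made for this sketch.

```python
def fetch_partition(map_id, partition_id, fetch_from_map, fetch_from_backup):
    """Try the Map side first; fall back to the redundant file system on failure.

    Returns a (data, source) tuple, where source is "map" or "backup".
    """
    try:
        data = fetch_from_map(map_id, partition_id)
        if data is None:  # assumption: None stands for an abnormal response
            raise ValueError("abnormal response from Map side")
        return data, "map"
    except (TimeoutError, ConnectionError, ValueError):
        # The Map side timed out, was unreachable, or answered abnormally, so
        # the partition is read from the target file system instead.
        return fetch_from_backup(map_id, partition_id), "backup"
```

Keeping both sources behind plain callables mirrors the point made in this description that the backup path is a pluggable layer: the Reduce logic that consumes the data does not need to know where the bytes came from.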
In step 708, the Reduce side executes the Reduce task using the acquired data to generate the final result of the MapReduce process.
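Steps 701 and 702 above are consistent with steps 201 and 202 in the foregoing embodiments and their optional implementations, respectively, and steps 704, 705, 707 and 708 above are consistent with steps 401 and 402 in the foregoing embodiments and their optional implementations. The descriptions above of step 201, step 202 and their optional implementations, and of step 401, step 402 and their optional implementations, also apply to steps 701 and 702 and to steps 704, 705, 707 and 708, and are not repeated here.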
In the MapReduce-based data transmission system provided by the above embodiment of the present application, the Map side first executes a Map task to generate a calculation result file. The Map side then uploads the calculation result file to a target file system that provides redundant storage. In response to receiving the calculation result file, the target file system names it according to a predetermined naming rule and stores it according to a predetermined directory structure. Next, the Reduce side obtains from the task scheduling side the address of at least one Map side corresponding to the Reduce side. According to the address of the at least one Map side, the Reduce side sends to the at least one Map side a data acquisition request for acquiring the data of the partition corresponding to the Reduce side. Then, in response to determining that acquiring data from the Map side corresponding to the Reduce side has failed, the Reduce side acquires the data of the partition corresponding to the Reduce side from the target file system that provides redundant storage. Finally, the Reduce side executes the Reduce task using the acquired data to generate the final result of the MapReduce process.
This avoids the recomputation that would otherwise be triggered when Shuffle fails because of a Map node failure, excessive load, or network delay, improves the stability of the Shuffle process in the MapReduce model, and in turn speeds up task execution. Moreover, the present application can use an existing file system that supports multiple data replicas to back up Shuffle files. It only adds a pluggable layer connecting the computing framework to the file system and makes no change to the MapReduce model as a whole, so it is minimally intrusive to the computing framework. The solution of the present application for the Shuffle process of the MapReduce model is therefore applicable to any big data computing framework based on the MapReduce model.
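Referring now to FIG. 8, it shows a schematic structural diagram of an electronic device 800 (for example, the server 1052 or 1055 in FIG. 1) suitable for implementing embodiments of the present application. The server shown in FIG. 8 is merely an example and should not impose any limitation on the functions or scope of use of the embodiments of the present application.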
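As shown in FIG. 8, the electronic device 800 may include a processing device 801 (for example, a central processing unit or a graphics processing unit), which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 802 or a program loaded from a storage device 808 into a random access memory (RAM) 803. The RAM 803 also stores various programs and data required for the operation of the electronic device 800. The processing device 801, the ROM 802, and the RAM 803 are connected to one another through a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.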
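In general, the following devices may be connected to the I/O interface 805: an input device 806 including, for example, a touch screen, a touch pad, a keyboard, and a mouse; an output device 807 including, for example, a liquid crystal display (LCD), a speaker, and a vibrator; a storage device 808 including, for example, a magnetic tape and a hard disk; and a communication device 809. The communication device 809 may allow the electronic device 800 to communicate with other devices wirelessly or by wire to exchange data. Although FIG. 8 shows the electronic device 800 with various devices, it should be understood that not all of the illustrated devices are required to be implemented or present. More or fewer devices may alternatively be implemented or present. Each block shown in FIG. 8 may represent one device or, as needed, multiple devices.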
In particular, according to embodiments of the present application, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the methods illustrated in the flowcharts. In such embodiments, the computer program may be downloaded and installed from a network through the communication device 809, installed from the storage device 808, or installed from the ROM 802. When the computer program is executed by the processing device 801, the above functions defined in the methods of the embodiments of the present application are performed.
It should be noted that the computer-readable medium described in the embodiments of the present application may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the embodiments of the present application, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in combination with an instruction execution system, apparatus, or device. A computer-readable signal medium, in the embodiments of the present application, may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. Program code contained on a computer-readable medium may be transmitted using any suitable medium, including but not limited to: electric wire, optical cable, RF (radio frequency), or any suitable combination of the above.
The computer-readable medium may be included in the electronic device described above, or it may exist separately without being assembled into the electronic device. The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: execute a Map task to generate a calculation result file, where the calculation result file includes as many partitions as there are Reduce sides, together with the data of each partition; and upload the calculation result file to a target file system that provides redundant storage, so that the corresponding Reduce side can acquire the data in the calculation result file through the target file system, where the target file system names the calculation result file according to a predetermined naming rule and stores it according to a predetermined directory structure.
Computer program code for performing the operations of the embodiments of the present application may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functions, and operations of possible implementations of systems, methods, and computer program products according to various embodiments of the present application. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in the embodiments of the present application may be implemented in software or in hardware. The described units may also be provided in a processor, which may be described, for example, as: a processor including a first generating unit and an uploading unit. The names of these units do not, in some cases, limit the units themselves. For example, the first generating unit may also be described as "a unit that executes a Map task to generate a calculation result file, where the calculation result file includes as many partitions as there are Reduce sides, together with the data of each partition".
The above description is merely of preferred embodiments of the present application and an explanation of the technical principles applied. Those skilled in the art should understand that the scope of the invention involved in the embodiments of the present application is not limited to technical solutions formed by the specific combination of the above technical features. It should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the above inventive concept, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the embodiments of the present application.
Claims (12)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010273234.9A CN111444148B (en) | 2020-04-09 | 2020-04-09 | Data transmission method and device based on MapReduce |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010273234.9A CN111444148B (en) | 2020-04-09 | 2020-04-09 | Data transmission method and device based on MapReduce |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111444148A true CN111444148A (en) | 2020-07-24 |
CN111444148B CN111444148B (en) | 2023-09-05 |
Family
ID=71652931
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010273234.9A Active CN111444148B (en) | 2020-04-09 | 2020-04-09 | Data transmission method and device based on MapReduce |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111444148B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113312316A (en) * | 2021-07-28 | 2021-08-27 | 阿里云计算有限公司 | Data processing method and device |
WO2023005366A1 (en) * | 2021-07-28 | 2023-02-02 | 华为云计算技术有限公司 | Computing method and apparatus, device and storage medium |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102209087A (en) * | 2010-03-31 | 2011-10-05 | 国际商业机器公司 | Method and system for MapReduce data transmission in data center having SAN |
CN102456031A (en) * | 2010-10-26 | 2012-05-16 | 腾讯科技(深圳)有限公司 | MapReduce system and method for processing data stream |
CN103176843A (en) * | 2013-03-20 | 2013-06-26 | 百度在线网络技术(北京)有限公司 | File migration method and file migration equipment of Map Reduce distributed system |
CN103617033A (en) * | 2013-11-22 | 2014-03-05 | 北京掌阔移动传媒科技有限公司 | Method, client and system for processing data on basis of MapReduce |
CN105446896A (en) * | 2014-08-29 | 2016-03-30 | 国际商业机器公司 | MapReduce application cache management method and device |
CN105955819A (en) * | 2016-04-18 | 2016-09-21 | 中国科学院计算技术研究所 | Data transmission method and system based on Hadoop |
CN107220069A (en) * | 2017-07-03 | 2017-09-29 | 中国科学院计算技术研究所 | A kind of Shuffle methods for Nonvolatile memory |
CN108595268A (en) * | 2018-04-24 | 2018-09-28 | 咪咕文化科技有限公司 | Data distribution method and device based on MapReduce and computer-readable storage medium |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113312316A (en) * | 2021-07-28 | 2021-08-27 | 阿里云计算有限公司 | Data processing method and device |
CN113312316B (en) * | 2021-07-28 | 2022-01-04 | 阿里云计算有限公司 | Data processing method and device |
WO2023005366A1 (en) * | 2021-07-28 | 2023-02-02 | 华为云计算技术有限公司 | Computing method and apparatus, device and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN111444148B (en) | 2023-09-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11711420B2 (en) | | Automated management of resource attributes across network-based services |
US9747124B2 (en) | | Distributed virtual machine image management for cloud computing |
JP2022542836A (en) | | NODE DATA SYNCHRONIZATION METHOD AND APPARATUS, SYSTEM, ELECTRONIC DEVICE, STORAGE MEDIUM AND COMPUTER PROGRAM |
CN112965945B (en) | | Data storage method, device, electronic equipment and computer readable medium |
CN109508326B (en) | | Method, device and system for processing data |
US11076020B2 (en) | | Dynamically transitioning the file system role of compute nodes for provisioning a storlet |
US11625192B2 (en) | | Peer storage compute sharing using memory buffer |
US10282115B2 (en) | | Object synchronization in a clustered system |
CN111444148B (en) | | Data transmission method and device based on MapReduce |
CN110781159B (en) | | Ceph directory file information reading method and device, server and storage medium |
US9893936B2 (en) | | Dynamic management of restful endpoints |
CN111338834B (en) | | Data storage method and device |
WO2023071566A1 (en) | | Data processing method and apparatus, computer device, computer-readable storage medium, and computer program product |
US20230055511A1 (en) | | Optimizing clustered filesystem lock ordering in multi-gateway supported hybrid cloud environment |
Chen et al. | | The research about video surveillance platform based on cloud computing |
CN113032349B (en) | | Data storage method, device, electronic equipment and computer readable medium |
CN113220237B (en) | | Distributed storage method, device, equipment and storage medium |
US20180349210A1 (en) | | Detecting deadlock in a cluster environment using big data analytics |
CN114691720A (en) | | Data query method, database system, readable medium and electronic device |
CN113761548B (en) | | Data transmission method and device for Shuffle process |
CN111581173A (en) | | Distributed storage method and device for log system, server and storage medium |
Estrada et al. | | The broker: Apache kafka |
US11526490B1 (en) | | Database log performance |
CN115277610B (en) | | Message split sending method, device, equipment and medium based on dual-activity environment |
US11483381B1 (en) | | Distributing cloud migration |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | CB02 | Change of applicant information | Address after: 210023 163 Xianlin Road, Qixia District, Nanjing, Jiangsu; Applicant after: NANJING University; Applicant after: Douyin Vision Co.,Ltd.; Address before: 210023 163 Xianlin Road, Qixia District, Nanjing, Jiangsu; Applicant before: NANJING University; Applicant before: Tiktok vision (Beijing) Co.,Ltd. |
 | CB02 | Change of applicant information | Address after: 210023 163 Xianlin Road, Qixia District, Nanjing, Jiangsu; Applicant after: NANJING University; Applicant after: Tiktok vision (Beijing) Co.,Ltd.; Address before: 210023 163 Xianlin Road, Qixia District, Nanjing, Jiangsu; Applicant before: NANJING University; Applicant before: BEIJING BYTEDANCE NETWORK TECHNOLOGY Co.,Ltd. |
 | GR01 | Patent grant | |