WO2017113278A1 - Data processing method, apparatus and system - Google Patents

Data processing method, apparatus and system

Info

Publication number
WO2017113278A1
Authority
WO
WIPO (PCT)
Prior art keywords
node
mapping
reduction
partition
data segment
Prior art date
Application number
PCT/CN2015/100081
Other languages
English (en)
French (fr)
Inventor
唐继元
王伟
蔡毅
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to CN201580063657.5A (CN108027801A)
Priority to PCT/CN2015/100081 (WO2017113278A1)
Priority to EP15911904.9A (EP3376399A4)
Publication of WO2017113278A1
Priority to US16/006,503 (US10915365B2)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/54 - Interprogram communication
    • G06F 9/5027 - Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F 9/5061 - Partitioning or combining of resources
    • G06F 9/5066 - Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs
    • G06F 15/167 - Interprocessor communication using a common memory, e.g. mailbox
    • G06F 16/182 - Distributed file systems
    • G06F 2209/5011 - Indexing scheme relating to G06F 9/50: Pool
    • G06F 3/0604 - Improving or facilitating administration, e.g. storage management
    • G06F 3/067 - Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]

Definitions

  • In one design, the specific implementation of the first mapping node acquiring the at least one data segment according to the execution result of the mapping task is as follows: during execution of the mapping task, the first mapping node spills (overflow-writes) the data cached in its buffer, each spill producing a spill file (overflow-write file), and stores the spill file to the first remote private partition of the first mapping node; after the first mapping node finishes the mapping task, it reads back from the first remote private partition all the spill files written at different points in time and merges the data segments contained in those spill files according to the key values corresponding to the different reduction nodes, thereby obtaining the data segments to be processed by each of the reduction nodes.
  • The present application further provides a computer device including a processor and a memory, the processor being coupled to the memory through a bus; the memory stores computer instructions, and when the computer device runs, the processor executes the computer instructions stored in the memory, causing the computer device to perform the data processing method and the various possible designs described above.
  • An execution unit, configured to acquire the first data segment from the file stored in the remote shared partition according to the response information, and to perform a reduction task on the first data segment.
  • FIG. 2 is an exemplary flowchart of performing a Map/Reduce task based on a remote shared partition
  • FIG. 5 is an optional refinement flowchart of step A401 in FIG. 4;
  • Figure 6 is a basic flow chart for describing a data processing method from the perspective of a reduction node
  • FIG. 8 is a schematic diagram showing a logical structure of a data processing apparatus 801 that can be used as a mapping node;
  • The CPU included in a computer device may be a light core (for example, an ARM-series processor that supports only single-threading), a many-core processor of the MIC (Many Integrated Core) architecture, or another core with data processing capability.
  • A computer device may have one or more types of cores, and different computer devices may have different cores; accordingly, the CPU pool may contain one or more types of CPUs, such as light cores and MIC cores.
  • A CPU in the CPU pool that is connected to a local memory can access that local memory, specifically through the memory bus; a CPU in the CPU pool that is not connected to the local memory cannot access it.
  • A computer device may have one or more I/O devices, such as disk I/O, network I/O, or other I/O devices; accordingly, the I/O pool 104 may include one or more types of I/O devices.
  • a controller connected to the CPU pool controls the allocation of CPU resources and schedules CPU resources to work in conjunction with other CPUs, memory, and I/O devices to perform system-assigned tasks.
  • There may be one or more CPU pools, for example CPU pool 101 and CPU pool 102; the CPUs within a CPU pool communicate with each other through a common controller, while CPUs in different CPU pools communicate through their respective controllers.
  • a controller connected to the I/O pool 104 controls the allocation of I/O resources and schedules I/O resources.
  • A CPU in the CPU pool triggers a message carrying an I/O access request, and the controller connected to the CPU pool sends that message to the controller connected to the I/O pool 104; the controller connected to the I/O pool 104 allocates I/O from the I/O pool 104 according to the I/O access request, and the access is then performed through the allocated I/O.
  • The controller connected to the I/O pool 104 can allocate I/O at the request of a CPU in the CPU pool and, based on message communication with the controller connected to the storage pool 103, output data stored in the storage pool 103 to an external device through the allocated I/O, and/or write data acquired from an external device through the allocated I/O into the storage media of the storage pool 103.
  • a controller connected to the storage pool 103 is configured to manage the storage resource, including allocating storage resources in the storage pool 103, and setting access rights for the allocated storage resources.
  • The controller of the storage pool 103 divides, from the storage pool 103, a remote shared partition shared by multiple CPUs in the CPU pool (including the CPU running a mapping node and the CPU running a reduction node).
  • A CPU accesses the remote shared partition as follows: the CPU sends a message carrying a partition access request to the controller of the CPU pool; through message communication between the controller connected to the storage pool 103 and the controller of the CPU pool, the controller of the storage pool 103 receives the message carrying the partition access request and accesses the remote shared partition specified by the partition access request.
  • The way a CPU accesses the remote shared partition therefore differs from the way it accesses local memory: multiple CPUs in the CPU pool may access the same remote shared partition, whereas local memory is accessible only to the local CPU connected to it, and the local CPU usually accesses local memory directly through the memory bus.
  • The controller of the storage pool 103 can set, for the multiple CPUs in the CPU pool, access rights for accessing the remote shared partition; the access right may be any of the following: read-only permission, write-only permission, read-write permission, or other permissions. If the controller of the storage pool 103 sets read-only permission on the remote shared partition for a certain CPU, that CPU mounts the remote shared partition with read-only permission and can then read data from the remote shared partition through the controller of the storage pool 103. If the controller of the storage pool 103 sets write-only permission on the remote shared partition for a certain CPU, that CPU mounts the remote shared partition with write-only permission and can then write data to the remote shared partition through the controller of the storage pool 103. (A sketch of mounting a partition with a given access right follows below.)
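  • As an illustration of mounting a remote shared partition with a given access right, the following minimal Python sketch simply wraps the Linux mount and umount commands; the device path, partition name, and mount point are hypothetical, and the actual naming is determined by the controller of the storage pool.

```python
import subprocess

def mount_partition(device, mount_point, read_only):
    """Mount a remote shared partition at mount_point.

    The access right granted by the storage pool controller is reflected
    in the mount options: "ro" for read-only, "rw" for read-write.
    """
    options = "ro" if read_only else "rw"
    subprocess.run(["mount", "-o", options, device, mount_point], check=True)

def unmount_partition(mount_point):
    """Unmount the partition, removing the corresponding partition device."""
    subprocess.run(["umount", mount_point], check=True)

# Hypothetical usage: a mapping node mounts remote shared partition 1
# read-write, while a reduction node mounts the same partition read-only.
# mount_partition("/dev/pool/shared1", "/mnt/shared1", read_only=False)
# mount_partition("/dev/pool/shared1", "/mnt/shared1", read_only=True)
```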
  • the embodiment of the present invention is based on the architecture shown in FIG. 1. To perform mapping/reduction tasks on the architecture, the following actions are first required:
  • the first action is to determine a CPU from the CPU pool for running the process that implements the management node (master). At least two CPUs are determined from the CPU pool for running processes that implement multiple worker nodes.
  • The mapping nodes and the reduction nodes run on different CPUs.
  • Multiple mapping nodes can run on one or more CPUs; if several mapping nodes run on the same CPU, they run as threads, and if they run on different CPUs, they run as processes.
  • Likewise, multiple reduction nodes can run on one or more CPUs; reduction nodes that share a CPU run as threads, and reduction nodes on different CPUs run as processes.
  • The management node applies, through message communication, to the controller of the storage pool 103 for a shareable remote shared partition; the controller of the storage pool 103 responds to the application by dividing out the remote shared partition and feeds back to the management node, through message communication, the partition name of the remote shared partition and the access rights for accessing it.
  • The management node then notifies each mapping node of the partition name of the remote shared partition and the access right requested for that mapping node.
  • The management node likewise notifies each reduction node of the partition name of the remote shared partition and the access right requested for that reduction node.
  • The mapping/reduction task is performed on a data set, and before the mapping task (Map task) is executed, the data set is divided into data fragments.
  • The specific division rule may be determined by task requirements and/or execution efficiency; for example, the data set may be divided into fragments using one or more sizes in the range of 16 MB to 64 MB, which is not intended to limit the present invention.
  • One data fragment serves as the input of one mapping node, so that the mapping node performs a mapping task on its input data fragment (a minimal splitting sketch follows below).
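  • A minimal sketch of such a division, assuming a fragment size of 64 MB and ignoring record boundaries for brevity:

```python
import os

def split_dataset(path, fragment_size=64 * 1024 * 1024):
    """Yield (offset, length) byte ranges that divide the data set at `path`
    into fragments of at most `fragment_size` bytes; each fragment becomes
    the input of one mapping node."""
    total = os.path.getsize(path)
    offset = 0
    while offset < total:
        length = min(fragment_size, total - offset)
        yield offset, length
        offset += length
```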
  • Embodiment 1 and Embodiment 2 explain the method for performing the mapping/reduction task on this architecture; Embodiment 3 extends Embodiment 1 and Embodiment 2 and explains the data processing method provided on the architecture from the perspective of the mapping node; Embodiment 4 extends Embodiment 1 and Embodiment 2 and explains the data processing method from the perspective of the reduction node; Embodiment 5 explains, from the perspective of the mapping node, the data processing apparatus corresponding to the method of Embodiment 3; Embodiment 6 explains, from the perspective of the reduction node, the data processing apparatus corresponding to the method of Embodiment 4; Embodiment 7 explains a computer device for the data processing methods of Embodiment 3 and Embodiment 4; and Embodiment 8 illustrates a system built on the architecture.
  • A specific process for executing a Map/Reduce task with a remote shared partition introduced is described in detail below with reference to FIG. 2; the process includes step S201, step S202, step S203, step S204, step S205, and step S206.
  • This embodiment describes the process of executing the Map/Reduce task with the data set divided into two data fragments (see FIG. 2); dividing the data set into exactly two fragments is intended to be illustrative and not limiting.
  • One data fragment is processed by the first mapping node, and the other data fragment is processed by the second mapping node.
  • The management node assigns three reduction tasks, and each reduction task is processed by one reduction node, so there are three reduction nodes in total: the first reduction node, the second reduction node, and the third reduction node. FIG. 2 shows the first reduction node; the second and third reduction nodes are not shown.
  • After determining the two mapping nodes and the three reduction nodes, the management node notifies the two mapping nodes of the identifiers of the three reduction nodes and the key values (Key) associated with each of the three reduction nodes.
  • the management node requests six remote shared partitions from the controller of the storage pool.
  • The specific rules by which the management node assigns these six remote shared partitions to the two mapping nodes and the three reduction nodes are as follows (a sketch of the general assignment scheme follows the list):
  • The remote shared partitions shared with the first mapping node are remote shared partition 1, remote shared partition 2, and remote shared partition 3, and the first mapping node is granted read-write permission on these three remote shared partitions.
  • The remote shared partitions shared with the second mapping node are remote shared partition 4, remote shared partition 5, and remote shared partition 6, and the second mapping node is granted read-write permission on these three remote shared partitions.
  • The remote shared partitions shared with the first reduction node are remote shared partition 1 and remote shared partition 4, and the first reduction node is granted read permission on remote shared partition 1 and remote shared partition 4.
  • The remote shared partitions shared with the second reduction node are remote shared partition 2 and remote shared partition 5, and the second reduction node is granted read permission on remote shared partition 2 and remote shared partition 5.
  • The remote shared partitions shared with the third reduction node are remote shared partition 3 and remote shared partition 6, and the third reduction node is granted read permission on remote shared partition 3 and remote shared partition 6.
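  • The assignment rule above generalizes to M mapping nodes and R reduction nodes (here 6 = 2 x 3 partitions): mapping node i gets read-write permission on partitions i*R+1 through i*R+R, and reduction node j gets read permission on partition i*R+j for every mapping node i. A minimal sketch of this bookkeeping, with hypothetical numbering, follows.

```python
def assign_shared_partitions(num_mappers, num_reducers):
    """Return the partition numbers assigned to each mapping node (read-write)
    and each reduction node (read), following the numbering scheme above:
    partitions 1..R belong to mapping node 0, R+1..2R to mapping node 1, etc."""
    mapper_rw = {}   # mapping node index -> partitions mounted read-write
    reducer_ro = {}  # reduction node index -> partitions mounted read-only
    for m in range(num_mappers):
        mapper_rw[m] = [m * num_reducers + r + 1 for r in range(num_reducers)]
    for r in range(num_reducers):
        reducer_ro[r] = [m * num_reducers + r + 1 for m in range(num_mappers)]
    return mapper_rw, reducer_ro

# With 2 mapping nodes and 3 reduction nodes this reproduces the example:
# mapper_rw  == {0: [1, 2, 3], 1: [4, 5, 6]}
# reducer_ro == {0: [1, 4], 1: [2, 5], 2: [3, 6]}
```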
  • Step S201: The first mapping node mounts remote shared partition 1, remote shared partition 2, and remote shared partition 3 with read-write access rights; specifically, the remote shared partitions can be mounted with the mount command.
  • Step S202: The first mapping node performs the Map task on its input data fragment. While the first mapping node executes the Map task, the execution results obtained from the Map task are written, in chronological order, into a ring memory buffer, and the contents of the ring memory buffer are then spilled to a local disk or to the storage pool. FIG. 2 shows only spilling to the storage pool; each spill produces one spill file.
  • The specific spill process is as follows: the execution results currently to be spilled from the ring memory buffer are split according to the key values (Key) associated with the Reduce tasks, yielding three data segments (equal to the number of reduction nodes). This description assumes that every batch of execution results contains a key value (Key) for each reduction node; if a batch of execution results contains no key value for some reduction node, no data segment is split out for that reduction node, and fewer than three data segments are obtained. Each of the three data segments is then sorted by key value (Key), and the three sorted data segments are spilled as a file to a local disk or to the storage pool. Each spill file obtained by a spill therefore contains three sorted data segments. (A minimal sketch of one spill is given below.)
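  • A minimal sketch of one spill, assuming the buffered map output is a list of (key, value) pairs, that the key-to-reduction-node assignment is given as a dictionary, and a hypothetical JSON spill-file layout (one object mapping a reduction-node id to its sorted pairs):

```python
import json
from collections import defaultdict

def spill(buffered_output, key_to_reducer, spill_path):
    """Split buffered (key, value) pairs by destination reduction node,
    sort each resulting segment by key, and write one spill file holding
    the sorted segments (at most one segment per reduction node)."""
    segments = defaultdict(list)
    for key, value in buffered_output:
        segments[key_to_reducer[key]].append((key, value))
    for reducer_id in segments:
        segments[reducer_id].sort(key=lambda kv: kv[0])
    with open(spill_path, "w") as f:
        json.dump({str(r): segs for r, segs in segments.items()}, f)
```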
  • After the first mapping node finishes executing the Map task, it obtains from each spill file the sorted data segments destined for the three reduction nodes. All sorted data segments that carry the key value (Key) of a single reduction node are sorted and merged to obtain the data segment to be processed by that reduction node; in this way, based on the sorted data segments in all spill files, the first mapping node merges out the data segments to be processed by each of the three reduction nodes. For example, according to the key value (Key) of the first reduction node, the first mapping node extracts, from the three sorted data segments contained in each spill file, the data segment destined for the first reduction node, and then sorts and merges the data segments obtained from all spill files to produce the first data segment, where the first data segment is the data segment to be processed by the first reduction node.
  • The first mapping node stores the three data segments, one file per segment, into remote shared partition 1, remote shared partition 2, and remote shared partition 3: the file containing the first data segment (to be processed by the first reduction node) is stored to remote shared partition 1, the file containing the data segment to be processed by the second reduction node is stored to remote shared partition 2, and the file containing the data segment to be processed by the third reduction node is stored to remote shared partition 3.
  • Each of the three reduction nodes is associated with a key value (Key), and remote shared partition 1, remote shared partition 2, and remote shared partition 3 are shared with the first, second, and third reduction nodes respectively, so the first mapping node can determine, from the key values carried by the three data segments, the remote shared partition to which each of the three data segments belongs.
  • A file name is set for each file, and the partition name of the remote shared partition together with the file name is recorded; the storage location may be the local memory or local disk of the computer device where the first mapping node runs, or, of course, another storage medium outside that computer device.
  • The file names set for files in different remote shared partitions may be the same or different; if files in different remote shared partitions have the same file name, the differing partition names of the remote shared partitions are used to distinguish the two files with the same name, each of which is associated with its corresponding reduction node.
  • A first message is then sent to the management node; the first message carries the address information of the first mapping node, where the address information includes the address and/or the identity identifier of the first mapping node, and the first message notifies the management node that the mapping node specified by the address information (the first mapping node) has finished executing the Map task assigned by the management node.
  • Remote shared partition 1, remote shared partition 2, and remote shared partition 3 are then unmounted; the first mapping node can unmount the three remote shared partitions with the umount command. "Unmounting" here means removing the corresponding partition device from the first mapping node, the opposite of the mounting described above.
  • Step S203: The management node (not shown in FIG. 2) receives the first message and determines, from the address and/or identity identifier in the first message, that the first mapping node has finished the assigned Map task; having made that determination, the management node sends a second message carrying the address information of the first mapping node to the first reduction node.
  • Alternatively, the first reduction node may send query requests to the management node at specified intervals; upon receiving the first message triggered by the first mapping node completing the mapping task, the management node responds to the latest query request by sending the second message, carrying the address information of the first mapping node, to the first reduction node.
  • After determining from the first message that the first mapping node has finished the assigned Map task, the management node also sends messages carrying the address information of the first mapping node to the second reduction node and the third reduction node, respectively.
  • Step S204: The first reduction node receives the second message sent by the management node and learns from it that the first mapping node has finished the assigned Map task. The first reduction node then generates a data request message according to the address information of the first mapping node and sends the data request message to the first mapping node. The data request message carries an identifier of the first reduction node, used to distinguish the first reduction node from the other two reduction nodes; for example, if the management node has previously numbered the three reduction nodes, the identifier of the first reduction node may be the number the management node assigned to it.
  • The second reduction node and the third reduction node likewise learn, from the messages sent by the management node, that the first mapping node has finished the assigned Map task, and then request data from the first mapping node.
  • Step S205: The first mapping node receives the data request message sent by the first reduction node.
  • The first mapping node obtains the identifier of the first reduction node from the data request message and matches it against the pre-stored identifiers of the three reduction nodes; the match establishes that the data request message was sent by the first reduction node.
  • The first mapping node responds to the data request message by generating a response message according to the identifier of the first reduction node; the response message carries the partition name of remote shared partition 1 and the file name of the file containing the first data segment.
  • The first mapping node sends the response message to the first reduction node.
  • Similarly, the first mapping node feeds back to the second reduction node the partition name of remote shared partition 2 and the file name of the file containing the data segment to be processed by the second reduction node, and feeds back to the third reduction node the partition name of remote shared partition 3 and the file name of the file containing the data segment to be processed by the third reduction node. (A sketch of this request/response exchange is given below.)
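  • The request/response exchange can be pictured as two small messages, sketched below; the field names are hypothetical, since the text only specifies that the request carries the reduction node's identifier and that the response carries the partition name and the file name.

```python
def build_data_request(reducer_id):
    """Data request message sent by a reduction node to a mapping node."""
    return {"type": "data_request", "reducer_id": reducer_id}

def build_response(request, reducer_to_partition, reducer_to_filename):
    """Response built by the mapping node: it looks up, for the requesting
    reduction node, the remote shared partition and the file that hold the
    data segment that node is to process."""
    reducer_id = request["reducer_id"]
    return {
        "type": "response",
        "partition_name": reducer_to_partition[reducer_id],
        "file_name": reducer_to_filename[reducer_id],
    }

# Example: the first reduction node (id 0) asks the first mapping node, which
# answers with remote shared partition 1 and the name of the segment file.
# build_response(build_data_request(0),
#                {0: "shared1", 1: "shared2", 2: "shared3"},
#                {0: "map0_seg0", 1: "map0_seg1", 2: "map0_seg2"})
```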
  • Step S206: The first reduction node receives the response message sent by the first mapping node and obtains the partition name of remote shared partition 1 from the response message.
  • The first reduction node mounts remote shared partition 1 with read-only permission according to the partition name of remote shared partition 1, for example by mounting remote shared partition 1 with the mount command.
  • The first reduction node obtains, from the response message, the file name of the file containing the first data segment.
  • The first reduction node communicates by messages with the controller of the storage pool; the controller of the storage pool locates in remote shared partition 1, according to the obtained file name (i.e., the file name carried in the response message), the file containing the first data segment, and the first data segment is read from that file.
  • Just as the first mapping node stores the data segments for the three reduction nodes, in one-to-one correspondence, into remote shared partition 1, remote shared partition 2, and remote shared partition 3, the second mapping node, after performing its Map task, writes the data segment to be processed by the first reduction node into remote shared partition 4 as a file, writes the data segment to be processed by the second reduction node into remote shared partition 5 as a file, and writes the data segment to be processed by the third reduction node into remote shared partition 6 as a file.
  • The first reduction node also mounts remote shared partition 4 and reads the data segment to be processed by the first reduction node from the file in remote shared partition 4. The first reduction node then performs the Reduce task on the first data segment read from remote shared partition 1 and the data segment read from remote shared partition 4. (A reducer-side sketch is given below.)
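  • A reducer-side sketch under the same hypothetical layout: the reduction node mounts each partition named in a response read-only, loads its segment file, groups the pairs by key, and applies a reduce function. mount_partition and unmount_partition are the helpers sketched earlier, and the segment files are assumed to hold JSON lists of (key, value) pairs.

```python
import json
import os

def run_reduce(responses, mount_root, reduce_fn):
    """Fetch this reduction node's data segments from every mapping node's
    remote shared partition and apply reduce_fn to the grouped values.

    `responses` is a list of response messages, one per mapping node, each
    carrying a partition name and a file name."""
    pairs = []
    for resp in responses:
        mount_point = os.path.join(mount_root, resp["partition_name"])
        os.makedirs(mount_point, exist_ok=True)
        mount_partition("/dev/pool/" + resp["partition_name"], mount_point,
                        read_only=True)           # mounted with read permission
        with open(os.path.join(mount_point, resp["file_name"])) as f:
            pairs.extend(tuple(kv) for kv in json.load(f))
        unmount_partition(mount_point)            # unmount after reading
    pairs.sort(key=lambda kv: kv[0])              # group values by key
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}
```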
  • The first reduction node may merge the execution result of the reduction task into a local storage medium (for example, a disk), merge it into the storage pool, or merge and write it to another storage medium; this is not limited here.
  • After the first reduction node has read the data segments from remote shared partition 1 and remote shared partition 4, it unmounts remote shared partition 1 and remote shared partition 4, for example with the umount command.
  • Similarly, the second reduction node mounts remote shared partition 2 according to the partition name fed back by the first mapping node and mounts remote shared partition 5 according to the partition name fed back by the second mapping node, and reads the data segments to be processed by the second reduction node from remote shared partition 2 and remote shared partition 5, respectively. After reading the data segments, the second reduction node unmounts remote shared partition 2 and remote shared partition 5, performs the Reduce task on the data segments read from the two partitions, and merges the execution result into a storage medium (such as a local disk or the storage pool).
  • The third reduction node mounts remote shared partition 3 according to the partition name fed back by the first mapping node and mounts remote shared partition 6 according to the partition name fed back by the second mapping node, and reads the data segments to be processed by the third reduction node from remote shared partition 3 and remote shared partition 6, respectively. After reading the data segments, the third reduction node unmounts remote shared partition 3 and remote shared partition 6, performs the Reduce task on the data segments read from the two partitions, and merges the execution result into a storage medium (such as a local disk or the storage pool).
  • Because the mapping nodes store the data segments in remote shared partitions, each reduction node can obtain the data segments it is to process directly from the remote shared partitions. This eliminates the prior-art steps in which, when a reduction node requests a data segment from a mapping node, the mapping node first reads the data segment from its local disk and then sends the read data segment to the reduction node over a TCP flow, which effectively saves time in executing the Map/Reduce task.
  • In addition, the mapping nodes and reduction nodes can read/write data in the remote shared partitions faster than they can read/write a local disk, further shortening the time needed to perform the mapping/reduction task.
  • In Embodiment 2, the data set is stored and managed using the Hadoop Distributed File System (HDFS); the data set is divided into two data fragments, and data fragment 301 and data fragment 302 in FIG. 3 are also stored in HDFS. Note that dividing the data set into exactly two fragments is only an example; the data set may be divided into one or more fragments, so the number of data fragments is not limited by this embodiment.
  • The management node (not shown in FIG. 3) selects two mapping nodes from the idle working nodes, shown in FIG. 3 as the first mapping node and the second mapping node; the first mapping node performs a Map task on data fragment 301, and the second mapping node performs a Map task on data fragment 302. In addition, the management node selects three reduction nodes from the idle working nodes, shown in FIG. 3 as the first reduction node, the second reduction node, and the third reduction node. After determining the two mapping nodes and the three reduction nodes, the management node notifies the two mapping nodes of the identifiers of the three reduction nodes and the key values (Key) associated with the three reduction nodes.
  • The management node requests six remote shared partitions from the controller of the storage pool: remote shared partition 1, remote shared partition 2, remote shared partition 3, remote shared partition 4, remote shared partition 5, and remote shared partition 6.
  • The controller of the storage pool sets sharing rights for the six remote shared partitions: remote shared partition 1 is shared by the first mapping node and the first reduction node, remote shared partition 2 by the first mapping node and the second reduction node, remote shared partition 3 by the first mapping node and the third reduction node, remote shared partition 4 by the second mapping node and the first reduction node, remote shared partition 5 by the second mapping node and the second reduction node, and remote shared partition 6 by the second mapping node and the third reduction node.
  • The management node also applies to the controller of the storage pool for a remote private partition for each mapping node and each reduction node: the first mapping node is assigned the private right to access remote private partition 331, the second mapping node the private right to access remote private partition 332, the first reduction node the private right to access remote private partition 333, the second reduction node the private right to access remote private partition 334, and the third reduction node the private right to access remote private partition 335.
  • A remote private partition is exclusive, that is, non-shared; only the node holding the private right can access it. For example, since the first mapping node is assigned the private right for remote private partition 331, the first mapping node can access remote private partition 331, but no other mapping node or reduction node can.
  • The flow of executing the Map/Reduce task in Embodiment 2 is implemented similarly to that in Embodiment 1, but there are three differences: a first difference, a second difference, and a third difference.
  • the first difference is that the first mapping node mounts the remote private partition 331.
  • The first mapping node performs the Map task on data fragment 301, caches the execution results of the Map task in a buffer, and spills the buffered execution results to remote private partition 331; each spill produces a spill file.
  • The first mapping node then reads back from remote private partition 331 the spill files written at different points in time and, based on the data segments in all the spill files read, merges out the first data segment (to be processed by the first reduction node), the data segment to be processed by the second reduction node, and the data segment to be processed by the third reduction node. The merge is implemented in the same way the first mapping node merges the three data segments in Embodiment 1; for the specific details of merging all the spill files into the data segments to be processed by the three reduction nodes, refer to Embodiment 1, which is not repeated here.
  • The first mapping node stores the file containing the first data segment to remote shared partition 1, the file containing the data segment to be processed by the second reduction node to remote shared partition 2, and the file containing the data segment to be processed by the third reduction node to remote shared partition 3.
  • The second mapping node mounts remote private partition 332; the way the second mapping node uses remote private partition 332 to store its spill files, and later reads the spill files back from remote private partition 332, is similar to that of the first mapping node and is not repeated here.
  • The second mapping node stores the file containing the data segment to be processed by the first reduction node to remote shared partition 4, the file containing the data segment to be processed by the second reduction node to remote shared partition 5, and the file containing the data segment to be processed by the third reduction node to remote shared partition 6.
  • When the data set is very large, correspondingly more storage space is needed to store the spill files produced by the mapping task; in that case the storage space of a local disk is limited, but the storage pool has enough storage space, so the spill files are stored in a remote private partition divided from the storage pool. Because the remote private partition is exclusive, the spill files cannot be illegally modified, which guarantees their security. If the remote private partition is divided from a memory pool, then compared with the prior art, the mapping nodes and reduction nodes can read/write data in the remote private partition faster than they can read/write a local disk, further reducing the time needed to perform the mapping/reduction task.
  • the second difference is that the first reduction node mounts the remote private partition 333.
  • The first reduction node mounts remote shared partition 1 and remote shared partition 4 and reads the data segments to be processed by the first reduction node from remote shared partition 1 and remote shared partition 4, respectively. The first reduction node first uses local memory to store the data segments read from remote shared partition 1 and remote shared partition 4; if local memory is insufficient, the remaining data segments (including those subsequently read from remote shared partition 1 and remote shared partition 4) are stored in remote private partition 333. Alternatively, remote private partition 333 can be used to store the data segments read from remote shared partition 1 and remote shared partition 4 without using local memory at all. (A sketch of this memory-first policy is given below.)
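  • A minimal sketch of this memory-first policy, assuming a byte budget for local memory and an already mounted remote private partition directory; both parameters are hypothetical.

```python
import os

class SegmentStore:
    """Keep fetched data segments in local memory up to `memory_budget`
    bytes; once the budget is exceeded, write further segments into the
    reduction node's remote private partition instead."""

    def __init__(self, memory_budget, private_partition_dir):
        self.memory_budget = memory_budget
        self.private_dir = private_partition_dir
        self.in_memory = []   # segments (as bytes) held in local memory
        self.used = 0
        self.spilled = 0      # count of segments written to the partition

    def add(self, segment_bytes):
        if self.used + len(segment_bytes) <= self.memory_budget:
            self.in_memory.append(segment_bytes)
            self.used += len(segment_bytes)
        else:
            path = os.path.join(self.private_dir, "segment_%d" % self.spilled)
            with open(path, "wb") as f:
                f.write(segment_bytes)
            self.spilled += 1
```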
  • the second reduction node mounts the remote private partition 334.
  • The second reduction node mounts remote shared partition 2 and remote shared partition 5 and reads the data segments to be processed by the second reduction node from remote shared partition 2 and remote shared partition 5, respectively. It likewise first uses local memory to store the data segments read from the two partitions; if local memory is insufficient, the remaining data segments (including those subsequently read from remote shared partition 2 and remote shared partition 5) are stored in remote private partition 334 of the second reduction node. Alternatively, remote private partition 334 can be used to store the data segments read from remote shared partition 2 and remote shared partition 5 without using local memory.
  • The third reduction node mounts the remote private partition 335.
  • The third reduction node mounts remote shared partition 3 and remote shared partition 6 and reads the data segments to be processed by the third reduction node from remote shared partition 3 and remote shared partition 6, respectively. It likewise first uses local memory to store the data segments read from the two partitions; if local memory is insufficient, the remaining data segments (including those subsequently read from remote shared partition 3 and remote shared partition 6) are stored in remote private partition 335 of the third reduction node. Alternatively, remote private partition 335 can be used to store the data segments read from remote shared partition 3 and remote shared partition 6 without using local memory.
  • Using a remote private partition to store the data segments increases the volume of data that a reduction node can process. Because the remote private partition is exclusive, the data segments are protected from illegal modification. If the remote private partition is divided from a memory pool, then compared with the prior art, the mapping nodes and reduction nodes can read/write data in the remote private partition faster than they can read/write a local disk, further reducing the time needed to perform the mapping/reduction task.
  • The third difference is that the first reduction node stores the result of executing the reduction task on its data segment, in the form of a file, to storage area 321 in HDFS, where storage area 321 is a storage space in HDFS; the second reduction node stores the result of executing the reduction task on its data segment, in the form of a file, to storage area 322 in HDFS, where storage area 322 is a storage space in HDFS; and the third reduction node stores the result of executing the reduction task on its data segment, in the form of a file, to storage area 323 in HDFS, where storage area 323 is a storage space in HDFS.
  • Alternatively, remote shared partition 1 may be used in place of remote private partition 331 to store the spill files spilled by the first mapping node; in that case, after the first mapping node finishes the Map task, it obtains all the spill files from remote shared partition 1.
  • Likewise, remote shared partition 2 can be used in place of remote private partition 332 to store the spill files; in that case the mapping node, after finishing its Map task, obtains all the spill files from remote shared partition 2.
  • More generally, the role played by the remote private partitions in Embodiment 2 can also be filled by other storage spaces in the storage pool.
  • After the data segments have been read, the controller of the storage pool is notified to reclaim the remote shared partition; after reclaiming it, the controller may allocate the remote shared partition to other task nodes (for example, other mapping nodes performing Map/Reduce tasks), which improves the utilization of the remote shared partition.
  • Similarly, after the mapping node has read the spill files from its remote private partition, the controller of the storage pool is notified to reclaim the remote private partition; after reclaiming it, the controller may subsequently allocate the remote private partition to other task nodes (for example, other mapping nodes performing Map/Reduce tasks), improving the utilization of the remote private partition.
  • In the same way, when a remote private partition is no longer needed, the controller of the storage pool is notified to reclaim it; after reclaiming the remote private partition, the controller may allocate it to other task nodes (for example, other mapping nodes performing Map/Reduce tasks), improving the utilization of the remote private partition.
  • An optional refinement is applied to the spill (overflow-write) process in Embodiment 1 and/or Embodiment 2; the refinement is as follows:
  • The mapping node caches the execution results of the Map task into the ring memory buffer in chronological order.
  • When the usage of the ring memory buffer reaches 80%, a spill of the ring memory buffer is triggered.
  • While the spill is in progress, the mapping node can continue writing the execution results of the Map task into the unused portion of the buffer, so the mapping node is not stopped from outputting execution results and the Map task itself is not interrupted.
  • Once the spill completes, the portion of the buffer that was used to cache the spilled execution results can be reused to cache further execution results of the Map task. (A sketch of such a buffer is given below.)
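  • A minimal, synchronous sketch of the 80% trigger: handing the filled portion of the buffer to a spill function and continuing to append stands in for reusing the spilled region of the ring buffer. The capacity is counted in records here, and `spill_fn` is assumed to perform the spill described earlier.

```python
class RingOutputBuffer:
    """Collect Map output records and trigger a spill once usage reaches the
    threshold, without stopping the Map task from emitting further records."""

    def __init__(self, capacity, spill_fn, threshold=0.8):
        self.capacity = capacity      # buffer size, in records, for this sketch
        self.spill_fn = spill_fn      # e.g. the spill() sketch above
        self.threshold = threshold    # 0.8 corresponds to the 80% usage trigger
        self.records = []

    def add(self, key, value):
        self.records.append((key, value))
        if len(self.records) >= self.threshold * self.capacity:
            to_spill, self.records = self.records, []  # spilled region is reusable
            self.spill_fn(to_spill)                    # writes one spill file
```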
  • The storage capacity of the storage pool is variable and usually very large, so it is impractical to manage the whole storage pool with a single file system. In the embodiments of the present invention, the controller of the storage pool therefore divides partitions from the storage pool on demand, as remote shared partitions and remote private partitions. A file system is set up for each partition separately, and each partition is named so that the partitions have partition names that distinguish them from one another. A partition can be mounted and unmounted like any other storage device, in the same way a hardware device is mounted in a Linux system and mapped to a file in the system; in the embodiments of the present invention, a partition in the storage pool can be mounted directly by its partition name, the files in the partition can then be accessed, and access rights can be set through parameters at mount time.
  • If the mapping node runs on a CPU in the decoupled CPU pool but the reduction node does not, the mapping node still stores the data segments to the remote shared partition; the reduction node then communicates with the mapping node to establish a TCP flow, and the mapping node reads from the remote shared partition the data segment to be processed by that reduction node and sends the read data segment to the reduction node over the TCP flow. The Map/Reduce implementation provided by this embodiment is therefore compatible with the prior-art Map/Reduce implementation; only the location where the data segments are stored differs.
  • Because reading a data segment from the remote shared partition is faster than reading it from a local disk, this still reduces the execution time of the Map/Reduce task. (A sketch of this compatibility path is given below.)
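  • A minimal sketch of this compatibility path, assuming the mapping node has the remote shared partition mounted locally and the reduction node outside the CPU pool connects over TCP; the host, port, and file path are hypothetical.

```python
import socket

def serve_segment(segment_path, port):
    """Mapping-node side: read the reduction node's data segment from the
    mounted remote shared partition and send it over an accepted TCP flow."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind(("0.0.0.0", port))
        srv.listen(1)
        conn, _ = srv.accept()
        with conn, open(segment_path, "rb") as f:
            conn.sendall(f.read())

def fetch_segment(mapper_host, port):
    """Reduction-node side (outside the CPU pool): receive the data segment
    over the TCP flow established with the mapping node."""
    chunks = []
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as cli:
        cli.connect((mapper_host, port))
        while True:
            chunk = cli.recv(65536)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks)
```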
  • Embodiment 3 correspondingly expands the two embodiments above and provides a basic workflow of the data processing method from the perspective of the mapping node; the system architecture to which the basic workflow applies is shown in FIG.
  • The system includes a CPU pool and a storage pool, the CPU pool being communicatively coupled to the storage pool.
  • The CPU pool includes at least two CPUs, and at least one mapping node and at least one reduction node run on the CPU pool. It can be seen that the two mapping nodes in Embodiment 1 and Embodiment 2 are merely examples, and the three reduction nodes in Embodiment 1 and Embodiment 2 are likewise merely examples.
  • The at least one mapping node includes a first mapping node, which is any one of the at least one mapping node; the at least one reduction node includes a first reduction node, which is any one of the at least one reduction node. The first mapping node and the first reduction node run on different CPUs in the CPU pool, and message communication between the first mapping node and the first reduction node is accomplished by the controller of the CPU pool forwarding messages, such as the data request message.
  • the remote shared partition included in the storage pool is shared by the first mapping node and the first reduction node.
  • the storage pool is a memory pool.
  • The management node applies to the controller of the storage pool, on behalf of the first mapping node and the first reduction node, for the remote shared partition and the rights to access it, and allocates the remote shared partition to the first mapping node and the first reduction node; for the allocation, refer to the descriptions of allocating remote shared partitions in Embodiment 1 and Embodiment 2.
  • The remote shared partition may be shared by all the mapping nodes in the at least one mapping node, may be shared by all the reduction nodes in the at least one reduction node, or may be shared only by the first mapping node and the first reduction node; at minimum it is shared by the first mapping node and the first reduction node.
  • the first mapping node and the first reduction node access the remote shared partition by using a mount mode.
  • the basic workflow shown in FIG. 4 is given from the perspective of the first mapping node.
  • the basic workflow provided in FIG. 4 includes: step A401, step A402, step A403, step A404, and step A405.
  • The data set is divided into one or more data fragments; for the specific division method, see the related description of the fourth action (the fourth of the actions performed before the mapping/reduction task on this architecture). The sizes of the data fragments may be the same or different.
  • a data fragment is used as input to a mapping node, and a mapping node performs a mapping task on a data fragment.
  • Step A401: The first mapping node performs a mapping task on a data fragment (i.e., the Map task described in Embodiment 1 and Embodiment 2) and acquires at least one data segment according to the execution result of the mapping task. Each of the at least one data segment is to be processed by a corresponding reduction node, and the at least one data segment includes a first data segment, where the first data segment refers to the data segment to be processed by the first reduction node.
  • Step A402: The first mapping node stores the first data segment, in the form of a file, in the remote shared partition; specifically, the first mapping node creates a new file in the remote shared partition and writes the first data segment into that file.
  • Step A403: The first mapping node receives a data request message sent by the first reduction node, where the data request message includes an identifier of the first reduction node.
  • Step A404: The first mapping node responds to the data request message by generating a response message according to the identifier of the first reduction node, where the response message includes the partition name of the remote shared partition storing the first data segment and the file name of the file containing the first data segment. If the remote shared partition is shared by multiple mapping nodes of the at least one mapping node and/or by multiple reduction nodes of the at least one reduction node, each file containing a data segment stored in the remote shared partition has a different file name, and the files containing data segments are distinguished by their file names.
  • Step A405: The first mapping node feeds back the response message to the first reduction node, so that the first reduction node, according to the response information, acquires the first data segment from the file with that file name in the remote shared partition with that partition name and performs a reduction task (i.e., the Reduce task described in Embodiment 1 and Embodiment 2) on the first data segment.
  • For the implementation details of how the first mapping node performs step A401, step A402, step A403, step A404, and step A405, refer to the corresponding description of the first mapping node in Embodiment 1 and Embodiment 2.
  • A non-shared storage space is allocated to each mapping node from the storage pool; the private storage space of a mapping node is referred to as its first remote private partition, so each mapping node exclusively owns a first remote private partition. It can be seen that the storage pool includes a first remote private partition exclusive to the first mapping node.
  • An optional refinement of step A401, from the perspective of how the first data segment is obtained, is shown in FIG. 5: the first mapping node acquiring the at least one data segment according to the execution result of the mapping task specifically includes step A501 and step A502.
  • Step A501: Store the spill files spilled by the first mapping node while performing the mapping task to the first remote private partition of the first mapping node, where a single spill file contains the data that the first mapping node writes out, during execution of the mapping task, from the buffer that caches the execution results of the mapping task.
  • Step A502: The spill files stored in the first remote private partition of the first mapping node are merged according to the key values corresponding to the different reduction nodes in the at least one reduction node, and the merging yields the at least one data segment. (A sketch of this merge is given below.)
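  • A minimal sketch of the merge in step A501/A502, under the spill-file layout assumed earlier (one JSON object per spill file mapping a reduction-node id to a sorted list of (key, value) pairs); heapq.merge keeps each per-reducer segment sorted by key.

```python
import heapq
import json
import os

def merge_spills(private_partition_dir, reducer_ids):
    """Merge all spill files stored in the mapping node's remote private
    partition into one sorted data segment per reduction node."""
    spills = []
    for name in sorted(os.listdir(private_partition_dir)):
        with open(os.path.join(private_partition_dir, name)) as f:
            spills.append(json.load(f))
    segments = {}
    for reducer_id in reducer_ids:
        runs = [spill.get(str(reducer_id), []) for spill in spills]
        # Each run is already sorted by key, so an n-way merge suffices.
        segments[reducer_id] = list(heapq.merge(*runs, key=lambda kv: kv[0]))
    return segments
```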
  • For details, refer to the corresponding description in Embodiment 2 of the first mapping node using remote private partition 331 to store the spill files.
  • The number of remote shared partitions is equal to the product of the number of mapping nodes and the number of reduction nodes, with each remote shared partition shared by one mapping node and one reduction node. Embodiment 1 and Embodiment 2 illustrate this with six remote shared partitions: remote shared partition 1 is shared by the first mapping node and the first reduction node, remote shared partition 2 by the first mapping node and the second reduction node, remote shared partition 3 by the first mapping node and the third reduction node, remote shared partition 4 by the second mapping node and the first reduction node, remote shared partition 5 by the second mapping node and the second reduction node, and remote shared partition 6 by the second mapping node and the third reduction node.
  • An optional refinement concerns how the data segment is stored in the corresponding remote shared partition; the first mapping node storing the first data segment, in the form of a file, in the remote shared partition includes the following:
  • The first mapping node stores a file containing the first data segment to a remote shared partition shared by the first mapping node and the first reduction node.
  • For the specific implementation details of the first mapping node storing the file containing the first data segment to the remote shared partition shared by the first mapping node and the first reduction node, refer, for example, to the implementation details in Embodiment 1 and Embodiment 2 of the first mapping node storing the file containing the first data segment in remote shared partition 1. Since one remote shared partition stores one file containing a data segment, the file names of different files may be the same or different.
  • Embodiment 4 correspondingly expands the foregoing Embodiment 1 and Embodiment 2 and provides a basic workflow of the data processing method from the perspective of the reduction node. The system to which this basic workflow applies is the same system as the one to which the basic workflow provided from the perspective of the mapping node in Embodiment 3 applies, and is not described here again.
  • The basic workflow shown in FIG. 6 is given from the perspective of the first reduction node and includes step A601, step A602, and step A603.
  • Step A601: After learning that the first mapping node has stored the first data segment to the remote shared partition, the first reduction node sends a data request message to the first mapping node, where the data request message includes the identifier of the first reduction node, and the first data segment refers to the data segment, among the at least one data segment obtained by the first mapping node according to the execution result of its mapping task, that is processed by the first reduction node;
  • Step A602: The first reduction node receives a response message fed back by the first mapping node, where the response message includes the partition name of the remote shared partition storing the first data segment and the file name of the file containing the first data segment, and the response message is generated by the first mapping node according to the identifier of the first reduction node when responding to the data request message;
  • Step A603: The first reduction node obtains the first data segment from the file stored in the remote shared partition according to the response message, and performs a reduction task on the first data segment (that is, the Reduce task described in Embodiment 1 and Embodiment 2).
  • For the implementation details of step A601, step A602, and step A603 as performed by the first reduction node, refer to the corresponding description of the first reduction node in Embodiment 1, Embodiment 2, and Embodiment 3. (The request/response exchange is sketched below.)
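A toy version of the exchange in steps A601 and A602; the message field names and the in-memory record of where each segment was stored are assumptions for illustration only.

```python
# The reduction node asks the mapping node for its segment (A601) and receives
# back the partition name and file name (A602).

def build_data_request(reducer_id):
    # A601: sent once the reducer learns the mapper has stored the segment.
    return {"type": "data_request", "reducer_id": reducer_id}

def build_response(request, segment_records):
    # Mapping-node side: look up where the segment for this reducer was stored.
    partition_name, file_name = segment_records[request["reducer_id"]]
    return {"type": "response",
            "partition_name": partition_name,
            "file_name": file_name}

if __name__ == "__main__":
    records = {1: ("shared_partition_1", "segment_for_reducer_1")}
    request = build_data_request(reducer_id=1)
    print(build_response(request, records))   # A602: what the reducer receives
```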
  • Optionally, the number of remote shared partitions is equal to the product of the number of mapping nodes and the number of reduction nodes, and each remote shared partition is shared by one mapping node and one reduction node. For example, Embodiment 1 and Embodiment 2 each illustrate six remote shared partitions, namely remote shared partition 1 through remote shared partition 6: remote shared partition 1 is shared by the first mapping node and the first reduction node, remote shared partition 2 is shared by the first mapping node and the second reduction node, remote shared partition 3 is shared by the first mapping node and the third reduction node, remote shared partition 4 is shared by the second mapping node and the first reduction node, remote shared partition 5 is shared by the second mapping node and the second reduction node, and remote shared partition 6 is shared by the second mapping node and the third reduction node. (Seen from the reduction node's side, this layout is sketched below.)
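Under the same consecutive-numbering assumption as before, each reduction node reads from one remote shared partition per mapping node; the helper below just makes that explicit.

```python
# Which remote shared partitions a given reduction node reads from: one per
# mapping node. The numbering rule is the same illustrative assumption used in
# the six-partition example (2 mappers x 3 reducers).

def partitions_for_reducer(reducer_index, num_mappers, num_reducers):
    return [m * num_reducers + reducer_index + 1 for m in range(num_mappers)]

if __name__ == "__main__":
    # The first reducer reads partitions 1 and 4, matching the example above.
    print(partitions_for_reducer(0, num_mappers=2, num_reducers=3))
```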
  • Based on the remote shared partitions described here, an optional refinement of how, in step A603, the first reduction node obtains the first data segment from the file stored in the remote shared partition according to the response message is shown in FIG. 7 and includes step A701 and step A702.
  • Step A701: The first reduction node determines, according to the partition name in the response message, the remote shared partition shared by the first mapping node and the first reduction node;
  • Step A702: The first reduction node reads the first data segment from the remote shared partition shared by the first mapping node and the first reduction node according to the file name in the response message.
  • For the implementation details of step A701 and step A702 as performed by the first reduction node, refer to the detailed description in Embodiment 1 and Embodiment 2 of how the first reduction node reads the first data segment, according to the response message, from the remote shared partition shared by the first mapping node and the first reduction node. (A sketch of this read path follows below.)
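A sketch of steps A701 and A702, in which "mounting" is simulated by looking the partition name up in a dictionary of local directories; the directory stand-in and the file name are assumptions so the example is self-contained.

```python
import os
import tempfile

# A701: pick the shared partition named in the response.
# A702: read the file named in the response from that partition.

def read_first_data_segment(response, mounted_partitions):
    partition_dir = mounted_partitions[response["partition_name"]]   # A701
    path = os.path.join(partition_dir, response["file_name"])        # A702
    with open(path, "r", encoding="utf-8") as f:
        return f.read().splitlines()

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        with open(os.path.join(d, "segment_for_reducer_1"), "w", encoding="utf-8") as f:
            f.write("k1\tv1\nk2\tv2")
        resp = {"partition_name": "shared_partition_1",
                "file_name": "segment_for_reducer_1"}
        print(read_first_data_segment(resp, {"shared_partition_1": d}))
```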
  • Optionally, a non-shared storage space is allocated to each reduction node from the storage pool, and the storage space private to each reduction node is defined as a second remote private partition, so each reduction node has an exclusive second remote private partition. It can be seen that the storage pool includes a second remote private partition that is exclusive to the first reduction node.
  • Based on using the second remote private partition to store data segments, the method includes: while storing the first data segment read from the remote shared partition in local memory, when the usage rate of the local memory reaches a preset usage rate, the first reduction node stores the first data segment subsequently read from the remote shared partition to the second remote private partition of the first reduction node. (A sketch of this spill-over rule follows below.)
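A minimal sketch of the spill-over rule, assuming an 80% preset usage rate and a fixed-size local memory measured in chunks; both figures are illustrative, not values defined by the application.

```python
# Keep segments in local memory until a preset usage rate is reached, then
# divert later reads to the reducer's second remote private partition.

class SegmentStore:
    def __init__(self, local_capacity, preset_usage_rate=0.8):
        self.local_capacity = local_capacity
        self.preset_usage_rate = preset_usage_rate
        self.local_memory = []
        self.second_remote_private_partition = []

    def usage_rate(self):
        return len(self.local_memory) / self.local_capacity

    def store(self, chunk):
        if self.usage_rate() < self.preset_usage_rate:
            self.local_memory.append(chunk)
        else:
            self.second_remote_private_partition.append(chunk)

if __name__ == "__main__":
    store = SegmentStore(local_capacity=10)
    for i in range(12):                     # chunks read from the shared partition
        store.store(f"chunk-{i}")
    print(len(store.local_memory), "chunks in local memory,",
          len(store.second_remote_private_partition), "in the private partition")
```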
  • Optionally, the first reduction node accesses the remote shared partition in mount mode.
  • Optionally, the storage pool is a memory pool. (A mount-style access helper is sketched below.)
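To picture mount-mode access, the helper below wraps the ordinary mount/umount commands; the device path under /dev/pool and the idea that a storage-pool partition appears as a mountable device are assumptions, since the application does not fix how the storage-pool controller exposes partitions. The example runs in dry-run mode so it only prints the commands it would execute.

```python
import subprocess

# Mount-mode access wrapped as a small context manager: mount on entry
# (read-only by default), unmount on exit.

class MountedPartition:
    def __init__(self, device, mount_point, read_only=True, dry_run=True):
        options = ["-o", "ro"] if read_only else []
        self.mount_cmd = ["mount"] + options + [device, mount_point]
        self.umount_cmd = ["umount", mount_point]
        self.dry_run = dry_run

    def _run(self, cmd):
        if self.dry_run:
            print("would run:", " ".join(cmd))   # dry run: just show the command
        else:
            subprocess.run(cmd, check=True)

    def __enter__(self):
        self._run(self.mount_cmd)
        return self

    def __exit__(self, exc_type, exc, tb):
        self._run(self.umount_cmd)
        return False

if __name__ == "__main__":
    with MountedPartition("/dev/pool/shared_partition_1", "/mnt/shared_partition_1"):
        pass  # a reduction node would read the data-segment file here
```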
  • The data processing apparatus provided in Embodiment 5 includes the apparatus units that can implement the method flow provided in Embodiment 3. For brevity, the actions performed by the functional units provided in Embodiment 5 are not described in detail here; refer directly to the corresponding action descriptions in the method flows provided in Embodiment 1, Embodiment 2, and Embodiment 3.
  • An obtaining unit 802, configured to perform a mapping task on a data fragment and obtain at least one data segment according to the execution result of the mapping task, where each of the at least one data segment is processed by a corresponding reduction node, the at least one data segment includes a first data segment, and the first data segment refers to the data segment processed by the first reduction node;
  • a storage unit 803, configured to store the first data segment in a file format in the remote shared partition.
  • Optionally, the obtaining unit 802 is configured to store the spill files written out by the first mapping node while it performs the mapping task to the first remote private partition of the first mapping node, where a single spill file includes the data written out, in one spill, from the buffer that caches the execution result of the mapping task while the first mapping node performs the mapping task;
  • the obtaining unit 802 is configured to merge the multiple spill files stored in the first remote private partition of the first mapping node according to the key values corresponding to the different reduction nodes in the at least one reduction node, so that the merging yields the at least one data segment.
  • the storage pool is a memory pool.
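As a structural sketch only, the class skeleton below mirrors the unit decomposition of apparatus 801 (the response unit 804 is described in Embodiment 5); the method bodies are placeholders, since the application does not prescribe how each unit is realised.

```python
# Skeleton of the mapping-node-side data processing apparatus 801.

class ObtainingUnit:                      # unit 802
    def obtain_segments(self, data_fragment):
        raise NotImplementedError("run the mapping task, spill, and merge")

class StorageUnit:                        # unit 803
    def store_segment(self, first_data_segment, remote_shared_partition):
        raise NotImplementedError("store the segment as a file in the partition")

class ResponseUnit:                       # unit 804 (described in Embodiment 5)
    def handle_data_request(self, data_request_message):
        raise NotImplementedError("reply with the partition name and file name")

class DataProcessingApparatus801:
    def __init__(self):
        self.obtaining_unit = ObtainingUnit()
        self.storage_unit = StorageUnit()
        self.response_unit = ResponseUnit()

if __name__ == "__main__":
    print(sorted(vars(DataProcessingApparatus801())))   # the three unit attributes
```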
  • The first mapping node provided here has the same functions as the first mapping node provided in the foregoing method embodiments (Embodiment 1, Embodiment 2, Embodiment 3, and Embodiment 4) and performs its actions on the same principle; the first reduction node has the same functions as the first reduction node provided in the foregoing method embodiments and performs its actions on the same principle.
  • The specific functions of, and the operations that can be performed by, the first mapping node and the first reduction node in Embodiment 8 are not described in detail here; refer directly to Embodiment 1, Embodiment 2, Embodiment 3, and Embodiment 4.
  • Optionally, the first mapping node is implemented by the data processing apparatus of Embodiment 5, and the first reduction node is implemented by the data processing apparatus of Embodiment 6.
  • Optionally, the storage pool includes a first remote private partition that is exclusive to the first mapping node.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A data processing method, apparatus, and system. The system to which the method applies includes a central processing unit (CPU) pool and a storage pool; mapping nodes and reduction nodes run on different CPUs in the CPU pool; remote shared partitions shared by the mapping nodes and the reduction nodes are divided from the storage pool. In the method, a mapping node executes a map task and stores the data segments obtained from executing the map task in a remote shared partition; a reduction node obtains the data segment it is to process directly from the remote shared partition and executes a reduce task on that data segment. The method dispenses with some actions of the prior art and thereby shortens the time needed to execute map/reduce tasks; the actions dispensed with include the mapping node writing the data segment to a local disk through disk input/output (I/O) and then reading the data segment back from the local disk through disk I/O when the reduction node requests it.

Description

数据处理方法、装置和系统 技术领域
本发明实施例涉及计算机领域,尤其涉及数据处理方法、装置和系统。
背景技术
映射/归约(Map/Reduce),是一种编程模型,用于大规模数据集的并行运算,如大于1万亿字节(英文全称Terabyte,英文简称TB)的数据集的并行运算。
在处理数据集时,将数据集划分为多个数据分片,由管理节点(master)调度工作节点(worker)来处理各个数据分片。master为空闲的worker分配映射任务(Map任务),分配到Map任务的worker成为映射节点。另外,master为空闲的其它worker分配归约任务(Reduce任务),分配到Reduce任务的worker成为归约节点;映射节点将执行Map任务的结果暂存在环形内存缓存区中,再通过磁盘输入/输出(英文全称Input/Output,英文简称I/O)将环形内存缓存区中的结果溢写到磁盘中,每次溢写得到一个溢写文件;溢写生成溢写文件的过程中,根据各个归约节点处理的键值(Key)分别对环形内存缓存区中的结果进行拆分(partition)、排序(sort)。待映射节点执行完Map任务,再读取磁盘中的溢写文件并进行合并(merge),合并成一个文件,并将合并成的文件再次写入磁盘;因此,拆分(partition)、排序(sort)和合并(merge)的过程中会多次利用磁盘I/O进行磁盘读写。映射节点执行完Map任务时通知master,master再转通知归约节点:该映射节点的身份标识;归约节点根据身份标识向该映射节点请求数据,映射节点与归约节点建立传输控制协议(英文全称Transmission Control Protocol,英文简称TCP)流,映射节点从磁盘存储的文件中读取出由该归约节点处理的数据,并通过该TCP流向该归约节点发送读取到的该数据。映射节点通过该TCP流向归约节点发送数据的过程中,映射节点需要利用磁盘I/O从磁盘读取数据,并利用网络I/O来向归约节点传输 载有该数据的TCP流,而利用磁盘I/O进行磁盘读写和利用网络I/O完成向归约节点的数据传输,非常耗时,延长了完成Map/Reduce任务的执行时间。
发明内容
有鉴于此,本申请提供了一种数据处理方法、装置和系统,可以减少执行Map/Reduce任务的执行时间。
一方面,本申请提供一种数据处理方法,所述方法适用的系统包括存储池,从存储池中划分了远端共享分区;所述方法包括:映射节点执行映射任务,并将执行映射任务所得的数据段存储至远端共享分区,归约节点直接从远端共享分区获取该数据段,对该数据段执行归约任务。此处的映射节点和归约节点可以是运行同一个中央处理器(英文全称Central Processing Unit,英文简称CPU)上,也可以是运行在不同CPU上。并且,所述CPU可能属于非解耦的计算机设备,也可能属于CPU池。
可见,本申请提供的方法省去了现有技术的部分动作,缩短了执行映射/归约任务的时间,省去的部分动作包括:映射节点将映射任务的执行结果通过磁盘I/O写入本地磁盘,在归约节点请求该执行结果时再通过磁盘I/O从本地磁盘读取该映射任务的执行结果,并通过网络I/O向归约节点发送载有该执行结果的TCP流。
在一个可能的设计中,所述方法适用的系统包括中央处理器CPU池和存储池,所述CPU池与所述存储池通信连接,具体可以通过CPU池的控制器与存储池的控制器之间的消息通信来实现所述CPU池与所述存储池的通信连接。CPU池还具有多个CPU,CPU池内的两个CPU之间的通信是通过CPU池的控制器实现。映射节点和归约节点运行在CPU池内的不同CPU上。
映射节点为多个,此处用第一映射节点表示多个映射节点中的任一个映射节点;归约节点也为多个,此处用第一归约节点表示多个归约节点中的任一个归约节点。下面以第一映射节点和第一归约节点为例说明本申请的方法如下。
所述第一映射节点对数据分片执行映射任务,根据所述映射任务的执行结果获取至少一个数据段。对于所述至少一个数据段中的每个数据段,一个由一个归约节点处理,并且不同数据段由不同归约节点处理,其中,第一数据段是指所述至少一个数据段中由第一归约节点处理的数据段。
所述第一映射节点将数据段在所述远端共享分区中存储,实现存储的方式是:在所述远端共享分区为所有数据段各自建立一个文件,并将数据段写入为其建立的文件中。因此,在所述远端共享分区中存储由包含所述第一数据段的文件。
所述第一归约节点获知所述第一映射节点将第一数据段存储到所述远端共享分区后,向所述第一映射节点发送数据请求消息。
所述第一映射节点接收所述第一归约节点发送的数据请求消息,响应所述数据请求消息,具体响应方式可以是根据所述数据请求消息携带的所述第一归约节点的标识生成响应消息,向所述第一归约节点反馈所述响应消息。
所述第一归约节点接收所述第一映射节点反馈的响应消息。由于所述响应消息携带有存储所述第一数据段的远端共享分区的分区名称和包含所述第一数据段的文件的文件名,所述第一归约节点根据所述响应信息携带的分区名称挂载远端共享分区,并根据所述响应信息携带的文件名从远端共享分区查找到包含所述第一数据段的文件,从该文件中读取第一数据段。以此类推,各个映射节点会将由第一归约节点处理的数据段存储至远端共享分区,因此第一归约节点可以从远端共享分区获取到所有由其处理的数据段(包括第一数据段),对获取到的所有数据段执行归约任务。
在本设计中,映射节点与归约节点是运行在CPU池中的不同CPU上,通过消息进行通信;相对于在非解耦的CPU上执行映射/归约任务,如果CPU池具有海量CPU资源,本申请可以分配更多CPU来执行映射/归约任务,可以加快映射/归约任务的时间,还可以支持一次性对更大的数据集执行映射/归约任务。
在一个可能的设计中,第一映射节点在完成映射任务的执行后,会将数据段存储至远端共享分区,并会通过消息通知管理节点第一映射节点已完成映射任务;第一归约节点再从管理节点获知第一映射节点已完成映射任务, 在获知第一映射节点已完成映射任务的情况下,便获知所述第一映射节点已将第一数据段存储到所述远端共享分区,从而向第一归约节点发送数据请求消息。
在一个可能的设计中,存储池具体为内存池,相对于现有技术,映射节点和归约节点向远端共享分区(属于存储池)读/写数据的速度快于向本地磁盘读/写数据的速度,进一步缩短了执行映射/归约任务的时间。
在一个可能的设计中,映射节点和归约节点均使用挂载方式访问所述远端共享分区,访问完后,均使用卸载方式来停止对所述远端共享分区的访问。
在一个可能的设计中,所述存储池包括被第一映射节点独享的第一远端私有分区,即其他映射节点或者归约节点不能访问该第一远端私有分区。
第一映射节点根据所述映射任务的执行结果获取至少一个数据段的具体实现是:第一映射节点在执行映射任务的过程中会对缓存区存储的数据进行溢写,一次溢写得到一个溢写文件,本设计将溢写所得的溢写文件存储至所述第一映射节点的第一远端私有分区;待第一映射节点执行完映射任务,再从所述第一映射节点的第一远端私有分区中读取不同时间点溢写的所有溢写文件,并对所有溢写文件所包含的数据分片根据不同归约节点所对应的键值分别合并,合并得到所有归约节点各自处理的不同数据段。
可选地,使用远端共享分区存储溢写文件来替代使用第一远端私有分区存储溢写文件,使用远端共享分区存储溢写文件的实现方式与使用第一远端私有分区存储溢写文件的实现方式类似,在此不再赘述。
在一个可能的设计中,所述远端共享分区为多个,所述远端共享分区的个数等于所述映射节点的个数与所述归约节点的个数的乘积;每个所述远端共享分区被一个所述映射节点和一个所述归约节点共享;不同所述远端共享分区,被共享的映射节点不同,或者被共享的归约节点不同,或者被共享的映射节点和归约节点均不同。其中,有一个远端共享分区是被共享给第一映射节点和第一归约节点的,相应地,第一映射节点是将包含所述第一数据段的文件存储至对所述第一映射节点和所述第一归约节点共享的远端共享分区 的,所述第一归约节点是从对所述第一映射节点和所述第一归约节点共享的远端共享分区读取第一数据段的。
在一个可能的设计中,所述存储池包括被第一归约节点独享的第二远端私有分区,即其他归约节点或者映射节点不能访问该第二远端私有分区。
本设计中可选地,所述第一归约节点从所述远端共享分区读取第一数据段,首先将读取的第一数据段存储在本地内存,在所述本地内存的使用率达到预设使用率时,将后续从所述远端共享分区读取的所述第一数据段存储至所述第一归约节点的第二远端私有分区。
本设计中可选地,所述第一归约节点从所述远端共享分区读取第一数据段,将读取的第一数据段存储至第一归约节点的第二远端私有分区。
又一方面,本申请提供了一种计算机设备,所述计算机设备包括处理器和存储器,所述处理器与存储器总线连接,所述存储器用于存储计算机指令,当所述计算机设备运行时,所述处理器执行所述存储器存储的所述计算机指令,以使所述计算机设备执行上述的数据处理方法及上述的各个可能设计。
又一方面,本申请提供了一种数据处理装置,所适用的系统与上述方法所适用的系统相同。所述装置用于实现第一映射节点,所述装置包括获取单元、存储单元和响应单元。
获取单元,用于对数据分片执行映射任务,根据所述映射任务的执行结果获取至少一个数据段,所述至少一个数据段中的每个数据段由一个与之相对应的归约节点处理,其中,所述至少一个数据段包括第一数据段,第一数据段是指由所述第一归约节点处理的数据段。
存储单元,用于将所述第一数据段以文件格式存储在所述远端共享分区中。
响应单元,用于接收所述第一归约节点发送的数据请求消息,所述数据请求消息包含所述第一归约节点的标识,响应所述数据请求消息并根据所述第一归约节点的标识生成响应消息,所述响应消息包括存储所述第一数据段的远端共享分区的分区名称和包含所述第一数据段的文件的文件名,向所述第一归约节点反馈所述响应消息。
在一个可能的设计中,所述存储池包括被第一映射节点独享的第一远端 私有分区。
所述获取单元,用于根据所述映射任务的执行结果获取至少一个数据段,具体包括:所述获取单元,用于将所述第一映射节点执行映射任务时溢写的溢写文件存储至所述第一映射节点的第一远端私有分区,其中,单个所述溢写文件包括所述第一映射节点执行所述映射任务过程中单次从缓存所述映射任务的执行结果的缓冲区中溢写出的数据;对所述第一映射节点的第一远端私有分区中存储的多个所述溢写文件根据所述至少一个归约节点中不同所述归约节点所对应的键值分别合并,合并得到所述至少一个数据段。
在一个可能的设计中,所述远端共享分区的个数等于所述映射节点的个数与所述归约节点的个数的乘积,每个所述远端共享分区被一个所述映射节点和一个所述归约节点共享。
所述存储单元,用于将所述数据段以文件格式存储在远端共享分区中,具体包括:所述存储单元,用于将包含所述第一数据段的文件,存储至被所述第一映射节点和所述第一归约节点共享的远端共享分区。
应知,本方面及细节设计所提供的数据处理装置,该装置包括的各个单元可用于实现上述方法中第一映射节点所具有的功能。
又一方面,本申请提供了一种数据处理装置,所适用的系统与上述方法所适用的系统相同。所述装置用于实现第一归约节点,所述装置包括请求单元、接收单元和执行单元。
请求单元,用于获知所述第一映射节点将第一数据段存储到所述远端共享分区后,向所述第一映射节点发送数据请求消息,所述数据请求消息包含所述第一归约节点的标识,其中,所述第一数据段是指在所述第一映射节点根据其执行映射任务所得的执行结果获取到的至少一个数据段中由所述第一归约节点处理的数据段。
接收单元,用于接收所述第一映射节点反馈的响应消息,所述响应消息包括存储所述第一数据段的远端共享分区的分区名称和包含所述第一数据段的文件的文件名,所述响应消息由所述第一映射节点在响应所述数据请求消息时根据所述第一归约节点的标识生成。
执行单元,用于根据所述响应信息从所述远端共享分区存储的所述文件 中获取所述第一数据段,对所述第一数据段执行归约任务。
在一个可能的设计中,所述远端共享分区的个数等于所述映射节点的个数与所述归约节点的个数的乘积,每个所述远端共享分区被一个所述映射节点和一个所述归约节点共享。
所述执行单元,用于根据所述响应信息从所述远端共享分区存储的所述文件中获取所述第一数据段,包括:所述执行单元,用于根据所述响应信息中的所述分区名称,确定被所述第一映射节点和所述第一归约节点共享的远端共享分区;从被所述第一映射节点和所述第一归约节点共享的远端共享分区中,根据所述响应信息中的所述文件名读取所述第一数据段。
在一个可能的设计中,所述存储池包括被第一归约节点独享的第二远端私有分区。
所述归约节点包括转存储单元;转存储单元,用于在将从所述远端共享分区读取的所述第一数据段存储在本地内存的过程中,在所述本地内存的使用率达到预设使用率时,将后续从所述远端共享分区读取的所述第一数据段存储至所述第一归约节点的第二远端私有分区。
应知,本方面及细节设计所提供的数据处理装置,该装置包括的各个单元可用于实现上述方法中第一归约节点所具有的功能。
又一方面,本申请提供了一种系统,所述系统为所述方法适用的系统,也为上述装置所适用的系统;所述系统包括内存池、上述第一映射节点和上述第一归约节点。
相较于现有技术,映射节点利用存储池中的远端共享分区存储映射任务的执行结果,归约节点可直接从该远端共享分区读取该映射任务的执行结果,减少了执行Map/Reduce任务的执行时间。
附图说明
图1为硬件解耦场景的系统结构示意图;
图2为基于远端共享分区执行Map/Reduce任务的一种示范性流程图;
图3为基于远端共享分区和远端私有分区执行Map/Reduce任务的一种示范 性流程图;
图4为从映射节点角度描述数据处理方法的一种基础流程图;
图5为图4中步骤A401的一种可选细化流程图;
图6从归约节点角度描述数据处理方法的一种基础流程图;
图7为图6中步骤A603的一种可选细化流程图;
图8为可用作映射节点的数据处理装置801的一种逻辑结构示意图;
图9为可用作归约节点的数据处理装置901的一种逻辑结构示意图;
图10为计算机设备1000的一种硬件架构示意图。
具体实施方式
下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行描述,显然,所描述的实施例是本发明一部分实施例,而不是全部的实施例;基于本发明中的实施例,本领域普通技术人员所获得的其他实施例,都属于本发明保护的范围。
如图1所示,对多个计算机设备包括的中央处理器(英文全称Central Processing Unit,英文简称CPU)、存储介质和输入/输出(英文全称Input/Output,英文简称I/O)设备解耦,由解耦后的存储介质组成一个或多个存储池103(图1示意为1个存储池103),由解耦后的CPU组成一个或多个CPU池(图1示意为CPU池101和CPU池102),由解耦后的I/O设备组成一个或多个I/O池104(图1示意为1个I/O池104)。
存储介质,包括内存(英文全称Memory,英文简称Mem)、移动硬盘、只读存储器(英文全称:Read-Only Memory,简称ROM)、随机存取存储器(英文全称:Random Access Memory,简称RAM)、磁碟或者光盘等各种可以存储程序代码的介质。一个计算机设备可能具有一种或多种存储介质,不同计算机设备具有的存储介质可能不同,因此在通常情况下,存储池103包含的存储介质不止一种类型。
可选地,为了提高数据读写效率,仅由解耦后的内存(英文全称Memory,英文简称Mem)组成存储池103,这样的存储池103实际上是内存池。
计算机设备包括的CPU可以是轻核(如部分仅支持单线程的ARM系列处理器),可以是MIC(英文全称Many Integrated Core)架构的众核,还可能是其它具有数据处理能的核。一个计算机设备可能具有一种或多种类型的核,不同计算机设备具有的核可能不同,相应地,CPU池具有的CPU可能是一种或多种类型,例如:轻核、MIC。
可选地,计算机设备中存在未解耦到存储池103的本地内存(图1未示意),CPU池中与该本地内存连接的CPU(通常与该本地内存配置在同一计算机设备中)可以访问,具体是通过内存总线访问;但CPU池中未与该本地内存连接的CPU不能访问该本地内存。
计算机设备可能具有一种或多种I/O设备,例如磁盘I/O、网络I/O或者其它I/O设备。相应地,该I/O池104包括的I/O设备可能为一种或多种类型。
如图1所示,CPU池连接的控制器,存储池103连接的控制器,I/O池104连接的控制器,通过高速互联通道通信连接,该高速互联通道可选地为利用硅光搭建的光通道;当然,各控制器之间也可以通过其它介质或网络实现各个控制器的通信连接,在此不应构成对本发明的限定。各个控制器之间通信连接的情况下,各个组件(CPU池中的CPU、存储池103中的内存、I/O池104中的I/O)可以通过控制器之间的消息交互实现组件之间的消息交互。
与CPU池连接的控制器,控制CPU资源的分配,和对CPU资源进行调度,以便与其它CPU、内存、I/O设备协同工作,完成系统分配的任务。CPU池可能是一个或多个,如1示意为CPU池101和CPU池102;CPU池内的各个CPU之间通过共同的控制器进行消息通信;不同CPU池中的CPU,需利用各自的控制器之间的消息通信能力,通过控制器之间的消息通信实现相互(不同CPU池中的CPU)的消息通信;例如,CPU池101内的CPU与CPU池102内的CPU进行消息通信的具体实现是:CPU池101内的CPU首先将消息发送至CPU池101的控制器,CPU池101的控制器将该消息转发至CPU池102的控制器,CPU池102的控制器再将该消息发送至CPU池102的CPU。
与I/O池104连接的控制器,控制I/O资源的分配,和对I/O资源进行调度。一个举例,CPU池中的CPU触发载有I/O访问请求的消息,并通过CPU 池连接的控制器发送该消息至I/O池104连接的控制器;I/O池104连接的控制器,根据该I/O访问请求从I/O池104分配I/O,通过分配的I/O访问外部设备,例如向外部设备发送指令,再例如从外部设备获取数据。又一个举例,I/O池104连接的控制器,可以根据CPU池中的CPU的请求分配I/O,并基于与存储池103连接的控制器的消息通信,将存储池103中存储的数据通过分配的I/O向外部设备输出,和/或将从分配的I/O获取的外部设备的数据写入该存储池103的存储介质中。
与存储池103连接的控制器,用于管理该存储资源,包括分配存储池103中存储资源、为分配的存储资源设置访问权限。本实施例中,存储池103的控制器,从该存储池103中划分出对CPU池中多个CPU(包括运行有映射节点的CPU和运行有归约节点的CPU)共享的远端共享分区,该CPU访问该远端共享分区的实现方式为:该CPU向CPU池的控制器发送载有分区访问请求的消息,基于存储池103连接的控制器与CPU池的控制器之间的消息通信,存储池103的控制器接收到该载有分区访问请求的消息并对该分区访问请求指定的远端共享分区进行访问。可见,CPU访问远端共享分区的实现方式与访问本地内存的实现方式不同,CPU池中的多个CPU可能可以访问同一个远端共享分区,而本地内存仅供与本地内存连接的本地CPU访问,并且本地CPU通常是通过内存总线直接访问本地内存。
存储池103的控制器可以为CPU池中多个CPU设置访问该远端共享分区的访问权限,可以设置的访问权限为以下任一种:只读权限、只写权限、可读可写权限或者其他权限。如果存储池103的控制器为某个CPU设置了该远端共享分区的只读权限,该个CPU以只读权限挂载该远端共享分区,进而可以通过存储池103的控制器从该远端共享分区读取数据。如果存储池103的控制器为某个CPU设置的是该远端共享分区的只写权限,该个CPU以只写权限挂载该远端共享分区,进而可以通过存储池103的控制器向该远端共享分区写入数据。如果存储池103的控制器为某个CPU设置的是该远端共享分区的可读可写权限,该个CPU以可读可写权限挂载该远端共享分区,既可以通过存储池103的控制器从该远端共享分区读取数据,还可以通过存储池103的控制器向该远端共享分区写入数据。这里所说的挂载,是指将该远端共享 分区看作文件,挂接到一个已存在的目录上,通过该目录来访问远端共享分区中的文件。
具体在本发明实施例中,从存储池103中划分出的远端共享分区,供执行映射/归约(Map/Reduce)任务使用;映射节点和归约节点运行在CPU池中的不同CPU上,映射节点利用该远端共享分区存储映射任务的执行结果,归约节点可直接从该远端共享分区读取映射任务的执行结果,对该映射任务的执行结果执行归约任务,省去了现有技术的部分动作,缩短了执行映射/归约任务的时间,省去的部分动作包括:映射节点将映射任务的执行结果通过磁盘I/O写入本地磁盘,在归约节点请求该执行结果时再通过磁盘I/O从本地磁盘读取该映射任务的执行结果,并通过网络I/O向归约节点发送载有该执行结果的TCP流。如果存储池103具体为内存池,相对于现有技术,映射节点和归约节点向远端共享分区(属于存储池103)读/写数据的速度快于向本地磁盘读/写数据的速度,进一步缩短了执行映射/归约任务的时间。
本发明实施例基于如图1所示的架构,为在该架构上执行映射/归约任务,首先需要做如下动作:
第一个动作,从CPU池中确定一个CPU,用于运行实现管理节点(master)的进程。从CPU池中确定至少两个CPU,用于运行实现多个工作节点(worker)的进程。
第二个动作,管理节点为某个空闲的工作节点分配映射任务,执行映射任务的工作节点成为映射节点,映射节点不属于空闲的工作节点。管理节点为某个空闲的工作节点分配归约任务,执行归约任务的工作节点成为归约节点,归约节点不属于空闲的工作节点。可见,映射节点与归约节点为不同的节点。
本实施例适用的场景中,映射节点与归约节点是运行在不同CPU上的。多个映射节点可以运行在一个或多个CPU上;如果多个映射节点运行在同一个CPU上,该多个映射节点是作为线程运行的;如果多个映射节点运行各自运行在不同CPU上,该多个映射节点是作为进程运行的。类似地,多个归约节点可以运行在一个或多个CPU上;如果多个归约节点运行在同一个CPU上,该多个归约节点是作为线程运行的;如果多个归约节点运行各自运行在不同CPU上,该多个归约节点是作为进程运行的。
第三个动作,管理节点通过消息通信向存储池103的控制器申请可共享的远端共享分区;存储池103的控制器响应该申请,并划分了远端共享分区,存储池103的控制器通过消息通信向该管理节点反馈:该远端共享分区的分区名称、访问该远端共享分区的访问权限。管理节点再通知映射节点:该远端共享分区的分区名称,为该映射节点申请到的访问权限。管理节点再通知归约节点:该远端共享分区的分区名称,为该归约节点申请到的访问权限。
第四个动作,本实施例是对数据集执行映射/归约任务,在执行映射任务(Map任务)之前,将数据集划分成数据分片,可选地,具体划分规则可以根据任务需要和/或执行效率确定,例如将数据集按照16MB到64MB范围内的一个值或多个值进行数据分片的划分,在此不作为对本发明的限定。本实施例将一个数据分片作为一个映射节点的输入,使得映射节点对输入的数据分片执行映射任务。
基于如图1所示的架构,实施例一、实施例二讲解在该架构上执行映射/归约任务的方法;实施例三对实施例一和实施例二进行扩展,并从映射节点角度讲解在该架构上提供的数据处理方法;实施例四对实施例一和实施例二进行扩展,并从归约节点角度讲解在该架构上提供的数据处理方法;实施例五从映射节点角度讲解与实施例三提供的方法对应的数据处理装置;实施例六从归约节点角度讲解与实施例四提供的方法对应的数据处理装置;实施例七讲解可执行实施例三和实施例四提供的数据处理方法的计算机设备;实施例八讲解在在该架构上组建的系统。
实施例一
下面结合图2,详述下引入远端共享分区的情况下执行Map/Reduce任务的一种具体流程,该流程包括以下步骤:步骤S201、步骤S202、步骤S203、步骤S204、步骤S205和步骤S206。
为便于说明,本实施例以将数据集划分成两个数据分片(参见图2)为例对执行Map/Reduce任务的流程进行描述,因此根据数据集划分出的数据分片为两个仅是示意,不作为对实施例的限定。如图2所示,一个数据分片由第一映射节点处理,另一个数据分片由第二映射节点处理。本实施例中,管 理节点分配了三个归约任务,一个归约任务由一个归约节点处理,因此总共有三个归约节点,包括第一归约节点、第二归约节点和第三归约节点,图2示出了第一归约节点,图2未示第二归约节点和第三归约节点。
需说明的是,管理节点在确定两个映射节点和三个归约节点后,会将三个归约节点的标识、以及三个归约节点各自针对的键值(Key)通知两个映射节点。
参见图2,管理节点从存储池的控制器申请到的远端共享分区为6个。管理节点将这6个远端共享分区分配给两个映射节点和三个归约节点的具体规则如下:
对第一映射节点共享的远端共享分区包括远端共享分区1、远端共享分区2、远端共享分区3,并为第一映射节点分配了这三个远端共享分区的可读可写权限;
对第二映射节点共享的远端共享分区包括远端共享分区4、远端共享分区5、远端共享分区6,并为第二映射节点分配了这三个远端共享分区的可读可写权限;
对第一归约节点共享的远端共享分区包括远端共享分区1和远端共享分区4,并为第一归约节点分配远端共享分区1和远端共享分区4的可读权限;
对第二归约节点(图2未示)共享的远端共享分区包括远端共享分区2和远端共享分区5,并为第二归约节点分配了远端共享分区2和远端共享分区5的可读权限;
对第三归约节点(图2未示出)共享的远端共享分区包括远端共享分区3和远端共享分区6,并为第三归约节点分配了远端共享分区3和远端共享分区6的可读权限。
下面主要从第一映射节点和第一归约节点的角度解释本实施例执行Map/Reduce任务的具体流程。
步骤S201,第一映射节点,以可读可写访问权限挂载远端共享分区1、远端共享分区2、远端共享分区3,具体地,可以通过mount指令实现对远端共享分区的挂载。
第一映射节点对输入的数据分片执行Map任务。第一映射节点执行Map任务的过程中,按照时间顺序将执行Map任务所得的执行结果依次写入内存 缓冲池中的环形内存缓冲区,再将环形内存缓冲区中的该执行结果溢写到本地磁盘或者存储池中,图2仅示意了溢写到存储池中,每次溢写得到一个溢写文件。其中,具体的溢写过程是:将环形内存缓冲区中本次期望溢写的所述执行结果根据Reduce任务针对的键值(Key)进行拆分(Partition),拆分得到3个(等于归约节点的个数)数据片段,此处是以每次得到的所述执行结果都具有每个归约节点所针对的键值(Key)为例,当然,如果某次得到的所述执行结果不具有某个归约节点所针对的键值(Key),则不会拆分得到该个reduce节点对应的数据片段,这时拆分得到数据片段的个数少于3个;然后再根据键值(Key)对3个数据片段各自进行排序(sort),将3个排序后的数据片段以一个文件的形式溢写到本地磁盘或者存储池中。可见,每次溢写得到的溢写文件,均包含3个已排序的数据片段。
步骤S202,第一映射节点执行完Map任务,从本次磁盘或者存储池读取不同时间点溢写的溢写文件。
第一映射节点从每个溢写文件中获取3个归约节点各自处理的已排序的数据片段。再对具有单个归约节点针对的键值(Key)的所有已排序的数据片段进行排序(sort)、合并(merge),得到该单个归约节点处理的数据段;以此类推,第一映射节点可以基于所有溢写文件中的已排序的数据片段,分别排序、合并得到3个归约节点各自处理的数据段。举例说明,第一映射节点根据第一归约节点针对的键值(Key),从每个溢写文件包含的3个已排序的数据片段中分别获取由第一归约节点处理的数据片段,再根据一归约节点针对的键值(Key)对从每个溢写文件获取的数据片段进行排序(sort)、合并(merge),得到第一数据段,该第一数据段是指由第一归约节点处理的数据段。
第一映射节点将这三个数据段分别以文件的形式一一对应地存储至远端共享分区1、远端共享分区2、远端共享分区3这三个远端共享分区;具体地,将包含第一数据段(由第一归约节点处理)的文件存储至远端共享分区1,将包含由第二归约节点处理的数据段的文件存储至远端共享分区2,将包含由第三归约节点处理的数据段的文件存储至远端共享分区3。举例说明存储实现原理,因为3个归约节点各自针对的键值(Key),并且远端共享分区1、 远端共享分区2、远端共享分区3一一对应地被第一归约节点、第二归约节点和第三归约节点共享,从而第一映射节点能够根据三个数据段各自具有的键值(Key),确定三个数据段各自的远端共享分区。
第一映射节点将数据段以文件的形式存储至远端共享分区时,会为该文件设置文件名;并存储该远端共享分区的分区名称和该文件名,存储位置可以是第一映射节点所在计算机设备中的本地内存或本地磁盘,当然存储位置也可以是计算机设备外的其他存储介质。可选地,为不同远端共享分区中的文件设置的文件名可以不同,也可以相同;如果不同远端共享分区中的文件具有相同文件名,可以根据不同远端共享分区的分区名称的不同来区分相同文件名的两个文件各自由对应的归约节点。第一映射节点执行完Map任务后会向管理节点发送第一消息,该第一消息还携带有该第一映射节点的地址信息,该地址信息包括第一映射节点的地址和/或第一映射节点的身份标识,通过该第一消息通知管理节点:该地址信息指定的映射节点(第一映射节点)已执行完该管理节点分配的Map任务。
另外,第一映射节点执行完Map任务后,卸载远端共享分区1、远端共享分区2、远端共享分区3,具体地,第一映射节点可以通过unmount指令卸载这三个远端共享分区。这里所说的卸载,是指在第一映射节点中删除对应的分区设备,与上述的挂载是相反动作。
步骤S203,管理节点(图2未示出)接收到该第一消息,根据该第一消息中的地址和/或身份标识确定第一映射节点已执行完分配的Map任务;管理节点确定第一映射节点已执行完分配的Map任务后,向第一归约节点发送载有该第一映射节点的地址信息的第二消息。可替换地,也可以由第一归约节点每间隔特设时间主动向管理节点发送查询请求;管理节点在接收到第一映射节点执行完映射任务所触发的第一消息时,响应最新的查询请求,向第一归约节点发送载有该第一映射节点的地址信息的第二消息。
类似地,管理节点根据该第一消息确定第一映射节点已执行完分配的Map任务后,也会向第二归约节点和第三归约节点分别发送载有该第一映射节点的地址信息的消息。
步骤S204,第一归约节点接收管理节点发送的第二消息,根据第二消息获 知第一映射节点已执行完分配的Map任务。第一归约节点获知第一映射节点已执行完分配的Map任务后,根据该第一映射节点的地址信息生成数据请求消息,并向该第一映射节点发送数据请求消息,该数据请求消息载有第一归约节点的标识,该标识用于将第一归约节点与其他两个归约节点区分;例如:管理节点已预先对三个归约节点编号,第一归约节点的标识可以是指管理节点为第一归约节点设定的编号。
类似地，第二归约节点和第三归约节点根据管理节点发送的消息获知第一映射节点已执行完分配的Map任务后，分别向第一映射节点发送请求数据的消息。
步骤S205,第一映射节点接收第一归约节点发送的该数据请求消息。第一映射节点从该数据请求消息获取第一归约节点的标识,并将获取的标识与预先存储的3个归约节点的标识匹配,匹配识别出该数据请求消息是第一归约节点发送的;继而,第一映射节点响应该数据请求消息,并根据所述第一归约节点的标识生成响应消息;该响应消息携带有远端共享分区1的分区名称,该响应消息还携带有包含第一数据段的文件的文件名。第一映射节点向第一归约节点发送该响应消息。
类似地,第一映射节点在接收到第二归约节点和第三归约节点发送的请求数据的消息后,向第二归约节点反馈远端共享分区2的分区名称和包含由第二归约节点处理的数据段的文件的文件名,向第三归约节点反馈远端共享分区3的分区名称和包含由第三归约节点处理的数据段的文件的文件名。
步骤S206,第一归约节点接收第一映射节点发送的响应消息,从该响应消息获取远端共享分区1的分区名称。第一归约节点根据远端共享分区1的分区名称以只读权限挂载远端共享分区1,如通过mount指令实现远端共享分区1的挂载。
第一归约节点从该响应消息获取包含第一数据段的文件的文件名。第一归约节点与存储池的控制器进行消息通信;通过存储池的控制器,根据获取的文件名(即该响应消息携带的文件名)从远端共享分区1中找到包含第一数据段所在的文件,从该文件读取第一数据段。
与第一映射节点将3个归约节点处理的数据段一一对应地存储至远端共享分区1、远端共享分区2和远端共享分区3类似,第二映射节点执行完Map任务 后合并出3个归约节点各自处理的数据段,将包含由第一归约节点处理的数据段以文件的形式写入远端共享分区4,将包含由第二归约节点处理的数据段以文件的形式写入远端共享分区5,将包含由第三归约节点处理的数据段以文件的形式写入远端共享分区6。
第一归约节点挂载远端共享分区4,从远端共享分区4中的文件中读取由第一归约节点处理的数据段。继而,第一归约节点对从远端共享分区1读取的第一数据段和从远端共享分区4读取的数据段执行Reduce任务。
后续,第一归约节点可以将执行归约任务所得的执行结果合并(merge)后写入本地的存储介质(例如磁盘),也可以将所得的执行结果合并(merge)后写入存储池;当然,也可以将所得的执行结果合并(merge)后写入其他存储介质,此处不做限定。
第一归约节点从远端共享分区1、远端共享分区4读取完数据段后,卸载远端共享分区1、远端共享分区4,例如第一归约节点可以使用unmount指令卸载远端共享分区1、远端共享分区4。
与第一归约节点的工作原理类似地,第二归约节点根据第一映射节点反馈的远端共享分区2的分区名称挂载远端共享分区2,根据第二映射节点反馈的远端共享分区5的分区名称挂载远端共享分区5,分别从远端共享分区2和远端共享分区5读取由第二归约节点处理的数据段;第二归约节点读取完数据段后,卸载远端共享分区2和远端共享分区5;第二归约节点对从远端共享分区2和远端共享分区5分别读取的数据段执行Reduce任务,将执行结果合并(merge)后写入存储介质(例如本地磁盘、存储池)。
类似地,第三归约节点根据第一映射节点反馈的远端共享分区3的分区名称挂载远端共享分区3,根据第二映射节点反馈的远端共享分区6的分区名称挂载远端共享分区6,分别从远端共享分区3和远端共享分区6读取由第三归约节点处理的数据段;第三归约节点读取完数据段后,卸载远端共享分区3和远端共享分区6;第三归约节点对从远端共享分区3和远端共享分区6分别读取的数据段执行归约任务,将执行结果合并(merge)后写入存储介质(例如本地磁盘、存储池)。
本实施例中,映射节点将数据段存储在远端共享分区中,归约节点可直接 从远端共享分区获取到由自己处理的数据段,从而省去了现有技术中归约节点向映射节点请求数据段时先由映射节点从本地磁盘中读取数据段再通过TCP流向该归约节点发送读取的数据段这些步骤,有效节省了执行Map/Reduce任务所需的时间。如果远端共享分区是从内存池划分的,相对于现有技术,映射节点和归约节点访问远端共享分区来读/写数据的速度快于向本地磁盘读/写数据的速度,进一步缩短了执行映射/归约任务的时间。
实施例二
本实施例中,数据集是采用Hadoop分布式文件系统(Hadoop Distributed File System,简称HDFS)存储和管理的;因此,本实施例将数据集划分成的两个数据分片,图3示意为数据分片301和数据分片302,也存储在HDFS中。应知,根据数据集划分出的数据分片为两个仅是示意,划分出的数据分片可以是一个或多个,因此数据分片的个数不作为对实施例的限定。
管理节点(图3未示出)从空闲的工作节点中确定了两个映射节点,如图3示意为第一映射节点和第二映射节点,由第一映射节点对数据分片301执行Map任务,由第二映射节点对数据分片302执行Map任务;另外,管理节点从空闲的工作节点中确定了三个归约节点,如图3示意为第一归约节点、第二归约节点和第三归约节点;管理节点在确定这两个映射节点和这三个归约节点后,会将这三个归约节点的标识、以及这三个归约节点各自针对的键值(Key)通知这两个映射节点。
与实施例一类似地,管理节点从存储池的控制器申请了6个远端共享分区,包括远端共享分区1、远端共享分区2、远端共享分区3、远端共享分区4、远端共享分区5、远端共享分区6。存储池的控制器为这6个远端共享分区设置了共享权限,如下:其中,远端共享分区1对第一映射节点和第一归约节点共享,远端共享分区2对第一映射节点和第二归约节点共享,远端共享分区3对第一映射节点和第三归约节点共享,远端共享分区4对第二映射节点和第一归约节点共享,远端共享分区5对第二映射节点和第二归约节点共享,远端共享分区6对第二映射节点和第三归约节点共享。
管理节点还从存储池的控制器为每个映射节点和每个归约节点分别申请 了远端私有分区,并为第一映射节点分配了访问远端私有分区331的私有权限,为第二映射节点分配了访问远端私有分区332的私有权限,为第一归约节点分配了访问远端私有分区333的私有权限,为第二归约节点分配了访问远端私有分区334的私有权限,为第三归约节点分配了访问远端私有分区335的私有权限。远端私有分区是独享的,也即是非共享的,分配有私有权限的节点才能访问该远端私有分区,例如:为第一映射节点分配了远端私有分区331的私有权限,第一映射节点能够访问远端私有分区331,但其他映射节点或者归约节点是不能够访问远端私有分区331的。
实施例二执行Map/Reduce任务的流程实现与实施例一执行Map/Reduce任务的流程实现类似,但存在以下三个不同点,包括第一不同点、第二不同点和第三个不同点。
第一个不同点,第一映射节点挂载远端私有分区331。第一映射节点对数据分片301执行Map任务的过程中,并将执行Map任务的执行结果缓存到缓冲区,将该缓冲区缓存的执行结果溢写到远端私有分区331,每次溢写得到一个溢写文件。待第一映射节点对数据分片301执行完Map任务之后,第一映射节点是从远端私有分区331读取不同时间点溢写的溢写文件;基于读取的所有溢写文件中的数据段,合并(merge)得到由第一归约节点处理的第一数据段、由第二归约节点处理的数据段、由第三归约节点处理的数据段,此处实现合并(merge)的实现方式与实施例一中第一映射节点合并(merge)成3个数据段的实现方式同原理,此处实现合并(merge)的细节可参照实施例一中第一映射节点将所有溢写文件包含的已排序的数据片段合并(merge)成3个归约节点各自处理的数据段的具体实现方式,在此不再赘述。第一映射节点将包含第一数据段的文件存储至远端共享分区1,将包含由第二归约节点处理的数据段的文件存储至远端共享分区2,将包含由第三归约节点处理的数据段的文件存储至远端共享分区3。
类似地,第二映射节点挂载远端私有分区332;第二映射节点使用远端私有分区332存储溢写文件以及从远端私有分区332读取溢写文件的实现方式与第一映射节点类似,在此不再赘述。另外,第二映射节点将包含由第一归约节点处理的数据段的文件存储至远端共享分区4,将包含由第二归约节点 处理的数据段的文件存储至远端共享分区5,将包含由第三归约节点处理的数据段的文件存储至远端共享分区6。
对于大数据应用的数据集,数据集的数据量非常大,相应地需要更大的存储空间来存储执行映射任务所得的溢写文件;这种情情况下,本地磁盘的存储空间有限,但存储池具有足够大的存储空间,可利用从存储池划分出的远端私有分区存储溢写文件,由于远端私有分区是独享的,保证了溢写文件不被非法修改,保证了溢写文件的安全性。如果远端私有分区具体是从内存池划分的,相对于现有技术,映射节点和归约节点访问私有共享分区来读/写数据的速度快于访问本地磁盘来读/写数据的速度,进一步缩短了执行映射/归约任务的时间。
第二个不同点,第一归约节点挂载远端私有分区333。第一归约节点挂载远端共享分区1和远端共享分区4,分别从远端共享分区1和远端共享分区4读取由第一归约节点处理的数据段;第一归约节点首先使用本地内存存储从远端共享分区1和远端共享分区4读取的数据段,如果本地内存不足,再将剩余的数据段(包括后续从远端共享分区1和远端共享分区4读取的数据段)存储在远端私有分区333。可选地,也可以一开始就使用远端私有分区333存储从远端共享分区1和远端共享分区4读取的数据段,而不使用本地内存存储。
类似地,第二归约节点挂载远端私有分区334。第二归约节点挂载远端共享分区2和远端共享分区5,分别从远端共享分区2和远端共享分区5读取由第二归约节点处理的数据段;第二归约节点仍首先使用本地内存存储从远端共享分区2和远端共享分区5读取的数据段,如果内存不足,再将剩余的数据段(包括后续从远端共享分区2和远端共享分区5读取的数据段)存储在第二归约节点的远端私有分区334。可选地,也可以一开始就使用远端私有分区334存储从远端共享分区2和远端共享分区5读取的数据段,而不使用本地内存存储。
类似地,第三归约节点挂载远端私有分区335。第三归约节点挂载远端共享分区3和远端共享分区6,分别从远端共享分区3和远端共享分区6读取由第三归约节点处理的数据段;第三归约节点仍首先使用本地内存存储从远端共享分区3和远端共享分区6读取的数据段,如果本地内存不足,再将剩余 的数据段(包括后续从远端共享分区3和远端共享分区6读取的数据段)存储在第三归约节点的远端私有分区335。可选地,也可以一开始就使用远端私有分区335存储从远端共享分区2和远端共享分区5读取的数据段,而不使用本地内存存储。
本实施例使用远端私有分区存储数据段,能够增大归约节点可处理的数据段的数据量,由于远端私有分区是独享的,避免数据段被非法修改。如果远端私有分区具体是从内存池划分的,相对于现有技术,映射节点和归约节点访问私有共享分区来读/写数据的速度快于访问本地磁盘来读/写数据的速度,进一步缩短了执行映射/归约任务的时间。
第三个不同点,第一归约节点将对数据段执行Reduce任务所得的执行结果,以文件的形式存储到HDFS中的存储区域321,存储区域321为HDFS中的一块存储空间。第二归约节点将对数据段执行Reduce任务所得的执行结果,以文件的形式存储到HDFS中的存储区域322,存储区域322为HDFS中的一块存储空间。第三归约节点将对数据段执行Reduce任务所得的执行结果,以文件的形式存储到HDFS中的存储区域323,存储区域323为HDFS中的一块存储空间。
可选地,作为实施例二中私有存储分区331的替代实现,可以利用远端共享分区1替代远端私有分区331来存储第一映射节点溢写出的溢写文件,这时,待第一映射节点执行完Map任务,是从远端共享分区1获取所有溢写文件。作为实施例二中私有存储分区332的替代实现,可以利用远端共享分区2替代远端私有分区332来存储第一映射节点溢写出的溢写文件,这时,待第一映射节点执行完Map任务,是从远端共享分区2获取所有溢写文件。
当然,也可以使用存储池中的其他存储空间替换实现实施例二中的远端私有分区在实施例二中所起的作用。
可选地,归约节点从远端共享分区读取完数据数后,通知存储池的控制器回收该远端共享分区;存储池的控制器回收该远端共享分区,后续可以将该远端共享分区分配给其他任务节点(例如其他执行Map/Reduce任务的映射节点)使用,提高了远端共享分区的利用率。类似地,映射节点从远端私有分 区读取完溢写文件后,通知存储池的控制器回收该远端私有分区;存储池的控制器回收该远端私有分区,后续可以将该远端私有分区分配给其他任务节点(例如其他执行Map/Reduce任务的映射节点)使用,提高了远端私有分区的利用率。类似地,归约节点从远端私有分区读取完数据段后,通知存储池的控制器回收该远端私有分区;存储池的控制器回收该远端私有分区,后续可以将该远端私有分区分配给其他任务节点(例如其他执行Map/Reduce任务的映射节点)使用,提高了远端私有分区的利用率。
可选地,对实施例一和/或实施例二中的溢写过程做一可选细化,对该细化详述如下:
映射节点执行Map任务的过程中,将该Map任务的执行结果按照时间顺序缓存到环形内存缓冲区中,当该环形内存缓冲区的使用率达到80%时,触发对该环形内存缓冲区缓存的Map任务的执行结果的一次溢写,溢写过程参见实施例一和实施例二中的相应描述,在此不再赘述。溢写过程中,因该环形内存缓冲区仍有20%未被使用,映射节点可继续向该未被使用的缓冲区写入Map任务的执行结果,不会停止映射节点继续输出Map任务的执行结果,从而达到不停止执行Map任务的目的。完成溢写后,原用于缓存Map任务的执行结果的缓冲区,能够重新用于缓存Map任务的执行结果。
值得说明的是,存储池的存储容量可变,并且通常具有海量内存空间,通过单个文件系统来管理存储池是不实际的,从而在本发明实施例中存储池的控制器根据需求从存储池中划分出分区,包括远端共享分区、远端私有分区;每个分区单独设置文件系统,并为每个分区命名,使得每个分区都具有用于相互区分的分区名称;分区可像其他存储介质一样进行挂载/卸载,如像在Linux系统中挂载硬件设备并将该硬件设备映射成系统中的文件一样;本发明实施例能够直接根据分区名称挂载存储池中的分区,访问该分区中的文件,访问权限可在挂载时通过参数设定。
可选地,如果映射节点运行在解耦后的CPU池中的CPU上,但归约节点不 是运行在解耦后的CPU池中的CPU上,映射节点仍然将数据段存储至远端共享分区;继而,归约节点与映射节点通信,建立TCP流;继而,映射节点从远端共享分区读取由该归约节点处理的数据段,并通过该TCP流将读取的数据段发送至该归约节点。从而,本实施例提供的Map/Reduce任务的实现方式可以对现有技术执行Map/Reduce任务的实现方式兼容,只是存储数据段的位置不同,如果远端共享分区是从内存池划分出的,本发明实施例访问远端共享分区来读取数据段的速率大于现有技术访问本地磁盘来读取数据段的速率,可以减少执行Map/Reduce任务的执行时间。
实施例三
实施例三对上述两个实施例作相应扩展,从映射节点的角度提供了数据处理方法的基础工作流程;该基础工作流程所适用的系统的系统架构参见图1所示的系统架构,对该系统架构的具体描述可参见上述对图1所示系统架构的相应描述。该系统包括CPU池和存储池,所述CPU池与所述存储池通信连接。
所述CPU池包含至少两个CPU,所述CPU池上运行有至少一个映射节点和至少一个归约节点;可见,实施例一和实施例二中映射节点为两个仅是举例,实施例一和实施例二中归约节点为三个也仅是举例。所述至少一个映射节点包括第一映射节点,第一映射节点为所述至少一个映射节点中的任一个映射节点;所述至少一个归约节点包括第一归约节点,第一归约节点为所述至少一个规约节点中的任一个归约节点;所述第一映射节点和所述第一归约节点运行在所述CPU池中的不同CPU上,所述第一映射节点和所述第一归约节点之间的消息通信是通过CPU池的控制器转发消息(例如数据请求消息)来实现的。
所述存储池包括的远端共享分区被所述第一映射节点和所述第一归约节点共享,可选地,所述存储池为内存池。具体地,管理节点为第一映射节点和第一归约节点从存储池的控制器申请到远端共享分区以及该远端共享分区的访问权限,并将远端共享分区分配给第一映射节点和第一归约节点,分配的实现方式可参见实施例一和实施例二对分配远端共享分区的相关描述。此 处的远端共享分区,可以是对所述至少一个映射节点中的所有映射节点共享,还可以是对所述至少一个归约节点中的所有归约节点共享,当然也可以是仅对第一映射节点和第一归约节点共享的,但至少是对第一映射节点和第一归约节点共享的。本实施例中,所述第一映射节点和所述第一归约节点,均是通过挂载方式访问所述远端共享分区。
如图4所示的基础工作流程是从第一映射节点的角度给出的,图4提供的基础工作流程包括:步骤A401、步骤A402、步骤A403、步骤A404和步骤A405。
首先将数据集划分为一个或多个数据分片,具体划分方式参加上述对第四个动作(为在该架构上执行映射/归约任务所作的第四个动作)的相关描述;每个数据分片的大小可以不同,也可以相同。一个数据分片作为一个映射节点的输入,一个映射节点对一个数据分片执行映射任务。
步骤A401,所述第一映射节点对数据分片执行映射任务(即实施例一和实施例二所述的Map任务),根据所述映射任务的执行结果获取至少一个数据段,所述至少一个数据段中的每个数据段由一个与之相对应的归约节点处理,其中,所述至少一个数据段包括第一数据段,第一数据段是指由所述第一归约节点处理的数据段;
步骤A402,所述第一映射节点将所述第一数据段以文件格式存储在所述远端共享分区中;具体地,所述第一映射节点在远端共享分区新建一个文件,并将第一数据段写入该文件。
步骤A403,所述第一映射节点接收所述第一归约节点发送的数据请求消息,所述数据请求消息包含所述第一归约节点的标识;
步骤A404,所述第一映射节点响应所述数据请求消息,并根据所述第一归约节点的标识生成响应消息,所述响应消息包括存储所述第一数据段的远端共享分区的分区名称和包含所述第一数据段的文件的文件名;如果该远端共享分区是对所述至少一个映射节点中多个映射节点共享的和/或是对所述至少一个归约节点中的多个归约节点共享的,该远端共享分区存储的各个包含数据段的文件的文件名是不同的,便于通过文件名区分包含数据段的文件;
步骤A405,所述第一映射节点向所述第一归约节点反馈所述响应消息, 使得所述第一归约节点根据所述响应信息从具有所述分区名称的远端共享分区中具有所述文件名的文件获取所述第一数据段,对所述第一数据段执行归约任务(即实施例一和实施例二所述的Reduce任务)。
对于第一映射节点执行步骤A401、步骤A402、步骤A403、步骤A404、步骤A405的实现细节,可参见实施例一和实施例二中第一映射节点执行相关步骤的对应描述。
可选地,从所述存储池为各个映射节点分别分配了非共享的存储空间,此处定义各个映射节点节点私有的存储空间为第一远端私有分区,因此各个映射节点均具有独享的第一远端私有分区,可见存储池包括被第一映射节点独享的第一远端私有分区。下面结合第一映射节点的第一远端私有分区,从如何得到第一数据段的角度对步骤A401做一可选细化,参见图5,所述第一映射节点根据所述映射任务的执行结果获取至少一个数据段,具体包括步骤A501和步骤A502。
步骤A501,将所述第一映射节点执行映射任务时溢写的溢写文件存储至所述第一映射节点的第一远端私有分区,其中,单个所述溢写文件包括所述第一映射节点执行所述映射任务过程中单次从缓存所述映射任务的执行结果的缓冲区中溢写出的数据;
步骤A502,对所述第一映射节点的第一远端私有分区中存储的多个所述溢写文件根据所述至少一个归约节点中不同所述归约节点所对应的键值分别合并,合并得到所述至少一个数据段。
对于第一映射节点执行步骤A501和、步骤A502的实现细节,可参见实施例二中第一映射节点使用远端私有分区331存储溢写文件的对应描述。
可选地,使用远端共享分区存储溢写文件来替代使用第一远端私有分区存储溢写文件,使用远端共享分区存储溢写文件的实现方式与使用第一远端私有分区存储溢写文件的实现方式类似,在此不再赘述。
可选地,所述远端共享分区的个数等于所述映射节点的个数与所述归约节点的个数的乘积,每个所述远端共享分区被一个所述映射节点和一个所述归约节点共享;例如,实施例一和实施例二分别示意为6个远端共享分区,包括远端共享分区1、远端共享分区2、远端共享分区3、远端共享分区4、 远端共享分区5、远端共享分区6;远端共享分区1对第一映射节点和第一归约节点共享,远端共享分区2对第一映射节点和第二归约节点共享,远端共享分区3对第一映射节点和第三归约节点共享,远端共享分区4对第二映射节点和第一归约节点共享,远端共享分区5对第二映射节点和第二归约节点共享,远端共享分区6对第二映射节点和第三归约节点共享。
基于此处的远端共享分区,对如何实现将所述数据段在对应的远端共享分区存储对步骤A402做一可选细化,所述第一映射节点将所述数据段以文件格式存储在远端共享分区中,包括:
所述第一映射节点将包含所述第一数据段的文件,存储至被所述第一映射节点和所述第一归约节点共享的远端共享分区。
所述第一映射节点将包含所述第一数据段的文件存储至被所述第一映射节点和所述第一归约节点共享的远端共享分区的具体实现细节,可以参见实施例一和实施例二中所述第一映射节点将包含所述第一数据段存储在远端共享分区1中的实现细节。由于是一个远端共享分区存储一个包含数据段的文件,因此,不同文件的文件名可以相同或者不同。
实施例四
实施例四对上述实施例一和实施例二作相应扩展,从归约节点的角度提供了数据处理方法的基础工作流程;该基础工作流程所适用的系统与实施例三从映射节点的角度提供的数据处理方法的基础工作流程所适用的系统为同一个系统,在此不再赘述。
如图6所示的基础工作流程是从第一归约节点的角度给出的,图6提供的基础工作流程包括:步骤A601、步骤A602和步骤A603。
步骤A601,所述第一归约节点获知所述第一映射节点将第一数据段存储到所述远端共享分区后,向所述第一映射节点发送数据请求消息,所述数据请求消息包含所述第一归约节点的标识,其中,所述第一数据段是指在所述第一映射节点根据其执行映射任务所得的执行结果获取到的至少一个数据段中由所述第一归约节点处理的数据段;
步骤A602,所述第一归约节点接收所述第一映射节点反馈的响应消息, 所述响应消息包括存储所述第一数据段的远端共享分区的分区名称和包含所述第一数据段的文件的文件名,所述响应消息由所述第一映射节点在响应所述数据请求消息时根据所述第一归约节点的标识生成;
步骤A603,所述第一归约节点根据所述响应信息从所述远端共享分区存储的所述文件中获取所述第一数据段,对所述第一数据段执行归约任务(即实施例一和实施例二所述的Reduce任务)。
对于第一归约节点执行步骤A601、步骤A602和步骤A603的实现细节,可参见实施例一、实施例二和实施例三中第一归约节点执行相关步骤的对应描述。
可选地,所述远端共享分区的个数等于所述映射节点的个数与所述归约节点的个数的乘积,每个所述远端共享分区被一个所述映射节点和一个所述归约节点共享;例如,实施例一和实施例二分别示意为6个远端共享分区,包括远端共享分区1、远端共享分区2、远端共享分区3、远端共享分区4、远端共享分区5、远端共享分区6;远端共享分区1对第一映射节点和第一归约节点共享,远端共享分区2对第一映射节点和第二归约节点共享,远端共享分区3对第一映射节点和第三归约节点共享,远端共享分区4对第二映射节点和第一归约节点共享,远端共享分区5对第二映射节点和第二归约节点共享,远端共享分区6对第二映射节点和第三归约节点共享。
基于此处的远端共享分区,对步骤A603中所述第一归约节点根据所述响应信息从所述远端共享分区存储的所述文件中获取所述第一数据段,参见图7,包括步骤A701和步骤A702。
步骤A701,所述第一归约节点根据所述响应信息中的所述分区名称,确定被所述第一映射节点和所述第一归约节点共享的远端共享分区;
步骤A702,所述第一归约节点从被所述第一映射节点和所述第一归约节点共享的远端共享分区中,根据所述响应信息中的所述文件名读取所述第一数据段。
对于第一归约节点执行步骤A701和步骤A702的实现细节,可参见实施例一和实施例二中第一归约节点根据所述响应信息从被所述第一映射节点和所述第一归约节点共享的远端共享分区中读取第一数据段的相关细节描述。
可选地,从所述存储池为各个归约节点分别分配了非共享的存储空间,此处定义各个归约节点节点私有的存储空间为第二远端私有分区,因此各个归约节点均具有独享的第二远端私有分区,可见存储池包括被第一归约节点独享的第二远端私有分区。基于使用第二远端私有分区存储数据段,所述方法包括:
所述第一归约节点在将从所述远端共享分区读取的所述第一数据段存储在本地内存的过程中,在所述本地内存的使用率达到预设使用率时,将后续从所述远端共享分区读取的所述第一数据段存储至所述第一归约节点的第二远端私有分区。
可选地,所述第一归约节点通过挂载方式访问所述远端共享分区。可选地,所述存储池为内存池。
实施例四从第一归约节点提供的方法流程,是与实施例三从第一映射节点提供的方法流程对应的,此处不再详述,具体描述参见实施例一、实施例二、实施例三。
实施例五
实施例五提供的数据处理装置,包含可实现实施例三提供的方法流程的装置单元;鉴于篇幅,此处对实施例五提供的各功能单元所执行的动作,不再做具体描述,可直接参见实施例一、实施例二、实施例三提供的方法流程中提供的对应动作描述。
参见图8,本实施例提供的数据处理装置801,所适用的系统包括中央处理器CPU池和存储池,所述CPU池与所述存储池通信连接;所述CPU池包含至少两个CPU,所述CPU池上运行有至少一个映射节点和至少一个归约节点,所述至少一个映射节点包括第一映射节点,所述至少一个归约节点包括第一归约节点,所述第一映射节点和所述第一归约节点运行在所述CPU池中的不同CPU上;所述存储池包括的远端共享分区被所述第一映射节点和所述第一归约节点共享;所述数据处理装置801作为所述第一映射节点,所述数据处理装置801包括:
获取单元802,用于对数据分片执行映射任务,根据所述映射任务的执 行结果获取至少一个数据段,所述至少一个数据段中的每个数据段由一个与之相对应的归约节点处理,其中,所述至少一个数据段包括第一数据段,第一数据段是指由所述第一归约节点处理的数据段;
存储单元803,用于将所述第一数据段以文件格式存储在所述远端共享分区中;
响应单元804,用于接收所述第一归约节点发送的数据请求消息,所述数据请求消息包含所述第一归约节点的标识,响应所述数据请求消息并根据所述第一归约节点的标识生成响应消息,所述响应消息包括存储所述第一数据段的远端共享分区的分区名称和包含所述第一数据段的文件的文件名,向所述第一归约节点反馈所述响应消息。
可选地,所述存储池包括被第一映射节点独享的第一远端私有分区;所述获取单元802,用于根据所述映射任务的执行结果获取至少一个数据段,具体包括:
所述获取单元802,用于将所述第一映射节点执行映射任务时溢写的溢写文件存储至所述第一映射节点的第一远端私有分区,其中,单个所述溢写文件包括所述第一映射节点执行所述映射任务过程中单次从缓存所述映射任务的执行结果的缓冲区中溢写出的数据;
所述获取单元802,用于对所述第一映射节点的第一远端私有分区中存储的多个所述溢写文件根据所述至少一个归约节点中不同所述归约节点所对应的键值分别合并,合并得到所述至少一个数据段。
可选地,所述远端共享分区的个数等于所述映射节点的个数与所述归约节点的个数的乘积,每个所述远端共享分区被一个所述映射节点和一个所述归约节点共享;所述存储单元803,用于将所述数据段以文件格式存储在远端共享分区中,具体包括:
所述存储单元803,用于将包含所述第一数据段的文件,存储至被所述第一映射节点和所述第一归约节点共享的远端共享分区。
可选地,所述第一映射节点通过挂载方式访问所述远端共享分区。
可选地,所述存储池为内存池。
实施例六
实施例六提供的数据处理装置,包含可以实现实施例四提供的方法流程的装置单元;鉴于篇幅,此处对实施例五提供的各功能单元所执行的动作,不再做具体描述,可直接参见实施例一、实施例二、实施例三、实施例四提供的方法流程中提供的对应动作描述。
参见图9,本实施例提供的数据处理装置901,所适用的系统包括中央处理器CPU池和存储池,所述CPU池与所述存储池通信连接;所述CPU池包含至少两个CPU,所述CPU池上运行有至少一个映射节点和至少一个归约节点,所述至少一个映射节点包括第一映射节点,所述至少一个归约节点包括第一归约节点,所述第一映射节点和所述第一归约节点运行在所述CPU池中的不同CPU上;所述存储池包括的远端共享分区被所述第一映射节点和所述第一归约节点共享。所述数据处理装置901作为所述第一归约节点;所述数据处理装置901包括:
请求单元902,用于获知所述第一映射节点将第一数据段存储到所述远端共享分区后,向所述第一映射节点发送数据请求消息,所述数据请求消息包含所述第一归约节点的标识,其中,所述第一数据段是指在所述第一映射节点根据其执行映射任务所得的执行结果获取到的至少一个数据段中由所述第一归约节点处理的数据段;
接收单元903,用于接收所述第一映射节点反馈的响应消息,所述响应消息包括存储所述第一数据段的远端共享分区的分区名称和包含所述第一数据段的文件的文件名,所述响应消息由所述第一映射节点在响应所述数据请求消息时根据所述第一归约节点的标识生成;
执行单元904,用于根据所述响应信息从所述远端共享分区存储的所述文件中获取所述第一数据段,对所述第一数据段执行归约任务。
可选地,所述远端共享分区的个数等于所述映射节点的个数与所述归约节点的个数的乘积,每个所述远端共享分区被一个所述映射节点和一个所述归约节点共享;所述执行单元904,用于根据所述响应信息从所述远端共享分区存储的所述文件中获取所述第一数据段,包括:
所述执行单元904,用于根据所述响应信息中的所述分区名称,确定被 所述第一映射节点和所述第一归约节点共享的远端共享分区;
所述执行单元904,用于从被所述第一映射节点和所述第一归约节点共享的远端共享分区中,根据所述响应信息中的所述文件名读取所述第一数据段。
可选地,所述存储池包括被第一归约节点独享的第二远端私有分区;所述归约节点包括:
转存储单元905,用于在将从所述远端共享分区读取的所述第一数据段存储在本地内存的过程中,在所述本地内存的使用率达到预设使用率时,将后续从所述远端共享分区读取的所述第一数据段存储至所述第一归约节点的第二远端私有分区。
可选地,所述第一归约节点通过挂载方式访问所述远端共享分区。
可选地,所述存储池为内存池。
实施例七
实施例七提供了一种执行实施例三和/或实施例四提供的方法步骤的硬件设备,参见图10,该硬件设备为计算机设备1000;图10所示的计算机设备1000为本发明实施例上述的已解耦的计算机设备,计算机设备1000包括处理器1001与存储器1002,处理器1001与存储器1002通过总线1003连接;
所述存储器1002用于存储计算机指令，当所述计算机设备1000运行时，所述处理器1001执行所述存储器1002存储的所述计算机指令，以使所述计算机设备1000执行实施例三和/或实施例四提供的数据处理方法。该数据处理方法中各个步骤的具体实现，参见实施例一、实施例二、实施例三、实施例四对各个步骤的对应描述，在此不再赘述。
其中,处理器1001可以采用通用的中央处理器(Central Processing Unit,CPU),微处理器,应用专用集成电路(Application Specific Integrated Circuit,ASIC),或者一个或多个集成电路,用于执行相关程序,以实现上述方法实施例所提供的技术方案。当然,处理器1001可以是CPU池中的CPU。
其中,存储器1002可以是只读存储器(Read Only Memory,ROM),静态存储设备,动态存储设备或者随机存取存储器(Random Access Memory, RAM)。存储器1002存储实现上述方法实施例提供的技术方案的程序代码,还可以存储操作系统程序。在通过软件或者固件来实现上述方法实施例提供的技术方案时,由处理器1001来执行。存储器1002可以是存储池中的存储介质,也可以是本地的存储介质,例如本地磁盘。
其中,总线1003可包括一通路,用于在各个部件(例如处理器1001、存储器1002、输入/输出接口1005、通信接口1004)之间传送信息。
其中,输入/输出接口1005,输入/输出接口1005用于接收输入的数据和信息,输出操作结果等数据。
其中,通信接口1004使用例如但不限于收发器一类的收发装置,来实现处理器1001与其他设备或通信网络之间的网络通信;可选地,通信接口1004可以是用于接入网络的各种接口,如用于接入以太网的以太网接口,该以太网接口包括但不限于RJ-45接口、RJ-11接口、SC光纤接口、FDDI接口、AUI接口、BNC接口和Console接口等。
输入/输出接口1005和通信接口1004可以是本地的,也可以是图1中I/O池104中的。
应注意,也许使用处理器1001、存储器1002以及总线1003即可实现上述方法实施例;但是在不同应用场合实现上述方法实施例时,本领域的技术人员应当明白,还可能需要适合在该应用场合实现上述方法实施例所必须的其他器件,例如通信接口1004、输入/输出接口1005。
实施例八
实施例八提供一种系统,为实施例三、实施例四提供的数据处理方法所适用的系统,可参见实施例三、实施例四对系统的细节描述,在此不再赘述。
此处的第一映射节点与上述方法实施例(实施例一、实施例二、实施例三、实施例四)提供的第一映射节点,具有的功能相同,同原理地执行动作;此处的第一归约节点与上述方法实施例(实施例一、实施例二、实施例三、实施例四)提供的第一归约节点,具有的功能相同,同原理地执行动作;鉴于篇幅,此处对实施例八中第一映射节点和第一归约节点所具体的功能和可以执行的工作,不再做具体描述,可直接参见实施例一、实施例二、实施例 三、实施例四提供的方法流程中提供的对应动作描述。可选地,第一映射节点由实施例五的数据处理装置实现,第一归约节点由实施例六的数据处理装置实现。
所述第一映射节点,用于对数据分片执行映射任务,根据所述映射任务的执行结果获取至少一个数据段,所述至少一个数据段中的每个数据段由一个与之相对应的归约节点处理,其中,所述至少一个数据段包括第一数据段,第一数据段是指由所述第一归约节点处理的数据段;将所述第一数据段以文件格式存储在所述远端共享分区中;响应所述数据请求消息,并根据所述第一归约节点的标识生成响应消息,所述响应消息包括存储所述第一数据段的远端共享分区的分区名称和包含所述第一数据段的文件的文件名,向所述第一归约节点反馈所述响应消息;
所述第一归约节点,用于获知所述第一映射节点将所述第一数据段存储到所述远端共享分区后,向所述第一映射节点发送数据请求消息,所述数据请求消息包含所述第一归约节点的标识;接收所述第一映射节点反馈的响应消息,根据所述响应信息从所述远端共享分区存储的所述文件中获取所述第一数据段,对所述第一数据段执行归约任务。
可选地,所述存储池包括被第一映射节点独享的第一远端私有分区;
所述第一映射节点,用于根据所述映射任务的执行结果获取至少一个数据段,具体包括:所述第一映射节点,用于将所述第一映射节点执行映射任务时溢写的溢写文件存储至所述第一映射节点的第一远端私有分区,其中,单个所述溢写文件包括所述第一映射节点执行所述映射任务过程中单次从缓存所述映射任务的执行结果的缓冲区中溢写出的数据;对所述第一映射节点的第一远端私有分区中存储的多个所述溢写文件根据所述至少一个归约节点中不同所述归约节点所对应的键值分别合并,合并得到所述至少一个数据段。
可选地,所述远端共享分区的个数等于所述映射节点的个数与所述归约节点的个数的乘积,每个所述远端共享分区被一个所述映射节点和一个所述归约节点共享;
所述第一映射节点,用于将所述数据段以文件格式存储在远端共享分区中,具体包括:所述第一映射节点,用于将包含所述第一数据段的文件,存 储至被所述第一映射节点和所述第一归约节点共享的远端共享分区;
所述第一归约节点,用于根据所述响应信息从所述远端共享分区存储的所述文件中获取所述第一数据段,具体为:所述第一归约节点,用于根据所述响应信息中的所述分区名称,确定被所述第一映射节点和所述第一归约节点共享的远端共享分区,从被所述第一映射节点和所述第一归约节点共享的远端共享分区中,根据所述响应信息中的所述文件名读取所述第一数据段。
可选地,所述存储池包括被第一归约节点独享的第二远端私有分区;
所述第一归约节点,还用于在将从所述远端共享分区读取的所述第一数据段存储在本地内存的过程中,在所述本地内存的使用率达到预设使用率时,将后续从所述远端共享分区读取的所述第一数据段存储至所述第一归约节点的第二远端私有分区。
可选地,所述第一映射节点,用于通过挂载方式访问所述远端共享分区;
所述第一归约节点,用于通过挂载方式访问所述远端共享分区。
应当理解,尽管在上述实施例中可能采用术语“第一”、“第二”、“第三”等来描述各个单元、存储消息、归约节点、映射节点,例如描述“第一归约节点”、“第二归约节点”、“第三归约节点”,但不应限于这些术语,并且“第一”、“第二”、“第三”等术语仅用于相互区分,并不代表它们之间存在顺序的关系;例如“第一归约节点”、“第二归约节点”并不代表特指的归约节点,也不代表它们之间存在顺序的关系,“第一”和“第二”仅用来将比较输入端口彼此区分开,在不脱离本发明实施例范围的情况下,可以对“第一归约节点”、“第二归约节点”互换名称,或者将“第一归约节点”改称为“第四归约节点”;因此,在本发明实施例中,对术语“第一”、“第二”等不做限制。
最后应说明的是:以上实施例仅用以说明本发明的技术方案,而非对其限制;尽管参照前述实施例对本发明进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本发明各实施例技术方案的保护范围。

Claims (27)

  1. 一种数据处理方法,其特征在于,所适用的系统包括中央处理器CPU池和存储池,所述CPU池与所述存储池通信连接;所述CPU池包含至少两个CPU,所述CPU池上运行有至少一个映射节点和至少一个归约节点,所述至少一个映射节点包括第一映射节点,所述至少一个归约节点包括第一归约节点,所述第一映射节点和所述第一归约节点运行在所述CPU池中的不同CPU上;所述存储池包括的远端共享分区被所述第一映射节点和所述第一归约节点共享;所述方法包括:
    所述第一映射节点对数据分片执行映射任务,根据所述映射任务的执行结果获取至少一个数据段,所述至少一个数据段中的每个数据段由一个与之相对应的归约节点处理,其中,所述至少一个数据段包括第一数据段,第一数据段是指由所述第一归约节点处理的数据段;
    所述第一映射节点将所述第一数据段以文件格式存储在所述远端共享分区中;
    所述第一映射节点接收所述第一归约节点发送的数据请求消息,所述数据请求消息包含所述第一归约节点的标识;
    所述第一映射节点响应所述数据请求消息,并根据所述第一归约节点的标识生成响应消息,所述响应消息包括存储所述第一数据段的远端共享分区的分区名称和包含所述第一数据段的文件的文件名;
    所述第一映射节点向所述第一归约节点反馈所述响应消息。
  2. 根据权利要求1所述的方法,其特征在于,所述存储池包括被第一映射节点独享的第一远端私有分区,所述第一映射节点根据所述映射任务的执行结果获取至少一个数据段,具体包括:
    将所述第一映射节点执行映射任务时溢写的溢写文件存储至所述第一映射节点的第一远端私有分区,其中,单个所述溢写文件包括所述第一映射节点执行所述映射任务过程中单次从缓存所述映射任务的执行结果的缓冲区中溢写出的数据;
    对所述第一映射节点的第一远端私有分区中存储的多个所述溢写文件根据所述至少一个归约节点中不同所述归约节点所对应的键值分别合并,合 并得到所述至少一个数据段。
  3. 根据权利要求1或2所述的方法,其特征在于,所述远端共享分区的个数等于所述映射节点的个数与所述归约节点的个数的乘积,每个所述远端共享分区被一个所述映射节点和一个所述归约节点共享;所述第一映射节点将所述数据段以文件格式存储在远端共享分区中,包括:
    所述第一映射节点将包含所述第一数据段的文件,存储至被所述第一映射节点和所述第一归约节点共享的远端共享分区。
  4. 根据权利要求1至3任一项所述的方法,其特征在于,所述第一映射节点通过挂载方式访问所述远端共享分区。
  5. 根据权利要求1至4任一项所述的方法,其特征在于,所述存储池为内存池。
  6. 一种数据处理装置,其特征在于,所适用的系统包括中央处理器CPU池和存储池,所述CPU池与所述存储池通信连接;所述CPU池包含至少两个CPU,所述CPU池上运行有至少一个映射节点和至少一个归约节点,所述至少一个映射节点包括第一映射节点,所述至少一个归约节点包括第一归约节点,所述第一映射节点和所述第一归约节点运行在所述CPU池中的不同CPU上;所述存储池包括的远端共享分区被所述第一映射节点和所述第一归约节点共享;所述装置作为所述第一映射节点,所述装置包括:
    获取单元,用于对数据分片执行映射任务,根据所述映射任务的执行结果获取至少一个数据段,所述至少一个数据段中的每个数据段由一个与之相对应的归约节点处理,其中,所述至少一个数据段包括第一数据段,第一数据段是指由所述第一归约节点处理的数据段;
    存储单元,用于将所述第一数据段以文件格式存储在所述远端共享分区中;
    响应单元,用于接收所述第一归约节点发送的数据请求消息,所述数据请求消息包含所述第一归约节点的标识,响应所述数据请求消息并根据所述第一归约节点的标识生成响应消息,所述响应消息包括存储所述第一数据段的远端共享分区的分区名称和包含所述第一数据段的文件的文件名,向所述第一归约节点反馈所述响应消息。
  7. 根据权利要求6所述的数据处理装置,其特征在于,所述存储池包 括被第一映射节点独享的第一远端私有分区;所述获取单元,用于根据所述映射任务的执行结果获取至少一个数据段,具体包括:
    所述获取单元,用于将所述第一映射节点执行映射任务时溢写的溢写文件存储至所述第一映射节点的第一远端私有分区,其中,单个所述溢写文件包括所述第一映射节点执行所述映射任务过程中单次从缓存所述映射任务的执行结果的缓冲区中溢写出的数据;
    所述获取单元,用于对所述第一映射节点的第一远端私有分区中存储的多个所述溢写文件根据所述至少一个归约节点中不同所述归约节点所对应的键值分别合并,合并得到所述至少一个数据段。
  8. 根据权利要求6或7所述的数据处理装置,其特征在于,所述远端共享分区的个数等于所述映射节点的个数与所述归约节点的个数的乘积,每个所述远端共享分区被一个所述映射节点和一个所述归约节点共享;所述存储单元,用于将所述数据段以文件格式存储在远端共享分区中,具体包括:
    所述存储单元,用于将包含所述第一数据段的文件,存储至被所述第一映射节点和所述第一归约节点共享的远端共享分区。
  9. 根据权利要求6至8任一项所述的数据处理装置,其特征在于,所述第一映射节点通过挂载方式访问所述远端共享分区。
  10. 根据权利要求6至9任一项所述的数据处理装置,其特征在于,所述存储池为内存池。
  11. 一种计算机设备,所述计算机设备包括处理器和存储器,所述处理器与存储器总线连接,其特征在于,所述存储器用于存储计算机指令,当所述计算机设备运行时,所述处理器执行所述存储器存储的所述计算机指令,以使所述计算机设备执行权利要求1至5任一项所述的数据处理方法。
  12. 一种数据处理方法,其特征在于,所适用的系统包括中央处理器CPU池和存储池,所述CPU池与所述存储池通信连接;所述CPU池包含至少两个CPU,所述CPU池上运行有至少一个映射节点和至少一个归约节点,所述至少一个映射节点包括第一映射节点,所述至少一个归约节点包括第一归约节点,所述第一映射节点和所述第一归约节点运行在所述CPU池中的不同CPU上;所述存储池包括的远端共享分区被所述第一映射节点和所述第一归 约节点共享;所述方法包括:
    所述第一归约节点获知所述第一映射节点将第一数据段存储到所述远端共享分区后,向所述第一映射节点发送数据请求消息,所述数据请求消息包含所述第一归约节点的标识,其中,所述第一数据段是指在所述第一映射节点根据其执行映射任务所得的执行结果获取到的至少一个数据段中由所述第一归约节点处理的数据段;
    所述第一归约节点接收所述第一映射节点反馈的响应消息,所述响应消息包括存储所述第一数据段的远端共享分区的分区名称和包含所述第一数据段的文件的文件名,所述响应消息由所述第一映射节点在响应所述数据请求消息时根据所述第一归约节点的标识生成;
    所述第一归约节点根据所述响应信息从所述远端共享分区存储的所述文件中获取所述第一数据段,对所述第一数据段执行归约任务。
  13. 根据权利要求12所述的方法,其特征在于,所述远端共享分区的个数等于所述映射节点的个数与所述归约节点的个数的乘积,每个所述远端共享分区被一个所述映射节点和一个所述归约节点共享;所述第一归约节点根据所述响应信息从所述远端共享分区存储的所述文件中获取所述第一数据段,包括:
    所述第一归约节点根据所述响应信息中的所述分区名称,确定被所述第一映射节点和所述第一归约节点共享的远端共享分区;
    所述第一归约节点从被所述第一映射节点和所述第一归约节点共享的远端共享分区中,根据所述响应信息中的所述文件名读取所述第一数据段。
  14. 根据权利要求12或13所述的方法,其特征在于,所述存储池包括被第一归约节点独享的第二远端私有分区;所述方法包括:
    所述第一归约节点在将从所述远端共享分区读取的所述第一数据段存储在本地内存的过程中,在所述本地内存的使用率达到预设使用率时,将后续从所述远端共享分区读取的所述第一数据段存储至所述第一归约节点的第二远端私有分区。
  15. 根据权利要求12至14任一项所述的方法,其特征在于,所述第一归约节点通过挂载方式访问所述远端共享分区。
  16. 根据权利要求12至15任一项所述的方法,其特征在于,所述存储 池为内存池。
  17. 一种数据处理装置,其特征在于,所适用的系统包括中央处理器CPU池和存储池,所述CPU池与所述存储池通信连接;所述CPU池包含至少两个CPU,所述CPU池上运行有至少一个映射节点和至少一个归约节点,所述至少一个映射节点包括第一映射节点,所述至少一个归约节点包括第一归约节点,所述第一映射节点和所述第一归约节点运行在所述CPU池中的不同CPU上;所述存储池包括的远端共享分区被所述第一映射节点和所述第一归约节点共享;所述装置作为所述第一归约节点,所述装置包括:
    请求单元,用于获知所述第一映射节点将第一数据段存储到所述远端共享分区后,向所述第一映射节点发送数据请求消息,所述数据请求消息包含所述第一归约节点的标识,其中,所述第一数据段是指在所述第一映射节点根据其执行映射任务所得的执行结果获取到的至少一个数据段中由所述第一归约节点处理的数据段;
    接收单元,用于接收所述第一映射节点反馈的响应消息,所述响应消息包括存储所述第一数据段的远端共享分区的分区名称和包含所述第一数据段的文件的文件名,所述响应消息由所述第一映射节点在响应所述数据请求消息时根据所述第一归约节点的标识生成;
    执行单元,用于根据所述响应信息从所述远端共享分区存储的所述文件中获取所述第一数据段,对所述第一数据段执行归约任务。
  18. 根据权利要求17所述的数据处理装置,其特征在于,所述远端共享分区的个数等于所述映射节点的个数与所述归约节点的个数的乘积,每个所述远端共享分区被一个所述映射节点和一个所述归约节点共享;所述执行单元,用于根据所述响应信息从所述远端共享分区存储的所述文件中获取所述第一数据段,包括:
    所述执行单元,用于根据所述响应信息中的所述分区名称,确定被所述第一映射节点和所述第一归约节点共享的远端共享分区;
    所述执行单元,用于从被所述第一映射节点和所述第一归约节点共享的远端共享分区中,根据所述响应信息中的所述文件名读取所述第一数据段。
  19. 根据权利要求17或18所述的数据处理装置,其特征在于,所述存储池包括被第一归约节点独享的第二远端私有分区;所述归约节点包括:
    转存储单元,用于在将从所述远端共享分区读取的所述第一数据段存储在本地内存的过程中,在所述本地内存的使用率达到预设使用率时,将后续从所述远端共享分区读取的所述第一数据段存储至所述第一归约节点的第二远端私有分区。
  20. 根据权利要求17至19任一项所述的数据处理装置,其特征在于,所述第一归约节点通过挂载方式访问所述远端共享分区。
  21. 根据权利要求17至20任一项所述的数据处理装置,其特征在于,所述存储池为内存池。
  22. 一种计算机设备,所述计算机设备包括处理器和存储器,所述处理器与存储器总线连接,其特征在于,所述存储器用于存储计算机指令,当所述计算机设备运行时,所述处理器执行所述存储器存储的所述计算机指令,以使所述计算机设备执行权利要求12至16任一项所述的数据处理方法。
  23. 一种系统,其特征在于,包括中央处理器CPU池和存储池,所述CPU池与所述存储池通信连接;所述CPU池包含至少两个CPU,所述CPU池上运行有至少一个映射节点和至少一个归约节点,所述至少一个映射节点包括第一映射节点,所述至少一个归约节点包括第一归约节点,所述第一映射节点和所述第一归约节点运行在所述CPU池中的不同CPU上;所述存储池包括的远端共享分区被所述第一映射节点和所述第一归约节点共享;
    所述第一映射节点,用于对数据分片执行映射任务,根据所述映射任务的执行结果获取至少一个数据段,所述至少一个数据段中的每个数据段由一个与之相对应的归约节点处理,其中,所述至少一个数据段包括第一数据段,第一数据段是指由所述第一归约节点处理的数据段;将所述第一数据段以文件格式存储在所述远端共享分区中;响应所述数据请求消息,并根据所述第一归约节点的标识生成响应消息,所述响应消息包括存储所述第一数据段的远端共享分区的分区名称和包含所述第一数据段的文件的文件名,向所述第一归约节点反馈所述响应消息;
    所述第一归约节点,用于获知所述第一映射节点将所述第一数据段存储到所述远端共享分区后,向所述第一映射节点发送数据请求消息,所述数据请求消息包含所述第一归约节点的标识;接收所述第一映射节点反馈的响应消息,根据所述响应信息从所述远端共享分区存储的所述文件中获取所述第 一数据段,对所述第一数据段执行归约任务。
  24. 根据权利要求23所述的系统,其特征在于,所述存储池包括被第一映射节点独享的第一远端私有分区;
    所述第一映射节点,用于根据所述映射任务的执行结果获取至少一个数据段,具体包括:所述第一映射节点,用于将所述第一映射节点执行映射任务时溢写的溢写文件存储至所述第一映射节点的第一远端私有分区,其中,单个所述溢写文件包括所述第一映射节点执行所述映射任务过程中单次从缓存所述映射任务的执行结果的缓冲区中溢写出的数据;对所述第一映射节点的第一远端私有分区中存储的多个所述溢写文件根据所述至少一个归约节点中不同所述归约节点所对应的键值分别合并,合并得到所述至少一个数据段。
  25. 根据权利要求23或24所述的系统,其特征在于,所述远端共享分区的个数等于所述映射节点的个数与所述归约节点的个数的乘积,每个所述远端共享分区被一个所述映射节点和一个所述归约节点共享;
    所述第一映射节点,用于将所述数据段以文件格式存储在远端共享分区中,具体包括:所述第一映射节点,用于将包含所述第一数据段的文件,存储至被所述第一映射节点和所述第一归约节点共享的远端共享分区;
    所述第一归约节点,用于根据所述响应信息从所述远端共享分区存储的所述文件中获取所述第一数据段,具体为:所述第一归约节点,用于根据所述响应信息中的所述分区名称,确定被所述第一映射节点和所述第一归约节点共享的远端共享分区,从被所述第一映射节点和所述第一归约节点共享的远端共享分区中,根据所述响应信息中的所述文件名读取所述第一数据段。
  26. 根据权利要求23至25任一项所述的系统,其特征在于,所述存储池包括被第一归约节点独享的第二远端私有分区;
    所述第一归约节点,还用于在将从所述远端共享分区读取的所述第一数据段存储在本地内存的过程中,在所述本地内存的使用率达到预设使用率时,将后续从所述远端共享分区读取的所述第一数据段存储至所述第一归约节点的第二远端私有分区。
  27. 根据权利要求23至26任一项所述的系统,其特征在于,
    所述第一映射节点,用于通过挂载方式访问所述远端共享分区;
    所述第一归约节点,用于通过挂载方式访问所述远端共享分区。
PCT/CN2015/100081 2015-12-31 2015-12-31 数据处理方法、装置和系统 WO2017113278A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
CN201580063657.5A CN108027801A (zh) 2015-12-31 2015-12-31 数据处理方法、装置和系统
PCT/CN2015/100081 WO2017113278A1 (zh) 2015-12-31 2015-12-31 数据处理方法、装置和系统
EP15911904.9A EP3376399A4 (en) 2015-12-31 2015-12-31 Data processing method, apparatus and system
US16/006,503 US10915365B2 (en) 2015-12-31 2018-06-12 Determining a quantity of remote shared partitions based on mapper and reducer nodes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2015/100081 WO2017113278A1 (zh) 2015-12-31 2015-12-31 数据处理方法、装置和系统

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/006,503 Continuation US10915365B2 (en) 2015-12-31 2018-06-12 Determining a quantity of remote shared partitions based on mapper and reducer nodes

Publications (1)

Publication Number Publication Date
WO2017113278A1 true WO2017113278A1 (zh) 2017-07-06

Family

ID=59224116

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/100081 WO2017113278A1 (zh) 2015-12-31 2015-12-31 数据处理方法、装置和系统

Country Status (4)

Country Link
US (1) US10915365B2 (zh)
EP (1) EP3376399A4 (zh)
CN (1) CN108027801A (zh)
WO (1) WO2017113278A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107643991A (zh) * 2017-09-22 2018-01-30 算丰科技(北京)有限公司 数据处理芯片和系统、数据存储转发处理方法
CN111324428A (zh) * 2019-09-20 2020-06-23 杭州海康威视系统技术有限公司 任务分配方法、装置、设备和计算机可读存储介质
WO2023040348A1 (zh) * 2021-09-14 2023-03-23 华为技术有限公司 分布式系统中数据处理的方法以及相关系统

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11475012B2 (en) * 2016-09-26 2022-10-18 Singlestore, Inc. Real-time data retrieval
WO2019036901A1 (zh) * 2017-08-22 2019-02-28 华为技术有限公司 一种加速处理方法及设备
US10831546B2 (en) * 2017-11-27 2020-11-10 International Business Machines Corporation Computing task management using tree structures
US10956593B2 (en) * 2018-02-15 2021-03-23 International Business Machines Corporation Sharing of data among containers running on virtualized operating systems
CN111240824B (zh) * 2018-11-29 2023-05-02 中兴通讯股份有限公司 一种cpu资源调度方法及电子设备
CN112037874B (zh) * 2020-09-03 2022-09-13 合肥工业大学 一种基于映射归约的分布式数据处理方法
CN115203133A (zh) * 2021-04-14 2022-10-18 华为技术有限公司 数据处理方法、装置、归约服务器及映射服务器
CN114237510A (zh) * 2021-12-17 2022-03-25 北京达佳互联信息技术有限公司 数据处理方法、装置、电子设备及存储介质

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101764835A (zh) * 2008-12-25 2010-06-30 华为技术有限公司 基于MapReduce编程架构的任务分配方法及装置
CN102255962A (zh) * 2011-07-01 2011-11-23 成都市华为赛门铁克科技有限公司 一种分布式存储方法、装置和系统
CN103324533A (zh) * 2012-03-22 2013-09-25 华为技术有限公司 分布式数据处理方法、装置及系统
US20130253888A1 (en) * 2012-03-22 2013-09-26 Microsoft Corporation One-pass statistical computations
CN103377091A (zh) * 2012-04-26 2013-10-30 国际商业机器公司 用于资源共享池中的作业的高效执行的方法和系统
US20140123115A1 (en) * 2012-10-26 2014-05-01 Jsmapreduce Corporation Hybrid local/remote infrastructure for data processing with lightweight setup, powerful debuggability, controllability, integration, and productivity features
US20140358869A1 (en) * 2013-05-31 2014-12-04 Samsung Sds Co., Ltd. System and method for accelerating mapreduce operation
CN104331322A (zh) * 2014-10-24 2015-02-04 华为技术有限公司 一种进程迁移方法和装置
CN104331330A (zh) * 2014-10-27 2015-02-04 华为技术有限公司 资源池生成方法以及装置

Family Cites Families (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1370966B1 (en) 2001-02-24 2010-08-25 International Business Machines Corporation A novel massively parrallel supercomputer
CN102209087B (zh) 2010-03-31 2014-07-09 国际商业机器公司 在具有存储网络的数据中心进行MapReduce数据传输的方法和系统
US8954967B2 (en) * 2011-05-31 2015-02-10 International Business Machines Corporation Adaptive parallel data processing
EP2746942A4 (en) * 2011-08-15 2015-09-30 Nec Corp DISTRIBUTED PROCESS MANAGEMENT DEVICE AND DISTRIBUTED PROCESS MANAGEMENT METHOD
CN102663207B (zh) 2012-04-28 2016-09-07 浪潮电子信息产业股份有限公司 一种利用gpu加速量子介观体系求解的方法
JP5935889B2 (ja) * 2012-08-02 2016-06-15 富士通株式会社 データ処理方法、情報処理装置およびプログラム
US20140059552A1 (en) * 2012-08-24 2014-02-27 International Business Machines Corporation Transparent efficiency for in-memory execution of map reduce job sequences
US20140236977A1 (en) * 2013-02-19 2014-08-21 International Business Machines Corporation Mapping epigenetic surprisal data througth hadoop type distributed file systems
US9152601B2 (en) * 2013-05-09 2015-10-06 Advanced Micro Devices, Inc. Power-efficient nested map-reduce execution on a cloud of heterogeneous accelerated processing units
US9424274B2 (en) * 2013-06-03 2016-08-23 Zettaset, Inc. Management of intermediate data spills during the shuffle phase of a map-reduce job
US10019727B2 (en) * 2013-10-09 2018-07-10 Selligent, Inc. System and method for managing message campaign data
US20150127691A1 (en) * 2013-11-01 2015-05-07 Cognitive Electronics, Inc. Efficient implementations for mapreduce systems
US9389994B2 (en) * 2013-11-26 2016-07-12 International Business Machines Corporation Optimization of map-reduce shuffle performance through shuffler I/O pipeline actions and planning
US9910860B2 (en) * 2014-02-06 2018-03-06 International Business Machines Corporation Split elimination in MapReduce systems
US9886310B2 (en) * 2014-02-10 2018-02-06 International Business Machines Corporation Dynamic resource allocation in MapReduce
US10375161B1 (en) * 2014-04-09 2019-08-06 VCE IP Holding Company LLC Distributed computing task management system and method
US10291696B2 (en) * 2014-04-28 2019-05-14 Arizona Board Of Regents On Behalf Of Arizona State University Peer-to-peer architecture for processing big data
US9996597B2 (en) * 2014-06-06 2018-06-12 The Mathworks, Inc. Unified mapreduce framework for large-scale data processing
US20160092493A1 (en) * 2014-09-29 2016-03-31 International Business Machines Corporation Executing map-reduce jobs with named data
US9367344B2 (en) * 2014-10-08 2016-06-14 Cisco Technology, Inc. Optimized assignments and/or generation virtual machine for reducer tasks
US10102029B2 (en) * 2015-06-30 2018-10-16 International Business Machines Corporation Extending a map-reduce framework to improve efficiency of multi-cycle map-reduce jobs

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101764835A (zh) * 2008-12-25 2010-06-30 华为技术有限公司 基于MapReduce编程架构的任务分配方法及装置
CN102255962A (zh) * 2011-07-01 2011-11-23 成都市华为赛门铁克科技有限公司 一种分布式存储方法、装置和系统
CN103324533A (zh) * 2012-03-22 2013-09-25 华为技术有限公司 分布式数据处理方法、装置及系统
US20130253888A1 (en) * 2012-03-22 2013-09-26 Microsoft Corporation One-pass statistical computations
CN103377091A (zh) * 2012-04-26 2013-10-30 国际商业机器公司 用于资源共享池中的作业的高效执行的方法和系统
US20140123115A1 (en) * 2012-10-26 2014-05-01 Jsmapreduce Corporation Hybrid local/remote infrastructure for data processing with lightweight setup, powerful debuggability, controllability, integration, and productivity features
US20140358869A1 (en) * 2013-05-31 2014-12-04 Samsung Sds Co., Ltd. System and method for accelerating mapreduce operation
CN104331322A (zh) * 2014-10-24 2015-02-04 华为技术有限公司 一种进程迁移方法和装置
CN104331330A (zh) * 2014-10-27 2015-02-04 华为技术有限公司 资源池生成方法以及装置

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIU, CHAO ET AL.: "Graph Data Processing Technology in Cloud Platform", JOURNAL OF COMPUTER APPLICATIONS, vol. 35, no. 1, 1 October 2015 (2015-10-01), pages 43 - 44, XP009508299, ISSN: 1001-9081 *
See also references of EP3376399A4 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107643991A (zh) * 2017-09-22 2018-01-30 算丰科技(北京)有限公司 数据处理芯片和系统、数据存储转发处理方法
CN107643991B (zh) * 2017-09-22 2023-09-19 北京算能科技有限公司 数据处理芯片和系统、数据存储转发处理方法
CN111324428A (zh) * 2019-09-20 2020-06-23 杭州海康威视系统技术有限公司 任务分配方法、装置、设备和计算机可读存储介质
CN111324428B (zh) * 2019-09-20 2023-08-22 杭州海康威视系统技术有限公司 任务分配方法、装置、设备和计算机可读存储介质
WO2023040348A1 (zh) * 2021-09-14 2023-03-23 华为技术有限公司 分布式系统中数据处理的方法以及相关系统

Also Published As

Publication number Publication date
CN108027801A (zh) 2018-05-11
EP3376399A4 (en) 2018-12-19
US20180293108A1 (en) 2018-10-11
US10915365B2 (en) 2021-02-09
EP3376399A1 (en) 2018-09-19

Similar Documents

Publication Publication Date Title
WO2017113278A1 (zh) 数据处理方法、装置和系统
KR102444832B1 (ko) 분산된 가상 명칭 공간 관리를 사용한 온-디맨드 스토리지 프로비져닝
US10091295B1 (en) Converged infrastructure implemented with distributed compute elements
US10241550B2 (en) Affinity aware parallel zeroing of memory in non-uniform memory access (NUMA) servers
CN108431796B (zh) 分布式资源管理系统和方法
US10248346B2 (en) Modular architecture for extreme-scale distributed processing applications
WO2015176636A1 (zh) 分布式数据库服务管理系统
JP2019139759A (ja) ソリッドステートドライブ(ssd)及び分散データストレージシステム並びにその方法
US9092272B2 (en) Preparing parallel tasks to use a synchronization register
KR20140049064A (ko) 분리된 가상 공간을 제공하기 위한 방법 및 장치
TWI605340B (zh) 用於s列表分配之系統與方法
US8347293B2 (en) Mutual exclusion domains to perform file system processes on stripes
US10599436B2 (en) Data processing method and apparatus, and system
CN114510321A (zh) 资源调度方法、相关装置和介质
WO2016095760A1 (zh) 数据动态重分布的方法、数据节点、名字节点及系统
US10824640B1 (en) Framework for scheduling concurrent replication cycles
US20230251904A1 (en) Quantum isolation zone orchestration
US11683374B2 (en) Containerized gateways and exports for distributed file systems
US20230196161A1 (en) Local services in quantum isolation zones

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15911904

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2015911904

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: DE