WO2012024937A1 - Method and system for realizing parallel computing - Google Patents

Method and system for realizing parallel computing

Info

Publication number
WO2012024937A1
Authority
WO
WIPO (PCT)
Prior art keywords
node
information
task
worker node
master node
Prior art date
Application number
PCT/CN2011/072818
Other languages
French (fr)
Chinese (zh)
Inventor
周扬
胡媛
张艺夕
李桂萍
黄翔
Original Assignee
中兴通讯股份有限公司
Application filed by 中兴通讯股份有限公司
Publication of WO2012024937A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1479 Generic software techniques for error detection or fault masking
    • G06F11/1482 Generic software techniques for error detection or fault masking by means of middleware or OS functionality
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1415 Saving, restoring, recovering or retrying at system level
    • G06F11/1438 Restarting or rejuvenating
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements

Definitions

  • The present invention relates to the field of cloud computing, and more particularly to a method and system for implementing parallel computing.
  • MapReduce was first proposed by Google engineers. It is a system architecture that can process massive amounts of data in parallel.
  • The MapReduce system works by automatically breaking a task into multiple subtasks and executing these subtasks in parallel; when all subtasks have completed, the processing results are aggregated.
  • FIG. 1 shows the architecture of the existing MapReduce system.
  • MapReduce divides data processing into two phases: the Map phase and the Reduce phase.
  • The MapReduce system mainly includes a client (Client), a master (Master) node, and a worker (Worker) node. The Client submits a MapReduce task. The Master node automatically decomposes the MapReduce task into Map tasks and Reduce tasks and schedules them for execution on Worker nodes. A Worker node, after receiving a Map or Reduce task request from the Master node, performs the task in the request.
  • The MapReduce system automatically provides parallel processing, data distribution, fault tolerance, and load balancing.
  • The present invention provides a method for implementing parallel computing, the method comprising:
  • When a Worker node performing a task fails, a new Worker node obtains the recorded log information of the faulty Worker node and, according to the log information, continues processing the business process of the faulty Worker node from the breakpoint at the time of the fault; and/or, when the Master node performing the task fails, a new Master node, after starting, obtains the recorded log information of the faulty Master node and, according to the log information, continues processing the business process of the faulty Master node from the breakpoint at the time of the fault.
  • The new Worker node obtains the log information of the faulty Worker node as follows:
  • After receiving the information about performing the task sent by the Master node, the new Worker node sends query request information to the global information monitoring function entity;
  • After receiving the query request information, the global information monitoring function entity searches the log information of the faulty Worker node that it has saved, according to the query request information, and returns the log information of the faulty Worker node to the new Worker node.
  • The new Master node obtains the log information of the faulty Master node as follows:
  • After receiving the query request information sent by the new Master node, the global information monitoring function entity searches the log information of the faulty Master node that it has saved, according to the query request information, and returns the log information of the faulty Master node to the new Master node.
  • Before recording the log information of the Master node and the Worker node, the method further includes:
  • A node is selected as the Master node for executing the task, and the input data source to be processed is sent to the selected Master node;
  • After receiving the input data source to be processed, the Master node performing the task segments the input data source;
  • The Master node performing the task selects the Worker nodes that will execute the task and assigns a task to be executed to each of them;
  • Each Worker node performing the task reads its segmented data block and performs the assigned task.
  • Recording the log information of the Worker node and the Master node performing the task is:
  • The Worker node and the Master node performing the task upload their own log information to the global information monitoring function entity in real time;
  • The global information monitoring function entity saves the log information of the Worker node and the Master node performing the task.
  • The method further includes:
  • After receiving the log information uploaded by a Worker node, the global information monitoring function entity determines whether the node identity information carried in the log information is consistent with the saved identity information of the Worker node; when it is consistent, the log information of the Worker node is saved, and when it is inconsistent, the log information of the Worker node is discarded.
  • The present invention also provides a method for obtaining log information, the method comprising:
  • The saved log information of the faulty Worker node is searched for according to the query request information, and the log information of the faulty Worker node is returned to the new Worker node.
  • When the Master node performing the task fails and query request information sent by the new Master node is received, the saved log information of the faulty Master node is searched for according to the query request information, and the log information of the faulty Master node is returned to the new Master node.
  • Before saving in real time the log information of the Master node and the Worker node performing the task, the method further includes:
  • When consistency is determined, the log information of the Worker node is saved; when inconsistency is determined, the log information of the Worker node is discarded.
  • The present invention also provides a global information monitoring entity for obtaining log information, where the global information monitoring entity includes a storage module and a query module;
  • The storage module is configured to save, after the overall task is started, the log information uploaded in real time by the Master node and the Worker node performing the task;
  • The query module is configured to: when a Worker node performing the task fails, and after receiving query request information sent by the new Worker node, search the log information of the faulty Worker node saved by the storage module according to the query request information, and return the log information of the faulty Worker node to the new Worker node; and/or, when the Master node performing the task fails, and after receiving query request information sent by the new Master node, search the log information of the faulty Master node saved by the storage module according to the query request information, and return the log information of the faulty Master node to the new Master node.
  • The global information monitoring entity further includes a determining module, configured to determine, when a Worker node uploads log information, whether the node identity information carried in the log information is consistent with the saved identity information of the Worker node; when it is consistent, the log information of the Worker node is saved, and when it is inconsistent, the log information of the Worker node is discarded.
  • The storage module is further configured to save the identity information of the Worker node.
  • The present invention also provides a system for implementing parallel computing, the system comprising: a global information monitoring function entity, a first Worker node, and/or a first Master node;
  • The global information monitoring function entity is configured to record the log information of the Worker node and the Master node performing the task after the overall task is started;
  • The first Worker node is configured to: when a Worker node performing the task fails, obtain the log information of the faulty Worker node from the global information monitoring function entity and, according to the log information, continue processing the business process of the faulty Worker node from the breakpoint at the time of the fault.
  • The first Master node is configured to: when the Master node performing the task fails, obtain, after starting, the log information of the faulty Master node from the global information monitoring function entity and, according to the log information, continue processing the business process of the faulty Master node from the breakpoint at the time of the fault.
  • The system further includes: a User Program unit, a second Master node, and a second Worker node; wherein
  • The User Program unit is configured to, after starting the overall task by calling the client library, select the second Master node as the Master node for executing the task and send the input data source to be processed to the second Master node;
  • The second Master node is configured to, after receiving the input data source to be processed sent by the User Program unit, segment the input data source, then select the Worker nodes that will perform the task and assign a task to be executed to each of them;
  • The second Worker node is configured to perform the assigned task after receiving the task assigned by the second Master node.
  • The second Master node is further configured to: when the second Worker node fails, send information about performing the task to the first Worker node;
  • The first Worker node is configured to: after receiving the information sent by the second Master node, send query request information to the global information monitoring function entity, and receive the log information of the second Worker node returned by the global information monitoring function entity;
  • The global information monitoring function entity is further configured to: after receiving the query request information sent by the first Worker node, search the log information of the second Worker node that it has saved, according to the query request information, and return the log information of the second Worker node to the first Worker node.
  • The first Master node is configured to: when the second Master node fails, send query request information to the global information monitoring function entity, and receive the log information of the second Master node returned by the global information monitoring function entity;
  • The global information monitoring function entity is further configured to: after receiving the query request information sent by the first Master node, search the log information of the second Master node that it has saved, according to the query request information, and return the log information of the second Master node to the first Master node.
  • The second Worker node is further configured to upload its own log information to the global information monitoring function entity in real time after the overall task is started;
  • The second Master node is further configured to upload its own log information to the global information monitoring function entity in real time after the overall task is started;
  • The global information monitoring function entity is further configured to save the log information of the second Worker node and the second Master node.
  • The global information monitoring function entity is further configured to determine, before saving the log information of the second Worker node and the second Master node, whether the node identity information carried in the log information of the second Worker node is consistent with the saved identity information of the Worker node; when it is consistent, the log information of the second Worker node is saved, and when it is inconsistent, the log information of the second Worker node is discarded.
  • The new Worker node obtains the recorded log information of the faulty Worker node and, according to the log information, continues processing the business process of the faulty Worker node from the breakpoint at the time of the fault; and/or
  • The new Master node obtains the log information of the faulty Master node and, according to the log information, continues processing the business process of the faulty Master node from the breakpoint at the time of the fault. In this way, when a node fails, the task can continue from the breakpoint at the time of the fault, thereby improving data processing efficiency, saving system resources, and improving the user experience.
  • FIG. 1 is a schematic structural diagram of an existing MapReduce system
  • FIG. 2 is a schematic flowchart of a method for implementing parallel computing according to an embodiment of the present invention
  • FIG. 3 is a schematic flowchart of a method before recording log information of a Master node and a Worker node according to an embodiment of the present invention
  • FIG. 4 is a schematic structural diagram of a system for implementing parallel computing according to an embodiment of the present invention.
  • The method for implementing parallel computing according to the present invention includes the following steps:
  • Step 201: After the overall task is started, record the log information of the Worker node and the Master node that execute the task;
  • Step 301: After the User Program starts the overall task by calling the client library, a node is selected as the Master node for executing the task, and the input data source to be processed is sent to the selected Master node;
  • Step 302: After receiving the input data source to be processed, the Master node performing the task segments the input data source, and then step 303 is performed;
  • The Master node can call the split function in the User Program to segment the input data source; the User Program can inform the Master node of the calling parameters in advance, or can send the function to be called to the Master node in advance by means of a message.
  • Step 303: The Master node performing the task selects the Worker nodes that will execute the task and assigns a task to be executed to each of them;
  • Step 304: Each Worker node performing the task reads its segmented data block and performs the assigned task.
  • Step 30 is the same as the existing process and is not described again here.
  • The log information includes status information about the operation of the node, and the status and key data of the business process flow. The status information about the operation of the node may be: network status, CPU, memory, disk space, execution status of a Map task or a Reduce task, and so on. The status and key data of the business process flow depend on the specific business process being handled; for example, in a business process that uses MapReduce to send weather-forecast short messages to 100,000 mobile phone users in parallel, the status and key data of the business process flow include the phone number information of the mobile phone users;
  • In actual application, a global information monitoring function entity may be added to the MapReduce system, and this entity records the log information of the Master node and the Worker node. The identity information of the global information monitoring function entity is configured in advance on all nodes in the MapReduce system; it may be an Internet Protocol (IP) address, an identification number (ID), or any other information that identifies the global information monitoring function entity;
  • A node in the MapReduce system may upload its own log information to the global information monitoring function entity according to the identity information of that entity; after the overall task is started, the Master node and the Worker node upload their own log information to the global information monitoring function entity in real time;
  • After the Master node decides which Worker nodes the overall task is assigned to, it sends the identity information of those Worker nodes to the global information monitoring function entity, which receives and saves it. When a Worker node uploads log information, the global information monitoring function entity decides, according to the saved identity information of the Worker nodes, whether to save the log information of that Worker node; specifically, it determines whether the node identity information carried in the log information is consistent with the saved identity information, saves the log information if so, and discards it otherwise;
  • The identity information of a Worker node refers to information that identifies the Worker node, such as an IP address, machine name, or ID;
  • The specific form of the global information monitoring function entity may be a log database, or may be an aggregate composed of one or more nodes;
  • The Worker node refers to the collection of all Worker nodes that perform the task.
  • Step 202: When a Worker node performing the task fails, a new Worker node obtains the recorded log information of the faulty Worker node and, according to the log information, continues processing the business process of the faulty Worker node from the breakpoint at the time of the fault; and/or, when the Master node performing the task fails, a new Master node starts, obtains the recorded log information of the faulty Master node and, according to the log information, continues processing the business process of the faulty Master node from the breakpoint at the time of the fault;
  • The Master node can learn that a Worker node performing the task is faulty through heartbeat detection between itself and the Worker node. After a Worker node performing the task fails, the Master node can select a node as the new Worker node based on the load of the other nodes in the MapReduce system, that is, using the automatic load balancing of the existing MapReduce system. The new Worker node may be a healthy Worker node that is performing the task, or a healthy Worker node that is not performing the task;
  • The User Program of the MapReduce system starts a timer. If the task execution result returned by the Master node has not been received when the timer expires, the Master node is considered to be faulty, and a new node needs to be selected as the Master node.
  • The selection can be based on the load of the other nodes in the MapReduce system, that is, using the automatic load balancing of the existing MapReduce system. The new Master node may be a Master node that performs the task, or another Master node that does not perform the task;
  • The new Worker node obtains the log information of the faulty Worker node as follows: the Master node sends information about performing the task to the new Worker node;
  • After receiving the information, the new Worker node sends query request information to the global information monitoring function entity;
  • After receiving the query request information, the global information monitoring function entity searches the log information of the faulty Worker node that it has saved, according to the query request information, and returns the log information of the faulty Worker node to the new Worker node;
  • The information about performing the task includes a task data source, a task ID, and the identity information of the faulty Worker node;
  • The query request information includes a task ID, the identity information of the faulty Worker node, and the like; the identity information of the faulty Worker node may be an IP address, a machine name, an ID, or any other information that identifies the faulty Worker node;
  • The new Master node obtains the log information of the faulty Master node as follows: the new Master node sends query request information to the global information monitoring function entity; after receiving the query request information, the global information monitoring function entity searches the log information of the faulty Master node that it has saved, according to the query request information, and returns the log information of the faulty Master node to the new Master node;
  • The query request information includes the identity information of the faulty Master node, task ID information, or any other information that can identify the log records of the faulty Master node; the identity information of the faulty Master node may be an IP address, a machine name, an ID, or any other information that identifies the faulty Master node.
  • The Worker node calls the external interface to upload its own log information to the global information monitoring function entity and notifies the Master node that the task it is responsible for has been processed. After receiving the notification, the Master node marks the task of that Worker node as completed. After receiving notifications from all Worker nodes that processing is complete, the Master node ends the overall task.
  • The present invention further provides a global information monitoring entity for obtaining log information, where the global information monitoring entity includes a storage module and a query module;
  • The storage module is configured to save, after the overall task is started, the log information uploaded in real time by the Master node and the Worker node performing the task;
  • The query module is configured to: when a Worker node performing the task fails, and after receiving query request information sent by the new Worker node, search the log information of the faulty Worker node saved by the storage module according to the query request information, and return the log information of the faulty Worker node to the new Worker node; and/or, when the Master node performing the task fails, and after receiving query request information sent by the new Master node, search the log information of the faulty Master node saved by the storage module according to the query request information, and return the log information of the faulty Master node to the new Master node.
  • The global information monitoring entity may further include a determining module, configured to determine, when a Worker node uploads log information, whether the node identity information carried in the log information is consistent with the saved identity information of the Worker node; when it is consistent, the log information of the Worker node is saved, and otherwise the log information of the Worker node is discarded.
  • The storage module is further configured to save the identity information of the Worker node.
  • the present invention further provides a system for implementing parallel computing.
  • The system includes: a global information monitoring function entity 41, a first Worker node 42, and/or a first Master node 43;
  • The global information monitoring function entity 41 is configured to record the log information of the Worker node and the Master node that perform the task after the overall task is started;
  • The first Worker node 42 is configured to: when a Worker node performing the task fails, obtain the log information of the faulty Worker node from the global information monitoring function entity 41 and, according to the log information, continue processing the business process of the faulty Worker node from the breakpoint at the time of the fault.
  • The first Master node 43 is configured to: when the Master node performing the task fails, obtain, after starting, the log information of the faulty Master node from the global information monitoring function entity 41 and, according to the log information, continue processing the business process of the faulty Master node from the breakpoint at the time of the fault.
  • The first Worker node 42 may be a healthy Worker node that is performing the task, or a healthy Worker node that is not performing the task;
  • The first Master node 43 may be a Master node that performs the task, or another Master node that does not perform the task.
  • The system may further include a User Program unit, a second Master node, and a second Worker node;
  • The User Program unit is configured to, after starting the overall task by calling the client library, select the second Master node as the Master node for performing the task and send the input data source to be processed to the second Master node;
  • The second Master node is configured to, after receiving the input data source to be processed sent by the User Program unit, segment the input data source, then select the Worker nodes that will perform the task and assign a task to be executed to each of them;
  • The second Worker node is configured to perform the assigned task after receiving the task assigned by the second Master node.
  • The second Worker node may be a collection of more than one Worker node performing the task.
  • The second Master node is further configured to send information about performing the task to the first Worker node 42 when the second Worker node fails;
  • The first Worker node 42 is specifically configured to: after receiving the information sent by the second Master node, send query request information to the global information monitoring function entity 41, and receive the log information of the second Worker node returned by the global information monitoring function entity 41.
  • The global information monitoring function entity 41 is further configured to: after receiving the query request information sent by the first Worker node 42, search the log information of the second Worker node that it has saved, according to the query request information, and return the log information of the second Worker node to the first Worker node 42.
  • The first Master node 43 is specifically configured to: when the second Master node fails, send query request information to the global information monitoring function entity 41, and receive the log information of the second Master node returned by the global information monitoring function entity 41.
  • The global information monitoring function entity 41 is further configured to: after receiving the query request information sent by the first Master node 43, search the log information of the second Master node that it has saved, according to the query request information, and return the log information of the second Master node to the first Master node 43.
  • The second Worker node is further configured to upload its own log information to the global information monitoring function entity 41 in real time after the overall task is started;
  • The second Master node is further configured to upload its own log information to the global information monitoring function entity 41 in real time after the overall task is started;
  • The global information monitoring function entity 41 is further configured to save the log information of the second Worker node and the second Master node.
  • The global information monitoring function entity 41 is further configured to determine, before saving the log information of the second Worker node and the second Master node, whether the node identity information carried in the log information of the second Worker node is consistent with the saved identity information of the Worker node; when it is consistent, the log information of the second Worker node is saved, and otherwise the log information of the second Worker node is discarded.
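
The recovery flow described in the list above (real-time log upload, the identity check on uploaded logs, the log query by a replacement node, and resumption from the breakpoint) can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: the class `GlobalInfoMonitor`, the functions `send_messages` and `run_with_recovery`, the node names, and the `fail_after` fault simulation are all hypothetical, and the "key data" logged is simply each processed phone number, following the short-message example in the text.

```python
from dataclasses import dataclass, field


@dataclass
class GlobalInfoMonitor:
    """Hypothetical in-memory stand-in for the global information monitoring function entity."""
    known_workers: set = field(default_factory=set)  # identity info sent by the Master node
    logs: dict = field(default_factory=dict)         # (task_id, node_id) -> logged entries

    def register_worker(self, node_id):
        # Storage module: save the identity information of a Worker node.
        self.known_workers.add(node_id)

    def upload_log(self, task_id, node_id, entry):
        # Determining module: discard log information from an unknown identity.
        if node_id not in self.known_workers:
            return False
        self.logs.setdefault((task_id, node_id), []).append(entry)
        return True

    def query(self, task_id, faulty_node_id):
        # Query module: return the saved log information of the faulty node.
        return list(self.logs.get((task_id, faulty_node_id), []))


def send_messages(numbers, task_id, node_id, monitor, already_sent=(), fail_after=None):
    """Send to every number not in already_sent, logging each one as it is processed.

    fail_after simulates a fault after that many sends (illustration only).
    """
    skip = set(already_sent)
    sent = []
    for number in numbers:
        if number in skip:
            continue  # already handled before the breakpoint
        if fail_after is not None and len(sent) >= fail_after:
            raise RuntimeError("simulated Worker fault")
        sent.append(number)  # the actual send would happen here
        monitor.upload_log(task_id, node_id, number)
    return sent


def run_with_recovery(numbers):
    monitor = GlobalInfoMonitor()
    monitor.register_worker("worker-a")
    monitor.register_worker("worker-b")
    try:
        send_messages(numbers, "task-1", "worker-a", monitor, fail_after=2)
    except RuntimeError:
        pass  # the Master node would detect this fault, e.g. by heartbeat
    # The new Worker queries by task ID and faulty-node identity, then resumes.
    recovered = monitor.query("task-1", "worker-a")
    resumed = send_messages(numbers, "task-1", "worker-b", monitor, already_sent=recovered)
    return recovered, resumed
```

In this sketch the replacement Worker skips every number found in the recovered log, so no message is sent twice and none is re-sent from the start; a real system would also need the log upload and the send to be kept consistent with each other, which the sketch does not address.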

Abstract

A method for realizing parallel computing is disclosed. The method comprises: recording log information of the Worker nodes and Master nodes executing tasks after an overall task is started; when a fault occurs on a Worker node executing a task, a new Worker node obtains the recorded log information of the faulty Worker node and, according to the log information, continues processing the operation flow of the faulty Worker node from the breakpoint at which the fault occurred; and/or, when a fault occurs on a Master node executing a task, a new Master node, after being started, obtains the recorded log information of the faulty Master node and, according to the log information, continues processing the operation flow of the faulty Master node from the breakpoint at which the fault occurred. A system for realizing parallel computing is also disclosed. With the method and the system, when a fault occurs on a node, task execution can continue from the breakpoint at which the fault occurred.

Description

一种实现并行计算的方法及系统 技术领域  Method and system for realizing parallel computing
本发明涉及云计算领域, 特别是指一种实现并行计算的方法及系统。 背景技术  The present invention relates to the field of cloud computing, and more particularly to a method and system for implementing parallel computing. Background technique
MapReduce, first proposed by Google engineers, is a system architecture capable of processing massive amounts of data in parallel. A MapReduce system works as follows: it automatically decomposes a task into multiple subtasks, executes these subtasks in parallel, and, once all subtasks have finished, aggregates the processing results.
FIG. 1 is a schematic diagram of the architecture of an existing MapReduce system. As can be seen from FIG. 1, MapReduce divides data processing into two phases: the Map phase and the Reduce phase. A MapReduce system mainly includes a Client, a Master node, and Worker nodes. The Client submits a MapReduce task; the Master node automatically decomposes the MapReduce task into Map tasks and Reduce tasks and then schedules these tasks onto Worker nodes for execution; a Worker node, upon receiving a Map or Reduce task request from the Master node, executes the task in the request. A MapReduce system automatically provides functions such as parallel processing, data distribution, fault tolerance, and load balancing.
In an existing MapReduce system, when a Worker node fails while executing a task, the Master node reassigns the task for which the faulty Worker node was responsible to another Worker node, which, after receiving the task, re-executes it from the beginning. When the Master node fails during execution of the overall task, the entire task must be re-executed from the beginning. This reduces data processing efficiency and in turn degrades the user experience.

Summary of the Invention
In view of this, the main object of the present invention is to provide a method and system for realizing parallel computing that, when a node fails, can continue executing the task from the breakpoint at the moment the fault occurred.
To achieve the above object, the technical solution of the present invention is implemented as follows:
The present invention provides a method for realizing parallel computing, the method comprising:
after an overall task is started, recording log information of the Worker nodes and the Master node executing the task;
when a Worker node executing the task fails, a new Worker node obtains the recorded log information of the faulty Worker node and, according to the log information, continues processing the service flow of the faulty Worker node from the breakpoint at which the fault occurred; and/or, when the Master node executing the task fails, a new Master node, after being started, obtains the recorded log information of the faulty Master node and, according to the log information, continues processing the service flow of the faulty Master node from the breakpoint at which the fault occurred.
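The idea of continuing from a breakpoint can be illustrated with a minimal sketch. It is not the patented implementation: the in-memory `log_store`, the function `run_task`, and the `fail_at` parameter are hypothetical names introduced only for illustration, and the recorded log information is reduced to the index of the next unprocessed item.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for the recorded log information:
# task ID -> index of the next unprocessed item.
log_store: Dict[str, int] = {}


def run_task(task_id: str, items: List[str], handle: Callable[[str], None],
             fail_at: int = -1) -> None:
    """Process items in order, starting at the logged breakpoint if one exists."""
    start = log_store.get(task_id, 0)        # breakpoint left by a previous node
    for index in range(start, len(items)):
        if index == fail_at:                 # simulated node fault
            raise RuntimeError("worker failed")
        handle(items[index])
        log_store[task_id] = index + 1       # record progress after each item


processed: List[str] = []
items = ["a", "b", "c", "d"]
try:
    run_task("task-1", items, processed.append, fail_at=2)  # faulty Worker
except RuntimeError:
    pass
run_task("task-1", items, processed.append)                 # new Worker resumes
print(processed)  # ['a', 'b', 'c', 'd']
```

Each item is handled exactly once even though the first run fails partway, because the second run reads the recorded position instead of starting from index 0.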
In the above solution, the new Worker node obtains the log information of the faulty Worker node as follows:
the Master node sends task execution information to the new Worker node;
after receiving the information, the new Worker node sends query request information to the global information monitoring function entity;
after receiving the query request information, the global information monitoring function entity looks up, according to the query request information, the log information of the faulty Worker node that it has saved, and returns the log information of the faulty Worker node to the new Worker node.
In the above solution, the new Master node obtains the log information of the faulty Master node as follows:
the new Master node sends query request information to the global information monitoring function entity;
after receiving the query request information, the global information monitoring function entity looks up, according to the query request information, the log information of the faulty Master node that it has saved, and returns the log information of the faulty Master node to the new Master node.
In the above solution, before the log information of the Master node and the Worker nodes is recorded, the method further includes:
after the User Program starts the overall task by calling the client library, it selects a node as the Master node executing the task, and then sends the input data source to be processed to the selected Master node;
after receiving the input data source to be processed, the Master node executing the task splits the input data source;
the Master node executing the task selects the Worker nodes that will execute the task, and assigns to each of these Worker nodes the task it is to execute;
each Worker node executing the task reads its split data block and executes the assigned task.
In the above solution, the log information of the Worker nodes and the Master node executing the task is recorded as follows:
after the overall task is started, the Worker nodes and the Master node executing the task upload their own log information to the global information monitoring function entity in real time;
the global information monitoring function entity saves the log information of the Worker nodes and the Master node executing the task.
In the above solution, before the global information monitoring function entity saves the log information of the Worker nodes and the Master node executing the task, the method further includes:
after receiving the log information uploaded by a Worker node, the global information monitoring function entity determines whether the node identity information carried in the log information of the Worker node is consistent with the saved identity information of the Worker node; if it is consistent, the log information of the Worker node is saved, and if it is inconsistent, the log information of the Worker node is discarded.
The present invention also provides a method for obtaining log information, the method comprising:
after an overall task is started, saving in real time the log information of the Master node and the Worker nodes executing the task;
when a Worker node executing the task fails, and after query request information sent by a new Worker node is received, looking up the saved log information of the faulty Worker node according to the query request information, and returning the log information of the faulty Worker node to the new Worker node; and/or, when the Master node executing the task fails, and after query request information sent by a new Master node is received, looking up the saved log information of the faulty Master node according to the query request information, and returning the log information of the faulty Master node to the new Master node.
In the above solution, before the log information of the Master node and the Worker nodes executing the task is saved in real time, the method further includes:
determining whether the node identity information carried in the log information of a Worker node is consistent with the saved identity information of the Worker node; if it is consistent, saving the log information of the Worker node, and if it is inconsistent, discarding the log information of the Worker node.
The present invention also provides a global information monitoring entity for obtaining log information, the global information monitoring entity including a storage module and a query module, wherein:
the storage module is configured to save in real time, after an overall task is started, the log information uploaded by the Master node and the Worker nodes executing the task;
the query module is configured to: when a Worker node executing the task fails, and after query request information sent by a new Worker node is received, look up, according to the query request information, the log information of the faulty Worker node saved by the storage module, and return the log information of the faulty Worker node to the new Worker node; and/or, when the Master node executing the task fails, and after query request information sent by a new Master node is received, look up, according to the query request information, the log information of the faulty Master node saved by the storage module, and return the log information of the faulty Master node to the new Master node.
In the above solution, the global information monitoring entity further includes a judgment module, configured to determine, when a Worker node uploads log information, whether the node identity information carried in the log information of the Worker node is consistent with the saved identity information of the Worker node; if it is consistent, the log information of the Worker node is saved, and if it is inconsistent, the log information of the Worker node is discarded.
In the above solution, the storage module is further configured to save the identity information of the Worker nodes.
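The three responsibilities described above (storing uploaded logs, checking the uploader's identity, and answering queries) can be sketched in one small class. This is a minimal illustration under stated assumptions, not the entity's actual design: the class name `GlobalInfoMonitor`, its method names, and the dictionary-shaped log records are all hypothetical, and Master log uploads are omitted because the text describes the identity check only for Worker nodes.

```python
from typing import Dict, List, Set


class GlobalInfoMonitor:
    """Hypothetical sketch combining storage, identity check, and query."""

    def __init__(self) -> None:
        self._known_workers: Set[str] = set()     # saved Worker identities
        self._logs: Dict[str, List[dict]] = {}    # node ID -> log records

    def register_worker(self, node_id: str) -> None:
        """Storage: save the identity information of a Worker node."""
        self._known_workers.add(node_id)

    def upload_worker_log(self, node_id: str, record: dict) -> bool:
        """Identity check: keep the record only if the identity is known."""
        if node_id not in self._known_workers:
            return False                          # inconsistent: discard
        self._logs.setdefault(node_id, []).append(record)
        return True                               # consistent: saved

    def query(self, failed_node_id: str) -> List[dict]:
        """Query: return a copy of the saved log of a faulty node."""
        return list(self._logs.get(failed_node_id, []))


monitor = GlobalInfoMonitor()
monitor.register_worker("worker-1")
accepted = monitor.upload_worker_log("worker-1", {"task_id": "t1", "next_index": 3})
rejected = monitor.upload_worker_log("worker-9", {"task_id": "t1", "next_index": 0})
print(accepted, rejected)         # True False
print(monitor.query("worker-1"))  # [{'task_id': 't1', 'next_index': 3}]
```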
The present invention also provides a system for realizing parallel computing, the system including: a global information monitoring function entity, a first Worker node, and/or a first Master node, wherein:
the global information monitoring function entity is configured to record, after an overall task is started, the log information of the Worker nodes and the Master node executing the task;
the first Worker node is configured to, when a Worker node executing the task fails, obtain the log information of the faulty Worker node from the global information monitoring function entity and, according to the log information, continue processing the service flow of the faulty Worker node from the breakpoint at which the fault occurred; and/or,
the first Master node is configured to, when the Master node executing the task fails, obtain, after being started, the log information of the faulty Master node from the global information monitoring function entity and, according to the log information, continue processing the service flow of the faulty Master node from the breakpoint at which the fault occurred.
In the above solution, the system further includes: a User Program unit, a second Master node, and a second Worker node, wherein:
the User Program unit is configured to, after starting the overall task by calling the client library, select the second Master node as the Master node executing the task, and send the input data source to be processed to the second Master node;
the second Master node is configured to, after receiving the input data source to be processed from the User Program unit, split the input data source, then select the Worker nodes that will execute the task, and assign to each of these Worker nodes the task it is to execute;
the second Worker node is configured to execute the assigned task after receiving the task assigned by the second Master node.
In the above solution, the second Master node is further configured to send task execution information to the first Worker node when the second Worker node fails;
the first Worker node is configured to: after receiving the information sent by the second Master node, send query request information to the global information monitoring function entity, and receive the log information of the second Worker node returned by the global information monitoring function entity;
the global information monitoring function entity is further configured to, after receiving the query request information sent by the first Worker node, look up, according to the query request information, the log information of the second Worker node that it has saved, and return the log information of the second Worker node to the first Worker node.
In the above solution, the first Master node is configured to: when the second Master node fails, send query request information to the global information monitoring function entity, and receive the log information of the second Master node returned by the global information monitoring function entity;
the global information monitoring function entity is further configured to, after receiving the query request information sent by the first Master node, look up, according to the query request information, the log information of the second Master node that it has saved, and return the log information of the second Master node to the first Master node.
In the above solution, the second Worker node is further configured to upload its own log information to the global information monitoring function entity in real time after the overall task is started;
the second Master node is further configured to upload its own log information to the global information monitoring function entity in real time after the overall task is started;
the global information monitoring function entity is further configured to save the log information of the second Worker node and the second Master node.
In the above solution, the global information monitoring function entity is further configured to determine, before saving the log information of the second Worker node and the second Master node, whether the node identity information carried in the log information of the second Worker node is consistent with the saved identity information of the Worker node; if it is consistent, the log information of the second Worker node is saved, and if it is inconsistent, the log information of the second Worker node is discarded.
In the method and system for realizing parallel computing provided by the present invention, a new Worker node obtains the recorded log information of the faulty Worker node and, according to the log information, continues processing the service flow of the faulty Worker node from the breakpoint at which the fault occurred; and/or a new Master node obtains the recorded log information of the faulty Master node and, according to the log information, continues processing the service flow of the faulty Master node from the breakpoint at which the fault occurred. In this way, when a node fails, the task can continue from the breakpoint at the moment the fault occurred, which improves data processing efficiency, saves system resources, and improves the user experience.

Brief Description of the Drawings
FIG. 1 is a schematic diagram of the architecture of an existing MapReduce system;
FIG. 2 is a schematic flowchart of a method for realizing parallel computing according to an embodiment of the present invention;
FIG. 3 is a schematic flowchart of the method steps performed before the log information of the Master node and the Worker nodes is recorded, according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a system for realizing parallel computing according to an embodiment of the present invention.

Detailed Description
The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
The method for realizing parallel computing according to the present invention, as shown in FIG. 2, includes the following steps:
Step 201: after the overall task is started, record the log information of the Worker nodes and the Master node executing the task;
Here, before the log information of the Master node and the Worker nodes is recorded, the method may further include the following steps, as shown in FIG. 3:
Step 301: after the User Program starts the overall task by calling the client library, it selects a node as the Master node executing the task, and then sends the input data source to be processed to the selected Master node;
Step 302: after receiving the input data source to be processed, the Master node executing the task splits the input data source, and then step 303 is performed;
Here, the Master node may call the split function in the User Program to split the input data source; the User Program may inform the Master node of the call parameters in advance, or may send the function to be called to the Master node in advance by means of a message.
Step 303: the Master node executing the task selects the Worker nodes that will execute the task, and assigns to each of these Worker nodes the task it is to execute;
Step 304: each Worker node executing the task reads its split data block and executes the assigned task;
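The splitting and assignment in steps 302 and 303 can be sketched as follows. The text does not specify how the split function divides the data, so `split_evenly` (contiguous blocks of near-equal size) is an assumption chosen for illustration, as are the names `assign_tasks` and the worker identifiers; the sketch also assumes at least one Worker is available.

```python
from typing import Callable, Dict, List, Sequence


def split_evenly(records: Sequence[str], parts: int) -> List[List[str]]:
    """Hypothetical split function: contiguous blocks of near-equal size."""
    size, extra = divmod(len(records), parts)
    blocks: List[List[str]] = []
    start = 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)  # first blocks take the remainder
        blocks.append(list(records[start:end]))
        start = end
    return blocks


def assign_tasks(records: Sequence[str], workers: List[str],
                 split: Callable[[Sequence[str], int], List[List[str]]] = split_evenly
                 ) -> Dict[str, List[str]]:
    """Master side: split the input, then give one block to each Worker."""
    blocks = split(records, len(workers))
    return dict(zip(workers, blocks))


plan = assign_tasks(["r1", "r2", "r3", "r4", "r5"], ["worker-1", "worker-2"])
print(plan)  # {'worker-1': ['r1', 'r2', 'r3'], 'worker-2': ['r4', 'r5']}
```

Passing the split function as a parameter mirrors the description, in which the User Program supplies the function that the Master node calls.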
Steps 301 to 304 are identical to the existing processing and are not described again here.
The log information includes: status information of the running node, and the state and key data of the service processing flow. The status information of the running node may be the network condition, CPU, memory, disk space, the execution state of a Map task or Reduce task, and so on. The state and key data of the service processing flow depend on the specific service flow being processed; for example, for a service flow that uses MapReduce to send a weather-forecast short message to 100,000 mobile phone users in parallel, the state and key data of the service processing flow include the phone numbers of the mobile phone users.
In practical applications, a global information monitoring function entity may be added to the MapReduce system, and this entity records the log information of the Master node and the Worker nodes. The identity information of the global information monitoring function entity is configured in advance on all nodes in the MapReduce system; it may be an Internet Protocol (IP) address, an identity (ID) number, or any other information that identifies the entity. All nodes in the MapReduce system can upload their own log information to the global information monitoring function entity according to this identity information. After the overall task is started, the Master node and the Worker nodes upload their own log information to the global information monitoring function entity in real time;
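One possible shape for such a log record, using the short-message example, is sketched below. The field names (`cpu_percent`, `sent_numbers`, and so on) and the helper `remaining_numbers` are hypothetical; the text states only which kinds of information a record contains, not its layout.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NodeStatus:
    """Status information of the running node (illustrative fields)."""
    cpu_percent: float
    memory_mb: int
    disk_free_mb: int
    task_state: str            # e.g. "map-running" or "reduce-done"


@dataclass
class LogRecord:
    """One uploaded record: node identity, node status, and key service data."""
    node_id: str
    task_id: str
    status: NodeStatus
    sent_numbers: List[str] = field(default_factory=list)  # numbers already messaged


def remaining_numbers(all_numbers: List[str], record: LogRecord) -> List[str]:
    """Numbers a replacement Worker still has to message, in original order."""
    done = set(record.sent_numbers)
    return [n for n in all_numbers if n not in done]


record = LogRecord("worker-1", "sms-1",
                   NodeStatus(35.0, 2048, 10240, "map-running"),
                   sent_numbers=["13800000001", "13800000002"])
todo = remaining_numbers(["13800000001", "13800000002", "13800000003"], record)
print(todo)  # ['13800000003']
```

Because the phone numbers already handled are part of the record, a replacement node can skip them rather than resend every message.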
To make the whole logging process reliable, after the overall task is started, the Master node sends the identity information of the Worker nodes to which it has assigned the overall task to the global information monitoring function entity, which receives and saves it. When a Worker node uploads log information, the global information monitoring function entity decides, according to the saved Worker node identity information, whether to save that log information. Specifically, if the node identity information carried in the log information of the Worker node is consistent with the saved identity information of the Worker node, the log information of the Worker node is saved; otherwise, it is discarded. The identity information of a Worker node is information that identifies the Worker node, such as an IP address, a machine name, or an ID;
The global information monitoring function entity may take the form of a log database, or of a collection of one or more nodes;
The Worker node here refers to the set of all Worker nodes executing the task.
Step 202: when a Worker node executing the task fails, a new Worker node obtains the recorded log information of the faulty Worker node and, according to the log information, continues processing the service flow of the faulty Worker node from the breakpoint at which the fault occurred; and/or, when the Master node executing the task fails, a new Master node, after being started, obtains the recorded log information of the faulty Master node and, according to the log information, continues processing the service flow of the faulty Master node from the breakpoint at which the fault occurred;
Here, the Master node learns that a Worker node executing the task has failed through heartbeat detection between itself and the Worker nodes. After a Worker node executing the task fails, the Master node may select a node as the new Worker node according to the load of the other nodes in the MapReduce system, that is, by the automatic load balancing of the existing MapReduce system. The new Worker node may be a healthy Worker node that is already executing the task, or a healthy Worker node that is not executing the task;
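Heartbeat-based fault detection and load-based selection of a replacement can be sketched as below. The timeout value, the timestamps, and the "lowest load wins" rule are assumptions made for illustration; the text says only that heartbeats reveal the fault and that selection follows the existing load balancing.

```python
from typing import Dict, List


def find_failed(last_heartbeat: Dict[str, float], now: float,
                timeout: float) -> List[str]:
    """Workers whose most recent heartbeat is older than the timeout."""
    return sorted(w for w, t in last_heartbeat.items() if now - t > timeout)


def pick_replacement(load: Dict[str, int], failed: List[str]) -> str:
    """Choose the healthy node with the lowest load (assumed policy)."""
    healthy = {w: n for w, n in load.items() if w not in failed}
    return min(healthy, key=lambda w: healthy[w])


heartbeats = {"worker-1": 100.0, "worker-2": 91.0, "worker-3": 99.5}
failed = find_failed(heartbeats, now=101.0, timeout=5.0)
replacement = pick_replacement({"worker-1": 4, "worker-2": 1, "worker-3": 2}, failed)
print(failed)       # ['worker-2']
print(replacement)  # worker-3
```

The faulty node is excluded before the load comparison, so its stale low load cannot make it the chosen replacement.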
After the task is started, the User Program of the MapReduce system starts a timer. If the task execution result has not been returned by the Master node when the timer expires, the Master node is considered to have failed and a new node must be selected as the Master node. The selection may be based on the load of the other nodes in the MapReduce system, that is, on the automatic load balancing of the existing MapReduce system. The new Master node may be a Master node executing the task, or another Master node not executing the task;
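The timer described above can be approximated with a standard-library event that is waited on with a timeout. This is only a sketch: `wait_for_result` is a hypothetical helper, the durations are arbitrary, and a `threading.Timer` stands in for a healthy Master node delivering its result.

```python
import threading


def wait_for_result(result_ready: threading.Event, timeout_s: float) -> bool:
    """Return True if the Master's result arrived before the timer expired."""
    return result_ready.wait(timeout_s)


# A healthy Master answers after 50 ms, well inside the 1 s timer.
done = threading.Event()
threading.Timer(0.05, done.set).start()
arrived = wait_for_result(done, timeout_s=1.0)

# A faulty Master never answers, so the timer expires.
never = threading.Event()
missed = wait_for_result(never, timeout_s=0.1)

print(arrived, missed)  # True False
```

A `False` return is the point at which the User Program would treat the Master node as faulty and select a new one.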
The new Worker node obtains the log information of the faulty Worker node specifically as follows:
the Master node sends task execution information to the new Worker node;
after receiving the information, the new Worker node sends query request information to the global information monitoring function entity;
after receiving the query request information, the global information monitoring function entity looks up, according to the query request information, the log information of the faulty Worker node that it has saved, and returns the log information of the faulty Worker node to the new Worker node;
The task execution information includes the task data source, the task ID, the identity information of the faulty Worker node, and so on;
The query request information includes the task ID, the identity information of the faulty Worker node, and so on; the identity information of the faulty Worker node may be an IP address, a machine name, an ID, or any other information that identifies the faulty Worker node;
The new Master node obtains the log information of the faulty Master node specifically as follows:
the new Master node sends query request information to the global information monitoring function entity;
after receiving the query request information, the global information monitoring function entity looks up, according to the query request information, the log information of the faulty Master node that it has saved, and returns the log information of the faulty Master node to the new Master node;
The query request information includes information from which the log records of the faulty Master node can be identified, such as the identity information of the faulty Master node or the task ID; the identity information of the faulty Master node may be an IP address, a machine name, an ID, or any other information that identifies the faulty Master node.
When a Worker node finishes its task, it calls an external interface to upload its own log information to the global information monitoring function entity, and at the same time notifies the Master node that the task it is responsible for has been processed. After receiving the notification, the Master node marks the task of that Worker node as completed. Once it has received completion notifications from all Worker nodes, the Master node ends the overall task.
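The Master node's completion bookkeeping can be sketched with a small tracker. The class `TaskTracker` and its method are hypothetical names; the sketch covers only marking tasks as completed and detecting that every Worker has reported.

```python
from typing import Dict, Iterable


class TaskTracker:
    """Master-side record of which Worker tasks have finished."""

    def __init__(self, worker_ids: Iterable[str]) -> None:
        self._done: Dict[str, bool] = {w: False for w in worker_ids}

    def mark_done(self, worker_id: str) -> bool:
        """Record a completion notice; return True once every task is done."""
        if worker_id not in self._done:
            raise KeyError(f"unknown worker: {worker_id}")
        self._done[worker_id] = True
        return all(self._done.values())


tracker = TaskTracker(["worker-1", "worker-2"])
first = tracker.mark_done("worker-1")
second = tracker.mark_done("worker-2")
print(first, second)  # False True
```

The `True` return after the last notice is the point at which the Master node would end the overall task.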
To implement the above method, the present invention further provides a global information monitoring entity for obtaining log information. The global information monitoring entity includes a storage module and a query module, wherein:
the storage module is configured to store, in real time after the overall task is started, the log information uploaded by the Master node and the Worker nodes executing the task;
the query module is configured to: when a Worker node executing the task fails and query request information sent by a new Worker node has been received, look up, according to the query request information, the log information of the failed Worker node stored by the storage module, and return the log information of the failed Worker node to the new Worker node; and/or, when the Master node executing the task fails and query request information sent by a new Master node has been received, look up, according to the query request information, the log information of the failed Master node stored by the storage module, and return the log information of the failed Master node to the new Master node.
The global information monitoring entity may further include a determination module configured to, when a Worker node uploads log information, determine whether the identity information of the node carried in the Worker node's log information is consistent with the stored identity information of Worker nodes; if it is consistent, the Worker node's log information is stored; otherwise, the Worker node's log information is discarded.
The storage module is further configured to store the identity information of Worker nodes.
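The three modules described above (storage, query and determination) might be combined as in the sketch below. It is an in-memory illustration under stated assumptions: the class `GlobalInfoMonitor`, its method names and the dictionary-based record layout are hypothetical, and a real entity would need persistent storage and a network interface.

```python
from collections import defaultdict


class GlobalInfoMonitor:
    """Hypothetical in-memory stand-in for the global information monitoring entity."""

    def __init__(self):
        self._worker_ids = set()        # storage module: known Worker identities
        self._logs = defaultdict(list)  # storage module: node_id -> log records

    def register_worker(self, node_id):
        """Store a Worker node's identity so later uploads can be checked against it."""
        self._worker_ids.add(node_id)

    def upload_worker_log(self, record):
        """Determination module: keep the record only if its node identity is known."""
        if record["node_id"] not in self._worker_ids:
            return False  # identity mismatch: the record is discarded
        self._logs[record["node_id"]].append(record)
        return True

    def upload_master_log(self, record):
        """Master logs are stored without the Worker identity check."""
        self._logs[record["node_id"]].append(record)

    def query(self, failed_node_id):
        """Query module: return a copy of the failed node's stored log records."""
        return list(self._logs.get(failed_node_id, []))
```

Returning a copy from `query` keeps a replacement node from modifying the stored records by accident.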
The present invention also provides a system for implementing parallel computing. As shown in FIG. 4, the system includes a global information monitoring functional entity 41, a first Worker node 42, and/or a first Master node 43, wherein:
the global information monitoring functional entity 41 is configured to record, after the overall task is started, the log information of the Worker nodes and the Master node executing the task;
the first Worker node 42 is configured to, when a Worker node executing the task fails, obtain the log information of the failed Worker node from the global information monitoring functional entity 41 and, according to the log information, continue processing the failed Worker node's business process from the breakpoint at which the failure occurred; and/or,
the first Master node 43 is configured to, when the Master node executing the task fails, obtain, after it has itself started, the log information of the failed Master node from the global information monitoring functional entity 41 and, according to the log information, continue processing the failed Master node's business process from the breakpoint at which the failure occurred.
It should be noted that the first Worker node 42 may be a healthy Worker node that is already executing the task, or a healthy Worker node that is not executing the task; the first Master node 43 may be a Master node executing the task, or another Master node that is not executing the task.
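Continuing from the breakpoint can be illustrated with the sketch below. The text does not specify the log format, so the example assumes each record carries a chunk index and a status; the function name `resume_from_breakpoint` and these fields are hypothetical.

```python
def resume_from_breakpoint(chunks, failed_node_log, process):
    """Process only the chunks the failed node had not logged as finished.

    Assumes (hypothetically) that each log record looks like
    {"chunk": <index>, "status": "started" | "done"}.
    """
    finished = {rec["chunk"] for rec in failed_node_log if rec["status"] == "done"}
    results = {}
    for index, chunk in enumerate(chunks):
        if index in finished:
            continue  # completed before the failure, so it is not redone
        results[index] = process(chunk)
    return results
```

A chunk that was started but never logged as done is processed again, which is the conservative choice when the log cannot prove that the work finished.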
The system may further include a User Program unit, a second Master node and a second Worker node, wherein:
the User Program unit is configured to, after starting the overall task by calling the client library, select the second Master node as the Master node executing the task and send the input data source to be processed to the second Master node;
the second Master node is configured to, after receiving the input data source to be processed from the User Program unit, split the input data source, then select the Worker nodes that will execute the task and assign to each of them the task it is to execute;
the second Worker node is configured to execute the assigned task after receiving the task assigned by the second Master node.
It should be noted that the second Worker node may be a set of one or more Worker nodes executing the task.
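The splitting and assignment performed by the second Master node could look like the sketch below. The text does not state how the input is split or how Worker nodes are chosen; equal-sized chunks and round-robin assignment are assumptions made for illustration, as are the function names.

```python
def split_input(data, num_chunks):
    """Split a list into at most num_chunks contiguous chunks of near-equal size."""
    if num_chunks <= 0:
        raise ValueError("num_chunks must be positive")
    if not data:
        return []
    size = -(-len(data) // num_chunks)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def assign_chunks(chunks, worker_ids):
    """Assign chunk indices to Worker nodes in round-robin order (an assumed policy)."""
    if not worker_ids:
        raise ValueError("at least one worker is required")
    assignment = {worker_id: [] for worker_id in worker_ids}
    for index in range(len(chunks)):
        assignment[worker_ids[index % len(worker_ids)]].append(index)
    return assignment
```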
The second Master node is further configured to send information instructing execution of the task to the first Worker node 42 when the second Worker node fails;
the first Worker node 42 is specifically configured to, after receiving the information sent by the second Master node, send query request information to the global information monitoring functional entity 41 and receive the log information of the second Worker node returned by the global information monitoring functional entity 41;
the global information monitoring functional entity 41 is further configured to, after receiving the query request information sent by the first Worker node 42, look up, according to the query request information, the log information of the second Worker node that it has stored, and return the log information of the second Worker node to the first Worker node 42.
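The message order in this Worker failover can be traced with the sketch below. It is only a sketch: the monitoring entity is reduced to a plain dictionary of stored logs, the event tuples are an invented notation, and `fail_over_worker` is a hypothetical name.

```python
def fail_over_worker(stored_logs, failed_id, replacement_id):
    """Trace the three messages exchanged when a replacement Worker takes over.

    stored_logs is a plain dict (node_id -> list of records) standing in for
    the global information monitoring functional entity.
    """
    events = []
    # 1. The Master tells the replacement Worker to execute the failed Worker's task.
    events.append(("master->worker", replacement_id, f"take over {failed_id}"))
    # 2. The replacement Worker asks the monitoring entity for the failed Worker's log.
    events.append(("worker->monitor", replacement_id, f"query {failed_id}"))
    records = list(stored_logs.get(failed_id, []))
    # 3. The monitoring entity returns the stored log records.
    events.append(("monitor->worker", replacement_id, f"{len(records)} records"))
    return records, events
```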
The first Master node 43 is specifically configured to, when the second Master node fails, send query request information to the global information monitoring functional entity 41 and receive the log information of the second Master node returned by the global information monitoring functional entity 41;
the global information monitoring functional entity 41 is further configured to, after receiving the query request information sent by the first Master node 43, look up, according to the query request information, the log information of the second Master node that it has stored, and return the log information of the second Master node to the first Master node 43.
The second Worker node is further configured to upload its own log information to the global information monitoring functional entity 41 in real time after the overall task is started;
the second Master node is further configured to upload its own log information to the global information monitoring functional entity 41 in real time after the overall task is started;
the global information monitoring functional entity 41 is further configured to store the log information of the second Worker node and the second Master node.
The global information monitoring functional entity 41 is further configured to, before storing the log information of the second Worker node and the second Master node, determine whether the identity information of the node carried in the second Worker node's log information is consistent with the stored identity information of Worker nodes; if it is consistent, the log information of the second Worker node is stored; if it is not, the log information of the second Worker node is discarded.
The above are merely preferred embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A method for implementing parallel computing, wherein the method comprises:
after an overall task is started, recording log information of the Worker nodes and the Master node executing the task;
when a Worker node executing the task fails, a new Worker node obtaining the recorded log information of the failed Worker node and, according to the log information, continuing to process the business process of the failed Worker node from the breakpoint at which the failure occurred; and/or, when the Master node executing the task fails, a new Master node, after being started, obtaining the recorded log information of the failed Master node and, according to the log information, continuing to process the business process of the failed Master node from the breakpoint at which the failure occurred.
2. The method according to claim 1, wherein the new Worker node obtaining the log information of the failed Worker node comprises:
the Master node executing the task sending information about executing the task to the new Worker node;
the new Worker node, after receiving the information, sending query request information to a global information monitoring functional entity;
the global information monitoring functional entity, after receiving the query request information, looking up, according to the query request information, the log information of the failed Worker node that it has stored, and returning the log information of the failed Worker node to the new Worker node.
3. The method according to claim 1, wherein the new Master node obtaining the log information of the failed Master node comprises:
the new Master node sending query request information to the global information monitoring functional entity;
the global information monitoring functional entity, after receiving the query request information, looking up, according to the query request information, the log information of the failed Master node that it has stored, and returning the log information of the failed Master node to the new Master node.
4. The method according to claim 1, 2 or 3, wherein, before recording the log information of the Worker nodes and the Master node executing the task, the method further comprises:
a User Program, after starting the overall task by calling a client library, selecting a node as the Master node executing the task, and then sending the input data source to be processed to the selected Master node;
the Master node executing the task, after receiving the input data source to be processed, splitting the input data source;
the Master node executing the task selecting the Worker nodes that will execute the task, and assigning to each of them the task it is to execute;
the Worker nodes executing the task reading the split data blocks and executing the assigned tasks.
5. The method according to claim 4, wherein recording the log information of the Worker nodes and the Master node executing the task comprises:
after the overall task is started, the Worker nodes and the Master node executing the task uploading their own log information to the global information monitoring functional entity in real time;
the global information monitoring functional entity storing the log information of the Worker nodes and the Master node executing the task.
6. The method according to claim 5, wherein, before the global information monitoring functional entity stores the log information of the Worker nodes and the Master node executing the task, the method further comprises:
the global information monitoring functional entity, after receiving log information uploaded by a Worker node, determining whether the identity information of the node carried in the Worker node's log information is consistent with the stored identity information of Worker nodes; if it is consistent, storing the Worker node's log information; if it is not, discarding the Worker node's log information.
7. A method for obtaining log information, wherein the method comprises:
after an overall task is started, storing in real time the log information of the Master node and the Worker nodes executing the task;
when a Worker node executing the task fails and query request information sent by a new Worker node has been received, looking up the stored log information of the failed Worker node according to the query request information, and returning the log information of the failed Worker node to the new Worker node; and/or, when the Master node executing the task fails and query request information sent by a new Master node has been received, looking up the stored log information of the failed Master node according to the query request information, and returning the log information of the failed Master node to the new Master node.
8. The method according to claim 7, wherein, before storing in real time the log information of the Master node and the Worker nodes executing the task, the method further comprises:
determining whether the identity information of the node carried in a Worker node's log information is consistent with the stored identity information of Worker nodes; if it is consistent, storing the Worker node's log information; if it is not, discarding the Worker node's log information.
9. A global information monitoring entity for obtaining log information, wherein the global information monitoring entity comprises a storage module and a query module, wherein:
the storage module is configured to store, in real time after an overall task is started, the log information uploaded by the Master node and the Worker nodes executing the task;
the query module is configured to: when a Worker node executing the task fails and query request information sent by a new Worker node has been received, look up, according to the query request information, the log information of the failed Worker node stored by the storage module, and return the log information of the failed Worker node to the new Worker node; and/or, when the Master node executing the task fails and query request information sent by a new Master node has been received, look up, according to the query request information, the log information of the failed Master node stored by the storage module, and return the log information of the failed Master node to the new Master node.
10. The global information monitoring entity according to claim 9, wherein the global information monitoring entity further comprises a determination module configured to, when a Worker node uploads log information, determine whether the identity information of the node carried in the Worker node's log information is consistent with the stored identity information of Worker nodes; if it is consistent, store the Worker node's log information; if it is not, discard the Worker node's log information.
11. The global information monitoring entity according to claim 9 or 10, wherein the storage module is further configured to store the identity information of Worker nodes.
12. A system for implementing parallel computing, wherein the system comprises a global information monitoring functional entity, a first Worker node, and/or a first Master node, wherein:
the global information monitoring functional entity is configured to record, after an overall task is started, the log information of the Worker nodes and the Master node executing the task;
the first Worker node is configured to, when a Worker node executing the task fails, obtain the log information of the failed Worker node from the global information monitoring functional entity and, according to the log information, continue processing the business process of the failed Worker node from the breakpoint at which the failure occurred; and/or,
the first Master node is configured to, when the Master node executing the task fails, obtain, after it has itself started, the log information of the failed Master node from the global information monitoring functional entity and, according to the log information, continue processing the business process of the failed Master node from the breakpoint at which the failure occurred.
13. The system according to claim 12, wherein the system further comprises a User Program unit, a second Master node and a second Worker node, wherein:
the User Program unit is configured to, after starting the overall task by calling a client library, select the second Master node as the Master node executing the task and send the input data source to be processed to the second Master node;
the second Master node is configured to, after receiving the input data source to be processed from the User Program unit, split the input data source, then select the Worker nodes that will execute the task and assign to each of them the task it is to execute;
the second Worker node is configured to execute the assigned task after receiving the task assigned by the second Master node.
14. The system according to claim 13, wherein:
the second Master node is further configured to send information about executing the task to the first Worker node when the second Worker node fails;
the first Worker node is configured to, after receiving the information sent by the second Master node, send query request information to the global information monitoring functional entity and receive the log information of the second Worker node returned by the global information monitoring functional entity;
the global information monitoring functional entity is further configured to, after receiving the query request information sent by the first Worker node, look up, according to the query request information, the log information of the second Worker node that it has stored, and return the log information of the second Worker node to the first Worker node.
15. The system according to claim 13, wherein:
the first Master node is configured to, when the second Master node fails, send query request information to the global information monitoring functional entity and receive the log information of the second Master node returned by the global information monitoring functional entity;
the global information monitoring functional entity is further configured to, after receiving the query request information sent by the first Master node, look up, according to the query request information, the log information of the second Master node that it has stored, and return the log information of the second Master node to the first Master node.
16. The system according to claim 13, 14 or 15, wherein:
the second Worker node is further configured to upload its own log information to the global information monitoring functional entity in real time after the overall task is started;
the second Master node is further configured to upload its own log information to the global information monitoring functional entity in real time after the overall task is started;
the global information monitoring functional entity is further configured to store the log information of the second Worker node and the second Master node.
17. The system according to claim 16, wherein:
the global information monitoring functional entity is further configured to, before storing the log information of the second Worker node and the second Master node, determine whether the identity information of the node carried in the second Worker node's log information is consistent with the stored identity information of Worker nodes; if it is consistent, store the log information of the second Worker node; if it is not, discard the log information of the second Worker node.
PCT/CN2011/072818 2010-08-27 2011-04-14 Method and system for realizing parallel computing WO2012024937A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201010269332.1A CN102385536B (en) 2010-08-27 2010-08-27 Method and system for realization of parallel computing
CN201010269332.1 2010-08-27

Publications (1)

Publication Number Publication Date
WO2012024937A1 true WO2012024937A1 (en) 2012-03-01

Family

ID=45722853

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2011/072818 WO2012024937A1 (en) 2010-08-27 2011-04-14 Method and system for realizing parallel computing

Country Status (2)

Country Link
CN (1) CN102385536B (en)
WO (1) WO2012024937A1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103136363A (en) * 2013-03-14 2013-06-05 曙光信息产业(北京)有限公司 Inquiry processing method and cluster data base system
CN104461752B (en) * 2014-11-21 2018-09-18 浙江宇视科技有限公司 A kind of multimedia distributed task processing method of two-stage failure tolerant
CN106789141B (en) 2015-11-24 2020-12-11 阿里巴巴集团控股有限公司 Gateway equipment fault processing method and device
CN107644382A (en) * 2016-07-22 2018-01-30 平安科技(深圳)有限公司 Policy information statistical method and device
CN108959063A (en) * 2017-05-25 2018-12-07 北京京东尚科信息技术有限公司 A kind of method and apparatus that program executes
CN108600008B (en) * 2018-04-24 2021-12-17 致云科技有限公司 Server management method, server management device and distributed system
CN110673936B (en) * 2019-09-18 2022-05-17 平安科技(深圳)有限公司 Breakpoint continuous operation method and device for arrangement service, storage medium and electronic equipment
CN113596148A (en) * 2021-07-27 2021-11-02 上海商汤科技开发有限公司 Data transmission method, system, device, computing equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060085792A1 (en) * 2004-10-15 2006-04-20 Microsoft Corporation Systems and methods for a disaster recovery system utilizing virtual machines running on at least two host computers in physically different locations
CN101145946A (en) * 2007-09-17 2008-03-19 中兴通讯股份有限公司 A fault tolerance cluster system and method based on message log
CN101770402A (en) * 2008-12-29 2010-07-07 中国移动通信集团公司 Map task scheduling method, equipment and system in MapReduce system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007172334A (en) * 2005-12-22 2007-07-05 Internatl Business Mach Corp <Ibm> Method, system and program for securing redundancy of parallel computing system
US8230070B2 (en) * 2007-11-09 2012-07-24 Manjrasoft Pty. Ltd. System and method for grid and cloud computing
CN101764835B (en) * 2008-12-25 2012-09-05 华为技术有限公司 Task allocation method and device based on MapReduce programming framework


Also Published As

Publication number Publication date
CN102385536B (en) 2014-06-11
CN102385536A (en) 2012-03-21


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11819312

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 11819312

Country of ref document: EP

Kind code of ref document: A1