WO2022007552A1 - Management method and configuration method for processing nodes, and related apparatus

Management method and configuration method for processing nodes, and related apparatus

Info

Publication number
WO2022007552A1
WO2022007552A1 PCT/CN2021/097956 CN2021097956W WO2022007552A1 WO 2022007552 A1 WO2022007552 A1 WO 2022007552A1 CN 2021097956 W CN2021097956 W CN 2021097956W WO 2022007552 A1 WO2022007552 A1 WO 2022007552A1
Authority
WO
WIPO (PCT)
Prior art keywords
processing
node
task
data processing
nodes
Prior art date
Application number
PCT/CN2021/097956
Other languages
English (en)
French (fr)
Inventor
贺俊华
刘保原
曾翔
余伯平
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2022007552A1
Priority to US17/743,837 (US20220269564A1)

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1415Saving, restoring, recovering or retrying at system level
    • G06F11/142Reconfiguring to eliminate the error
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2023Failover techniques
    • G06F11/203Failover techniques using migration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0709Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751Error or fault detection not based on redundancy
    • G06F11/0754Error or fault detection not based on redundancy by exceeding limits
    • G06F11/0757Error or fault detection not based on redundancy by exceeding limits by exceeding a time limit, i.e. time-out, e.g. watchdogs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2041Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant with more than one idle spare processing component
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • G06F9/5077Logical partitioning of resources; Management or configuration of virtualized resources
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083Techniques for rebalancing the load in a distributed system
    • G06F9/5088Techniques for rebalancing the load in a distributed system involving task migration
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06Management of faults, events, alarms or notifications
    • H04L41/0654Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0668Management of faults, events, alarms or notifications using network fault recovery by dynamic selection of recovery network elements, e.g. replacement by the most appropriate element after failure
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00Arrangements for monitoring or testing data switching networks
    • H04L43/08Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0823Errors, e.g. transmission errors
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00Arrangements for monitoring or testing data switching networks
    • H04L43/10Active monitoring, e.g. heartbeat, ping or trace-route
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues

Definitions

  • the present application relates to the field of cloud technology and big data, and in particular, to the management technology of processing nodes.
  • Big data refers to a collection of data that cannot be captured, managed, and processed by conventional software tools within a certain time frame; it is a massive, high-growth, and diversified information asset.
  • parallel computing refers to the process of using multiple computing resources to solve computing problems at the same time.
  • the basic idea of parallel computing is to use multiple processors to solve the same problem cooperatively, that is, the problem to be solved is decomposed into several parts, and each part is calculated by an independent processor. That is to say, for the same data processing task, the computer device can allow different processing nodes to process different parts of the data processing task, thereby realizing parallel computing of big data.
  • the embodiments of the present application provide a management method, apparatus, device, and storage medium for processing nodes, which can improve the reliability of data processing and ensure that data processing tasks can be successfully completed.
  • the technical solution is as follows:
  • a method for managing a processing node is provided, the method is executed by a computer device, and the method includes:
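  • in a case where it is detected that an abnormal processing node exists in the processing node cluster corresponding to a data processing task, acquiring abnormal state information of the abnormal processing node, where the processing node cluster includes multiple processing nodes, and the multiple processing nodes are used to perform the data processing task cooperatively;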
  • a method for configuring a processing node is provided, the method is executed by a computer device, and the method includes:
  • the processing node cluster includes multiple processing nodes, and the multiple processing nodes are used to perform the data processing task cooperatively; the auxiliary node is used, when an abnormal processing node exists in the processing node cluster, to perform the task in place of the abnormal processing node.
  • an apparatus for managing a processing node is provided, the apparatus is deployed on computer equipment, and the apparatus includes:
  • an information acquisition module configured to acquire abnormal state information of an abnormal processing node in a case where it is detected that the abnormal processing node exists in the processing node cluster corresponding to the data processing task, where the processing node cluster includes a plurality of processing nodes, and the plurality of processing nodes are used for cooperatively executing the data processing task;
  • a node enabling module configured to determine to enable an auxiliary node outside the processing node cluster to replace the abnormal processing node if the abnormal state information satisfies a condition;
  • a policy adjustment module configured to adjust an execution policy of the data processing task when it is determined to enable the auxiliary node, where the execution policy is used to indicate a processing mode for the data processing task;
  • a task determination module configured to determine, based on the execution policy, the data processing subtasks corresponding to the auxiliary node and the remaining processing nodes respectively, where the remaining processing nodes are the processing nodes in the processing node cluster other than the abnormal processing node;
  • an instruction sending module configured to send corresponding task execution instructions to the auxiliary node and the remaining processing nodes, where a task execution instruction is used to instruct the auxiliary node or a remaining processing node to execute the corresponding data processing subtask.
  • an apparatus for configuring a processing node is provided, the apparatus is deployed on computer equipment, and the apparatus includes:
  • the task acquisition module is used to acquire data processing tasks
  • an information determination module configured to determine task information corresponding to the data processing task, where the task information refers to information related to data processing conditions during the execution of the data processing task;
  • a node configuration module configured to configure, for the data processing task and according to the task information, a processing node cluster and auxiliary nodes outside the processing node cluster;
  • the processing node cluster includes multiple processing nodes, and the multiple processing nodes are used to perform the data processing task cooperatively; the auxiliary node is used, when an abnormal processing node exists in the processing node cluster, to perform the task in place of the abnormal processing node.
  • a computer device is provided, including a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the above-mentioned method for managing a processing node, or to implement the above-mentioned method for configuring a processing node.
  • a computer-readable storage medium is provided, where at least one instruction, at least one program, a code set, or an instruction set is stored in the storage medium, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the above-mentioned method for managing a processing node, or to implement the above-mentioned method for configuring a processing node.
  • a computer program product or computer program where the computer program product or computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the above-mentioned method for managing the processing node or the above-mentioned method for configuring the processing node.
  • when an abnormal processing node is detected, the abnormal state information of the abnormal processing node is acquired. If it is determined according to the abnormal state information to enable the auxiliary node, the execution policy of the data processing task is adjusted, and the data processing subtasks corresponding to the auxiliary node and the remaining processing nodes are re-determined, where the remaining processing nodes are the processing nodes in the processing node cluster other than the abnormal processing node. In this way, the abnormal processing node is replaced by the auxiliary node, so that the auxiliary node and the remaining processing nodes can cooperate to perform the data processing task. This avoids failure of the data processing task caused by the abnormal processing node, improves the reliability of data processing, and ensures that the data processing task can be successfully completed.
  • the method can prevent low data processing efficiency caused by unreasonable allocation of data processing tasks, and is beneficial to improve the data processing efficiency of each node, thereby ensuring the processing efficiency of the entire data processing task.
  • FIG. 1 is a schematic diagram of a data processing system provided by an embodiment of the present application.
  • FIG. 2 is a flowchart of a method for managing a processing node provided by an embodiment of the present application.
  • FIG. 3 is a flowchart of a configuration method of a processing node provided by an embodiment of the present application.
  • FIG. 4 is a block diagram of an apparatus for managing a processing node provided by an embodiment of the present application.
  • FIG. 5 is a block diagram of an apparatus for managing a processing node provided by another embodiment of the present application.
  • FIG. 6 is a block diagram of an apparatus for configuring a processing node provided by an embodiment of the present application.
  • FIG. 7 is a block diagram of an apparatus for configuring a processing node provided by another embodiment of the present application.
  • FIG. 8 is a structural block diagram of a computer device provided by an embodiment of the present application.
  • FIG. 1 shows a schematic diagram of a data processing system provided by an embodiment of the present application.
  • the data processing system may include a policy control node 10 , a management node 20 , a processing node 30 and an auxiliary node 40 .
  • the policy control node 10 is used to determine the number of management nodes 20 , processing nodes 30 and auxiliary nodes 40 . Generally, after acquiring the data processing task, the policy control node 10 analyzes the data to be processed by the data processing task, and determines the task information corresponding to the data processing task.
  • the task information refers to the data processing information during the execution of the data processing task, such as data processing amount, task processing duration, parallel computing acceleration ratio, and the like.
  • the policy control node 10 determines the number of processing nodes 30 according to the data processing amount and the task processing duration. Then, the number of auxiliary nodes 40 is determined according to the number of processing nodes 30 and the parallel calculation acceleration ratio. After that, the number of the management nodes 20 is determined according to the management capability of the management node 20 for the processing nodes 30 and the number of the processing nodes 30 .
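  • The sizing order described above (processing nodes first, then auxiliary nodes, then management nodes) can be sketched as follows. This is a minimal illustration, not the method of the application: the text names the inputs but gives no formulas, so the function `plan_cluster`, the parameter `node_throughput`, and in particular the rule that derives the auxiliary-node count from parallel efficiency are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class ClusterPlan:
    processing_nodes: int
    auxiliary_nodes: int
    management_nodes: int


def plan_cluster(data_amount, deadline_s, node_throughput,
                 speedup_ratio, nodes_per_manager):
    """Hypothetical sizing rules; every formula here is an assumption."""
    # Processing nodes: enough aggregate throughput (items per second per
    # node) to process data_amount items within the task processing duration.
    processing = max(1, math.ceil(data_amount / (node_throughput * deadline_s)))

    # Auxiliary nodes (ASSUMED rule): parallel efficiency is speedup divided
    # by node count; the lower the efficiency, the more spares are reserved.
    efficiency = min(1.0, speedup_ratio / processing)
    auxiliary = max(1, math.ceil(processing * (1.0 - efficiency)))

    # Management nodes: each one can manage at most nodes_per_manager
    # processing nodes.
    management = math.ceil(processing / nodes_per_manager)
    return ClusterPlan(processing, auxiliary, management)


plan = plan_cluster(data_amount=1000, deadline_s=10, node_throughput=10,
                    speedup_ratio=5, nodes_per_manager=4)
```

  • With these illustrative inputs the plan contains 10 processing nodes, 5 auxiliary nodes, and 3 management nodes; a real policy control node would substitute its own sizing rules.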
  • after determining the number of the above-mentioned management nodes 20, processing nodes 30 and auxiliary nodes 40, the policy control node 10 sends data processing information to the management node 20, where the data processing information includes the above-mentioned data processing task and the number of management nodes 20, processing nodes 30 and auxiliary nodes 40 corresponding to the data processing task.
  • the management node 20 is used to manage the processing node 30 and the auxiliary node 40 .
  • the management node 20 may send a task execution instruction to the processing node 30 and the auxiliary node 40, where the task execution instruction is used to control the processing node 30 and the auxiliary node 40 to perform corresponding operations.
  • the management node 20 controls the processing node 30 to perform the above data processing tasks. For example, after receiving the above data processing information, the management node 20 divides the data processing task according to the number of processing nodes 30, determines the data processing subtask corresponding to each processing node 30, and then sends a task execution instruction to the processing node 30, where the task execution instruction is used to control the processing node 30 to execute the corresponding data processing subtask.
  • the management node 20 controls the auxiliary node 40 to perform the above data processing tasks instead of the processing node 30 .
  • the management node 20 obtains the abnormal cause of the processing node 30, determines the repair time of the processing node 30 according to the abnormal cause, and, when the repair time is greater than the threshold, determines to enable the auxiliary node 40 to replace the processing node 30 in the abnormal state and perform the data processing task in cooperation with the remaining processing nodes. The management node 20 then sends a task execution instruction to the auxiliary node 40, where the task execution instruction is used to control the auxiliary node 40 to execute the corresponding data processing subtask.
  • the management node 20 may start multiple auxiliary nodes 40 at the same time.
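  • The repair-time check described above can be sketched as a small lookup followed by a threshold comparison. The cause names, the repair-time estimates in `REPAIR_TIME_S`, and the function name `should_enable_auxiliary` are hypothetical values chosen for illustration; the application does not specify how a repair time is derived from an abnormal cause.

```python
# Hypothetical estimates (in seconds) of how long each abnormal cause takes
# to repair. Both the cause names and the numbers are assumptions.
REPAIR_TIME_S = {
    "process_crash": 30,
    "network_jitter": 60,
    "memory_corruption": 3600,
}


def should_enable_auxiliary(cause, threshold_s):
    """Return True when the estimated repair time exceeds the threshold.

    Unknown causes are treated as taking indefinitely long to repair, so an
    auxiliary node is enabled for them (an assumption of this sketch).
    """
    repair_time_s = REPAIR_TIME_S.get(cause, float("inf"))
    return repair_time_s > threshold_s
```

  • For example, with a threshold of 300 seconds, `should_enable_auxiliary("memory_corruption", 300)` is true, while `should_enable_auxiliary("process_crash", 300)` is false, so the management node would wait for the crashed node to recover.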
  • the processing nodes 30 are used to perform data processing tasks.
  • multiple processing nodes 30 may cooperatively process the same data processing task, and the multiple processing nodes 30 may form a processing node cluster.
  • the number of processing nodes 30 is determined by the foregoing policy control node 10, which is not limited in this embodiment of the present application.
  • the processing node 30 executes the corresponding data processing subtask according to the task execution instruction, and periodically sends a measurement report to the management node 20, where the measurement report includes task processing information and node state information.
  • the task processing information is used to indicate the task processing progress of the processing node 30, and the node state information is used to indicate the working state of the processing node 30.
  • the management node 20 determines whether the processing node 30 is in an abnormal state according to the node state information in the measurement report.
  • the auxiliary node 40 is used to perform data processing tasks instead of the processing node 30 when the processing node 30 is in an abnormal state.
  • the auxiliary node 40 may execute the corresponding data processing subtask according to the task execution instruction. If the auxiliary node 40 has not received the above task execution instruction, it periodically sends a heartbeat detection packet to the management node 20, where the heartbeat detection packet is used to indicate to the management node 20 that the auxiliary node 40 is in a state in which tasks can be assigned to it. If the auxiliary node 40 has received the above task execution instruction, it periodically sends the above measurement report to the management node 20, where the measurement report includes task processing information and node state information.
  • the task processing information is used to indicate the task processing progress of the auxiliary node 40, and the node state information is used to indicate the working state of the auxiliary node 40.
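  • The two reporting modes of the auxiliary node can be sketched as one method that picks the message to send on each period. The class `AuxiliaryNode`, its field names, and the dictionary layout of the messages are assumptions for illustration; the application does not define a message format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AuxiliaryNode:
    node_id: str
    subtask: Optional[str] = None  # None until a task execution instruction arrives
    progress: float = 0.0          # fraction of the subtask completed

    def assign(self, subtask: str) -> None:
        """Handle a task execution instruction from the management node."""
        self.subtask = subtask
        self.progress = 0.0

    def periodic_message(self) -> dict:
        """Build the message sent to the management node on each period."""
        if self.subtask is None:
            # Idle: a heartbeat tells the management node that this
            # auxiliary node is available for assignment.
            return {"type": "heartbeat", "node_id": self.node_id,
                    "assignable": True}
        # Working: report progress and state like a processing node does.
        return {
            "type": "measurement_report",
            "node_id": self.node_id,
            "task_processing_info": {"subtask": self.subtask,
                                     "progress": self.progress},
            "node_state_info": {"state": "normal"},
        }


aux = AuxiliaryNode("aux-1")
idle_message = aux.periodic_message()
aux.assign("subtask-3")
busy_message = aux.periodic_message()
```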
  • the above data processing task may be a processing task for big data.
  • big data refers to a collection of data that cannot be captured, managed, and processed by conventional software tools within a certain time frame. It is a massive, high-growth, and diversified information asset that requires new processing modes to provide stronger decision-making, insight, and process optimization capabilities. With the advent of the cloud era, big data is attracting more and more attention, and it requires special technologies to efficiently process large amounts of data within a tolerable elapsed time. Technologies applicable to big data include massively parallel processing databases, data mining, distributed file systems, distributed databases, cloud computing platforms, the Internet, and scalable storage systems.
  • the data processing system shown in FIG. 1 above can constitute a big data processing system in cloud technology.
  • the big data processing system may include multiple servers.
  • the server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms.
  • the foregoing policy control node 10, management node 20, processing node 30, and auxiliary node 40 may be deployed on different servers, which are not limited in this embodiment of the present application.
  • FIG. 2 shows a flowchart of a method for managing a processing node provided by an embodiment of the present application.
  • the execution body of each step may be a computer device, for example, the computer device is the management node 20 in the data processing system of FIG. 1 .
  • the method may include the following steps (steps 201 to 205):
  • Step 201, in a case where it is detected that an abnormal processing node exists in the processing node cluster corresponding to the data processing task, obtain abnormal state information of the abnormal processing node.
  • Data processing tasks refer to tasks that process a certain part of data, such as data traversal, data storage, data transformation, and data visualization. It can be understood that the data processing task may be a task of processing all types of data, or may be a task of processing big data, which is not limited in this embodiment of the present application.
  • a processing node cluster refers to a node cluster including a plurality of processing nodes.
  • the multiple processing nodes are used to perform the above data processing tasks cooperatively.
  • the management node may divide the data processing task according to the processing node, and determine the data processing subtasks corresponding to different processing nodes.
  • the management node divides data processing tasks according to the number of processing nodes. For example, after acquiring the above data processing task, the management node divides the data processing task equally according to the data processing amount corresponding to the data processing task and the number of processing nodes, and determines the data processing subtask corresponding to each processing node. Each subtask has the same data processing amount, which keeps the workload of the processing nodes balanced.
  • the management node divides the data processing tasks according to the working capabilities of the processing nodes. For example, after acquiring the above data processing task, the management node divides the data processing task according to the data processing amount corresponding to the data processing task and the data processing efficiency of each processing node, and determines the data processing subtask corresponding to each processing node, to avoid overloading the processing nodes.
  • the management node divides the data processing tasks according to the work types of the processing nodes. For example, after obtaining the above data processing task, the management node divides the data processing task according to the data processing type corresponding to the data processing task and the work type of each processing node, and determines the data processing subtask corresponding to each processing node, to ensure the processing efficiency of each processing node.
  • taking a data conversion task as an example, and assuming that the data conversion task includes conversions between A data and B data, the task of converting A data to B data is assigned to the first processing node, and the task of converting B data to A data is assigned to the second processing node.
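  • The three division strategies above can be sketched as follows, assuming a task is measured as a count of data items and a subtask is simply the number of items (or the conversion type) given to a node. The function names, the node identifiers, and the rule for distributing leftover items are hypothetical choices for this example.

```python
def split_equally(total_items, node_ids):
    """Divide by node count; the first nodes absorb any remainder."""
    base, extra = divmod(total_items, len(node_ids))
    return {node: base + (1 if i < extra else 0)
            for i, node in enumerate(node_ids)}


def split_by_efficiency(total_items, efficiency):
    """Divide in proportion to each node's data processing efficiency."""
    total_eff = sum(efficiency.values())
    shares = {node: int(total_items * eff / total_eff)
              for node, eff in efficiency.items()}
    # Rounding down can leave a few items unassigned; hand them to the
    # fastest nodes (an assumption of this sketch).
    leftover = total_items - sum(shares.values())
    fastest_first = sorted(efficiency, key=efficiency.get, reverse=True)
    for node in fastest_first[:leftover]:
        shares[node] += 1
    return shares


def split_by_work_type(subtask_types, node_work_types):
    """Give each typed subtask to the node whose work type matches it."""
    node_for_type = {work_type: node
                     for node, work_type in node_work_types.items()}
    return {node_for_type[t]: t for t in subtask_types}


equal = split_equally(10, ["node-1", "node-2", "node-3"])
weighted = split_by_efficiency(10, {"node-1": 2, "node-2": 1})
typed = split_by_work_type(["A_to_B", "B_to_A"],
                           {"node-1": "A_to_B", "node-2": "B_to_A"})
```

  • The resulting dictionaries play the role of the correspondence between processing nodes and data processing subtasks; the last call mirrors the A data and B data conversion example.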
  • the management node may generate a task list, and send the task list to the auxiliary node.
  • the task list records the correspondence between the processing nodes and the data processing subtasks.
  • the abnormal processing node refers to a processing node in an abnormal working state, for example, the processing node cannot work normally due to memory corruption of the processing node.
  • the abnormal state information is used to indicate the reason why a processing node is in an abnormal working state.
  • the management node acquires the abnormal state information of the abnormal processing node when detecting that there is an abnormal processing node in the processing node cluster corresponding to the data processing task.
  • the management node may determine whether a processing node is an abnormal processing node by checking whether the measurement report of the processing node is abnormal.
  • the above measurement report refers to a report used to report the working status and task processing progress of the processing node to the management node. Based on this, the above step 201 includes the following sub-steps:
  • the measurement report includes task processing information and node status information.
  • the task processing information is used to indicate the task processing progress of the processing node, and the node state information is used to indicate the working state of the processing node.
  • the processing node may periodically send a measurement report to the management node during the process of executing the above-mentioned data processing subtask.
  • the management node receives the measurement report, and detects the working state and task processing progress of the processing node according to the measurement report.
  • if the management node determines that the measurement report from a certain processing node, such as a target processing node, is abnormal, it determines that the target processing node is an abnormal processing node.
  • the abnormality in the measurement report includes at least one of the following: the task processing information included in the measurement report is abnormal, the node state information included in the measurement report is abnormal, and the measurement report is not received for a set period of time.
  • the management node after receiving the measurement report of the target processing node, analyzes the task processing information in the measurement report, and determines that the task processing information is abnormal, such as the task processing progress corresponding to the task processing information is lower than the expected target value, so that the target processing node is determined to be an exception processing node.
  • the management node after receiving the measurement report of the target processing node, analyzes the node status information in the measurement report, and determines that the node status information is abnormal, for example, the node status information indicates that the target processing node is abnormal.
  • the node is in a low-rate working state, so that the target processing node is determined to be an exception processing node.
  • If the management node does not receive the measurement report of the target processing node for a set period of time, it determines that the measurement report of the target processing node is abnormal, and further determines that the target processing node is an abnormal processing node. In one possible implementation, if the management node does not receive the measurement report of the target processing node within a first duration, it determines that the measurement report is abnormal and that the target processing node is an abnormal processing node. In another possible implementation, if the management node does not receive the measurement report of the target processing node within the first duration, it sends a report acquisition request to the target processing node, where the report acquisition request is used to request a measurement report from the target processing node.
  • If the target processing node feeds back the corresponding measurement report within a second duration, the management node determines whether the target processing node is an abnormal processing node according to the measurement report; if the target processing node does not feed back the corresponding measurement report to the management node within the second duration, the management node determines that the target processing node is an abnormal processing node.
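  • The three abnormality conditions described above can be sketched as a single check. This is a minimal illustration only: the field names `task_progress` and `low_rate`, the function name `is_abnormal`, and the parameters `expected_progress` and `report_timeout` are assumptions introduced here and do not come from the application.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MeasurementReport:
    # Hypothetical fields: the source only says a report carries task
    # processing information and node status information.
    task_progress: float  # fraction of the subtask completed, 0.0 to 1.0
    low_rate: bool        # True if the node reports a low-rate working state


def is_abnormal(report: Optional[MeasurementReport],
                seconds_since_last_report: float,
                expected_progress: float,
                report_timeout: float) -> bool:
    """Return True if the node should be treated as an abnormal processing node."""
    # Condition 3: no measurement report received for the set period of time.
    if seconds_since_last_report > report_timeout:
        return True
    # No report yet, but the set period has not elapsed: keep waiting.
    if report is None:
        return False
    # Condition 1: task processing progress is below the expected target value.
    if report.task_progress < expected_progress:
        return True
    # Condition 2: node status information indicates a low-rate working state.
    return report.low_rate
```

  • In this sketch the timeout is checked first, so a stale report is never used to declare a silent node healthy.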
  • the management node obtains the abnormal state information of the abnormal processing node when it is determined that there is an abnormal processing node in the processing node cluster corresponding to the data processing task.
  • When sending the measurement report to the management node, the exception processing node also sends the corresponding abnormal state information; for example, the abnormal state information is carried in the above measurement report.
  • After determining the exception handling node, the management node sends a state information acquisition request to the exception handling node, where the state information acquisition request is used to request the abnormal state information corresponding to the exception handling node.
  • the exception processing node sends the corresponding exception status information to the management node according to the status information acquisition request.
  • Step 202 if the abnormal state information satisfies the condition, it is determined to enable an auxiliary node outside the processing node cluster to replace the abnormal processing node.
  • the condition refers to the judgment condition used to judge whether to enable the auxiliary node.
  • After acquiring the abnormal state information of the abnormal processing node, the management node can analyze the abnormal state information, and if the abnormal state information satisfies the condition, it determines to enable an auxiliary node outside the processing node cluster to replace the abnormal processing node.
  • the management node determines the abnormal cause of the abnormal processing node according to the abnormal state information.
  • the abnormal cause may be included in the abnormal state information, and after acquiring the abnormal state information, the management node directly determines the abnormal cause from the abnormal state information.
  • After the management node obtains the abnormal cause, it determines the repair cost of the abnormal processing node according to the abnormal cause.
  • the repair cost refers to the resources consumed by repairing the above abnormal processing node, such as repair time, difficulty of the repair operation, and data required for the repair. When the repair cost is greater than the threshold, it is determined that repairing the exception handling node is not cost-effective, and the auxiliary node is enabled to replace the exception handling node.
  • In some cases, the management node cannot obtain the above abnormal state information. If the exception processing node is a processing node that cannot send information, the management node may determine, without acquiring the abnormal state information, that the exception processing node cannot be repaired, and determine to enable an auxiliary node to replace the exception processing node. For example, if the exception handling node cannot send the measurement report, the management node may assume that the exception handling node cannot send the abnormal state information either, and determine to enable the auxiliary node to replace the exception handling node.
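  • The replacement decision described above can be sketched as follows. The cause names, the cost table `REPAIR_COST_BY_CAUSE` (expressed here as repair time in seconds), and the rule that an unknown cause is treated as unrepairable are all assumptions made for illustration; the application does not specify how the repair cost is computed.

```python
from typing import Optional

# Hypothetical mapping from abnormal cause to estimated repair cost (seconds).
REPAIR_COST_BY_CAUSE = {
    "process_crash": 30.0,
    "disk_failure": 3600.0,
}


def should_enable_auxiliary(cause: Optional[str], cost_threshold: float) -> bool:
    """Decide whether an auxiliary node should replace the abnormal node."""
    # No abnormal state information could be obtained (for example, the node
    # cannot send anything): treat the node as unrepairable and replace it.
    if cause is None:
        return True
    # Assumption: a cause missing from the table is treated as infinitely costly.
    cost = REPAIR_COST_BY_CAUSE.get(cause, float("inf"))
    # Replace only when the repair cost is greater than the threshold.
    return cost > cost_threshold
```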
  • the management node may perform a repair configuration on the exception processing node, and convert the exception processing node into an auxiliary node.
  • the above step 202 includes the following sub-steps:
  • the repair instruction is used to repair the above exception handling node. After determining to enable the auxiliary node to replace the above exception processing node, the management node repairs the exception processing node while the auxiliary node performs the data processing task, and sends a repair instruction to the exception processing node.
  • the above repair instruction includes a repair operation for repairing the exception processing node.
  • the management node determines a repair operation for the exception processing node based on the abnormal cause, generates a repair instruction according to the repair operation, and sends the repair instruction to the exception processing node.
  • the exception processing node performs self-repair according to the repair operation in the repair instruction.
  • the repairing operation includes repairing data.
  • the exception processing node can acquire repair data according to the repair instruction, and perform a corresponding operation on the repair data according to the operation instruction in the repair operation, so as to perform self-repair.
  • In order to prevent failure of transmission of the repair data, the above repair instruction may instead include an acquisition address and/or identification information of the repair data.
  • the exception processing node can obtain the acquisition address and/or identification information of the repair data from the repair instruction, obtain the repair data according to the acquisition address and/or identification information, and then perform the corresponding operation on the repair data according to the operation instruction in the repair operation, so as to perform self-repair.
  • the repair completion response is used to indicate that the above exception processing node has completed self-repair. If the exception processing node successfully completes its own repair according to the above repair instruction, it sends a repair completion response to the management node. Correspondingly, when the management node receives the repair completion response from the exception handling node, it determines that the exception handling node has recovered from the abnormal state to the normal state.
  • the normal state refers to a state in which the exception processing node can work normally.
  • the configuration information is used to configure the exception handling node as an auxiliary node.
  • the management node sends configuration information to the exception handling node when it is determined that the exception handling node is restored to a normal state.
  • the exception processing node is configured according to the configuration information, and the exception processing node is converted into an auxiliary node.
  • configuration information may include all configuration information for the auxiliary node, and may also include information corresponding to part of the auxiliary node that needs to be reconfigured relative to the exception handling node, which is not limited in this embodiment of the present application.
  • the configuration completion response is used to indicate that the above exception handling node has completed the configuration. If the exception processing node successfully completes the configuration according to the above configuration information, it may send a configuration completion response to the management node. Correspondingly, when the management node receives a configuration completion response from the exception handling node, it determines that the exception handling node has been converted into an auxiliary node.
  • After the exception processing node is converted into an auxiliary node, it may periodically send a heartbeat detection packet to the management node, where the heartbeat detection packet is used to indicate to the management node that the node is in a state in which tasks can be assigned to it.
  • By repairing and configuring the exception processing node, the exception processing node is converted into an auxiliary node, which replenishes the number of auxiliary nodes after an auxiliary node has been enabled and ensures that the processing node system has enough auxiliary nodes.
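  • The repair and reconfiguration flow above can be viewed as a small state machine driven by the two responses. The following is a minimal sketch; the state names and the message strings `repair_complete` and `config_complete` are hypothetical labels for the repair completion response and the configuration completion response.

```python
from enum import Enum


class NodeState(Enum):
    ABNORMAL = "abnormal"
    NORMAL = "normal"
    AUXILIARY = "auxiliary"


# Allowed transitions, keyed by (current state, response received).
TRANSITIONS = {
    # Repair completion response: abnormal state -> normal state.
    (NodeState.ABNORMAL, "repair_complete"): NodeState.NORMAL,
    # Configuration completion response: normal node -> auxiliary node.
    (NodeState.NORMAL, "config_complete"): NodeState.AUXILIARY,
}


def next_state(state: NodeState, message: str) -> NodeState:
    """Return the state after a response; unexpected responses change nothing."""
    return TRANSITIONS.get((state, message), state)
```

  • Keeping the transitions in a table makes the ordering explicit: a node cannot become an auxiliary node before its repair has been confirmed.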
  • Step 203 adjusting the execution strategy of the data processing task when it is determined that the auxiliary node is enabled.
  • the execution policy is used to indicate the processing mode for the data processing task, such as the division mode for the data processing task, the execution mode for the data processing task, and the like.
  • the management node may determine data processing subtasks corresponding to the auxiliary node or the remaining processing nodes respectively according to the execution policy, and the remaining processing nodes are processing nodes other than the exception processing nodes in the processing node cluster.
  • the management node adjusts the execution strategy of the above data processing task when it determines to enable the auxiliary node, so as to ensure that the data processing task can be successfully completed.
  • the number of times of policy adjustment of the data processing task can be set.
  • the execution strategy adjustment count indicates how many times the execution strategy of the above data processing task has been adjusted.
  • the above step 203 includes the following sub-steps:
  • the number of times threshold refers to the maximum upper limit of the above-mentioned execution policy adjustment times, that is, the maximum adjustment times corresponding to the execution policy adjustment times.
  • the threshold value of the number of times may be a value set by the designer according to experience.
  • the adjustable state refers to the state in which the execution policy corresponding to the data processing task can be adjusted, and the non-adjustable state refers to the state in which the execution policy corresponding to the data processing task cannot be adjusted.
  • After adjusting the execution strategy of the data processing task, the management node records the number of times the execution strategy of the data processing task has been adjusted. If the number of adjustments is equal to the times threshold, the execution strategy of the data processing task is switched from the adjustable state to the non-adjustable state.
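  • The adjustment counter and the adjustable/non-adjustable switch can be sketched as follows. The class and attribute names are assumptions chosen for illustration.

```python
class ExecutionPolicy:
    """Tracks how often the execution strategy of one task has been adjusted."""

    def __init__(self, max_adjustments: int) -> None:
        self.max_adjustments = max_adjustments  # the times threshold
        self.adjustments = 0                    # adjustments recorded so far
        self.adjustable = max_adjustments > 0   # adjustable vs non-adjustable

    def adjust(self) -> bool:
        """Record one adjustment; return False if the policy is locked."""
        if not self.adjustable:
            return False
        self.adjustments += 1
        # When the count equals the threshold, switch to the non-adjustable state.
        if self.adjustments == self.max_adjustments:
            self.adjustable = False
        return True
```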
  • Step 204 based on the execution strategy, determine the data processing subtasks corresponding to the auxiliary node and the remaining processing nodes respectively.
  • the management node determines the data processing subtasks corresponding to the auxiliary node and the remaining processing nodes respectively based on the execution strategy.
  • the management node may determine the number of auxiliary nodes according to the processing progress of the data processing task. Based on this, the above step 204 includes the following sub-steps:
  • the processing progress of the data processing task refers to the ratio between the completed part of the data processing task and the entire data processing task before the exception processing node appears.
  • the management node can obtain the processing progress of the data processing task according to the processing mode for the data processing task indicated by the execution strategy. Based on that processing progress and the data processing efficiency of the auxiliary nodes, it determines the number m of auxiliary nodes to enable, where m is a positive integer.
  • the number of auxiliary nodes to be enabled is determined to ensure the processing efficiency of the data processing task and the smooth completion of the data processing task.
  • the management node may divide the unprocessed part of the data processing task, and determine the data processing subtasks corresponding to the m auxiliary nodes and the remaining processing nodes respectively.
  • the management node may generate a new task list, and send the new task list to the remaining auxiliary nodes other than the determined auxiliary nodes.
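  • The progress calculation and the division of the unprocessed part can be sketched as follows. This assumes, purely for illustration, that the task is a sequence of equally sized work items processed in order and that the unprocessed items are dealt out round-robin; the application does not prescribe a division scheme, and the function names and node identifiers are hypothetical.

```python
from typing import Dict, List


def processing_progress(processed_items: int, total_items: int) -> float:
    """Ratio between the completed part and the entire data processing task."""
    return processed_items / total_items


def split_unprocessed(total_items: int, processed_items: int,
                      node_ids: List[str]) -> Dict[str, List[int]]:
    """Divide the unprocessed items among auxiliary and remaining nodes."""
    if not node_ids:
        raise ValueError("at least one node is required")
    shares: Dict[str, List[int]] = {node: [] for node in node_ids}
    # Items processed_items .. total_items - 1 are still unprocessed.
    for offset, item in enumerate(range(processed_items, total_items)):
        shares[node_ids[offset % len(node_ids)]].append(item)
    return shares
```

  • For example, with 10 items of which 4 are done, one auxiliary node and two remaining processing nodes each receive two of the six unprocessed items.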
  • Step 205 Send corresponding task execution instructions to the auxiliary node and the remaining processing nodes.
  • the task execution instruction is used to instruct the auxiliary node and the remaining processing nodes to execute corresponding data processing subtasks.
  • After determining the data processing subtasks, the management node sends the corresponding task execution instructions to the auxiliary node and the remaining processing nodes.
  • the auxiliary node and the remaining processing nodes execute the corresponding data processing subtask.
  • the auxiliary node and the remaining processing nodes may also periodically send measurement reports to the management node.
  • When an abnormal processing node is detected in the processing node cluster, the abnormal state information of the abnormal processing node is obtained. If it is determined according to the abnormal state information to enable the auxiliary node, the execution strategy of the data processing task is adjusted, and the data processing subtasks corresponding to the auxiliary node and the remaining processing nodes are re-determined, where the remaining processing nodes are the processing nodes other than the exception processing nodes in the processing node cluster.
  • the abnormal processing node is replaced by the auxiliary node, so that the auxiliary node and the remaining processing nodes can cooperate to perform the data processing task. This avoids failure of the data processing task caused by abnormal processing nodes, improves the reliability of data processing, and ensures that the data processing task can be successfully completed.
  • the method can prevent low data processing efficiency caused by unreasonable allocation of data processing tasks, and is beneficial to improve the data processing efficiency of each node, thereby ensuring the processing efficiency of the entire data processing task.
  • the management node determines the execution strategy according to the number of exception handling nodes. For example, the management node obtains the number of exception handling nodes after determining to enable the auxiliary node. If the number of exception handling nodes is greater than the number threshold, the management node determines to execute the task resharding strategy. In this case, the above execution strategy includes the task resharding strategy, which refers to a strategy for re-dividing the unprocessed part of the data processing task. After determining the task resharding strategy, the management node reshards the unprocessed part of the data processing task, and determines the data processing subtasks corresponding to the auxiliary nodes and the remaining processing nodes respectively.
  • the management node determines the execution strategy according to the task processing progress of the data processing subtask of the exception handling node. For example, after determining to enable the auxiliary node, the management node sends a progress query request to the exception handling node, where the progress query request is used to request to obtain the task processing progress of the exception handling node. If a data loss response from the exception processing node is received, it is determined to execute the secondary calculation strategy, and the data loss response is used to indicate that the processed data of the data processing subtask corresponding to the exception processing node is lost.
  • In this case, the above execution strategy includes a secondary calculation strategy, which refers to a strategy in which the auxiliary node re-executes the data processing subtask corresponding to the exception processing node.
  • After determining the secondary calculation strategy, the management node re-shards the unprocessed part of the data processing subtask of the exception processing node, and determines the data processing subtask corresponding to the auxiliary node.
  • the management node may also determine that the execution strategy is to execute the unprocessed part of the data processing subtask corresponding to the exception processing node by the auxiliary node.
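  • The three strategy choices described above can be sketched as one selection function. The application presents them as separate possible implementations; combining them in a fixed priority order, as done here, is an assumption, as are the enum and parameter names.

```python
from enum import Enum


class Strategy(Enum):
    TASK_RESHARDING = "task_resharding"
    SECONDARY_CALCULATION = "secondary_calculation"
    CONTINUE_ON_AUXILIARY = "continue_on_auxiliary"


def choose_strategy(abnormal_count: int, count_threshold: int,
                    data_lost: bool) -> Strategy:
    """Pick an execution strategy once an auxiliary node is to be enabled."""
    # More abnormal nodes than the number threshold: re-divide the
    # unprocessed part of the whole data processing task.
    if abnormal_count > count_threshold:
        return Strategy.TASK_RESHARDING
    # Data loss response received: the auxiliary node re-executes the
    # abnormal node's data processing subtask.
    if data_lost:
        return Strategy.SECONDARY_CALCULATION
    # Otherwise the auxiliary node only executes the unprocessed part of
    # the abnormal node's subtask.
    return Strategy.CONTINUE_ON_AUXILIARY
```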
  • FIG. 3 shows a flowchart of a configuration method of a processing node provided by an embodiment of the present application.
  • the execution subject of each step may be a computer device, and the computer device is, for example, the policy control node 10 in the data processing system of FIG. 1 .
  • the method may include the following steps (steps 301 to 303):
  • Step 301 acquiring a data processing task.
  • Data processing tasks refer to tasks that process a certain part of data, such as data traversal, data storage, data transformation, and data visualization.
  • the data processing task may be a task of processing all types of data, or may be a task of processing big data, which is not limited in this embodiment of the present application.
  • the policy control node may acquire the data processing tasks from the storage list of the data processing tasks.
  • the storage list is used to store various data processing tasks to avoid overloading the policy control node due to receiving too many data processing tasks.
  • Step 302 Determine task information corresponding to the data processing task.
  • the task information refers to the relevant information about the data processing situation during the execution of the data processing task.
  • the policy control node may configure corresponding processing nodes and auxiliary nodes for the above data processing tasks according to the task information.
  • the policy control node may analyze the data processing task to determine task information corresponding to the data processing task.
  • Step 303 configure a processing node cluster and auxiliary nodes other than the processing node cluster for the data processing task according to the task information.
  • a processing node cluster refers to a node cluster including a plurality of processing nodes.
  • the multiple processing nodes are used to perform the above data processing tasks cooperatively.
  • Auxiliary nodes are used to perform tasks instead of exception handling nodes when there are exception handling nodes in the processing node cluster.
  • a processing node cluster is a collection of multiple processing nodes configured for a data processing task. The multiple processing nodes cooperate to process the data processing task, and each processing node in the processing node cluster executes a part of the data processing task (that is, a data processing subtask).
  • the auxiliary node is an additionally configured node outside the processing node cluster. This part of the additionally configured nodes has the same or similar processing capabilities as the processing node, and can replace the processing node to perform tasks when an exception occurs on the processing node.
  • the number of configured auxiliary nodes may be one or multiple, which is not limited in this embodiment.
  • the above task information may include parallel computing acceleration ratio, task processing duration and data processing amount.
  • the parallel computing acceleration ratio is used to represent the parallel computing efficiency for the above data processing tasks;
  • the task processing duration refers to the time required to complete the execution of the above data processing task, and may be the expected duration calculated by the policy control node or the required duration specified for the data processing task, which is not limited in this embodiment of the present application;
  • the data processing volume refers to the data volume that needs to be processed corresponding to the data processing task.
  • the policy control node may determine the number of processing nodes and auxiliary nodes according to the aforementioned parallel computing acceleration ratio, task processing duration, and data processing amount. For example, after acquiring the above task information, the policy control node can determine the number of processing nodes according to the task processing duration and data processing volume, so as to ensure that the data processing task can be successfully completed within the task processing duration.
  • When the parallel computing acceleration ratio reaches an upper limit value, the ratio between the processing nodes and the auxiliary nodes is determined, so as to ensure the best processing efficiency of the data processing task. The upper limit value may be the maximum value of the parallel computing acceleration ratio, or may be a value set by the designer according to actual experience, which is not limited in this embodiment of the present application.
  • the policy control node determines the number of the auxiliary nodes according to the ratio between the processing nodes and the auxiliary nodes and according to the number of the above-mentioned processing nodes.
  • the policy control node may also determine the number of management nodes. After determining the number of processing nodes, the policy control node obtains the maximum management number of management nodes, which refers to the maximum number of processing nodes that a single management node can manage; then, according to the maximum management number and the number of processing nodes, Determine the number of management nodes. At this time, the number of management nodes and the number of processing nodes satisfy that no waiting period is required when the management node manages each processing node.
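  • The node-count calculations above can be sketched with integer ceiling division. The application does not give formulas, so everything here is an assumption for illustration: a uniform per-node throughput, a ratio expressed as "processing nodes per auxiliary node", and rounding up at each step. All function and parameter names are hypothetical.

```python
from typing import Tuple


def ceil_div(numerator: int, denominator: int) -> int:
    """Integer ceiling division for positive integers."""
    return -(-numerator // denominator)


def plan_nodes(data_volume: int, per_node_throughput: int, duration: int,
               processing_per_auxiliary: int,
               max_managed: int) -> Tuple[int, int, int]:
    """Return (processing nodes, auxiliary nodes, management nodes)."""
    # Enough processing nodes to finish data_volume within the task duration.
    processing = ceil_div(data_volume, per_node_throughput * duration)
    # Auxiliary nodes follow from the processing-to-auxiliary ratio.
    auxiliary = ceil_div(processing, processing_per_auxiliary)
    # Each management node manages at most max_managed processing nodes.
    management = ceil_div(processing, max_managed)
    return processing, auxiliary, management
```

  • Rounding up at each step errs on the side of slightly more nodes, which matches the stated goal of completing the task within the task processing duration.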
  • After determining the numbers of management nodes, processing nodes and auxiliary nodes, the policy control node sends data processing information to the management node, where the data processing information includes the above data processing task and the numbers of management nodes, processing nodes and auxiliary nodes corresponding to the data processing task.
  • the policy control node may also determine a specific processing node or auxiliary node according to the data processing type corresponding to the data processing task. For example, if the data processing task is data visualization, the policy control node selects a processing node and an auxiliary node with high efficiency for data visualization, and sends the identifiers of the processing node and the auxiliary node to the management node.
  • the number of each type of node is determined by the task information corresponding to the data processing task, that is, different numbers of nodes are configured for different data processing tasks. This avoids data processing tasks taking too long because there are too few nodes, and avoids resources being wasted because there are too many nodes. Unnecessary waste of resources is thus reduced while the reliability of data processing tasks is ensured.
  • each step is meant to be exemplary and explanatory, and in practical application, the execution subject of each step may be different from the description in this application.
  • the above-mentioned management node may execute the configuration method of the processing node corresponding to the policy control node; or another node may execute the steps of dividing data processing tasks, etc., which are not limited in this embodiment of the present application.
  • FIG. 4 shows a block diagram of an apparatus for managing a processing node provided by an embodiment of the present application.
  • the apparatus has the function of implementing the above-mentioned management method of the processing node, and the function may be implemented by hardware or by hardware executing corresponding software.
  • the apparatus may be computer equipment, or may be set in computer equipment.
  • the apparatus 400 may include: an information acquisition module 401 , a node enabling module 402 , a policy adjustment module 403 , a task determination module 404 and an instruction sending module 405 .
  • the information acquisition module 401 is configured to acquire abnormal state information of the abnormal processing node in the case of detecting that an abnormal processing node exists in the processing node cluster corresponding to the data processing task, and the processing node cluster includes a plurality of processing nodes, The plurality of processing nodes are used to perform the data processing tasks cooperatively.
  • the node enabling module 402 is configured to determine to enable an auxiliary node outside the processing node cluster to replace the abnormal processing node if the abnormal state information satisfies the condition.
  • the policy adjustment module 403 is configured to adjust the execution policy of the data processing task when it is determined to enable the auxiliary node, where the execution policy is used to indicate the processing mode for the data processing task.
  • the task determination module 404 is configured to determine, based on the execution strategy, data processing subtasks corresponding to the auxiliary node and the remaining processing nodes respectively, where the remaining processing nodes are the processing nodes other than the exception processing node in the processing node cluster.
  • An instruction sending module 405, configured to send a corresponding task execution instruction to the auxiliary node and the remaining processing node, where the task execution instruction is used to instruct the auxiliary node and the remaining processing node to execute the corresponding data processing subtask .
  • the policy adjustment module 403 is configured to obtain the number of the exception handling nodes, and if the number of the exception handling nodes is greater than the number threshold, determine to execute the task resharding strategy; the execution strategy includes the task resharding strategy, and the task resharding strategy refers to a strategy for re-dividing the unprocessed part of the data processing task.
  • the policy adjustment module 403 is configured to send a progress query request to the exception processing node, where the progress query request is used to request the task processing progress of the exception processing node; and, if a data loss response from the exception handling node is received, determine to execute the secondary calculation strategy, where the data loss response is used to indicate that the processed data of the data processing subtask corresponding to the exception handling node is lost. The execution strategy includes the secondary calculation strategy, which refers to a strategy in which the auxiliary node re-executes the data processing subtask corresponding to the exception processing node.
  • the task determination module 404 is configured to determine, based on the execution strategy and the processing progress of the data processing task, the activation number m of the auxiliary nodes, where m is a positive integer; The unprocessed part of the data processing task is divided, and the data processing subtasks corresponding to the m auxiliary nodes and the remaining processing nodes respectively are determined.
  • the apparatus 400 further includes: a cause determination module 406 , a duration determination module 407 and a node determination module 408 .
  • the cause determination module 406 is configured to determine the abnormal cause of the abnormal processing node according to the abnormal state information.
  • the duration determining module 407 is configured to determine the repair time of the exception processing node according to the abnormal cause.
  • a node determination module 408, configured to determine to enable the auxiliary node to replace the exception handling node if the repair time is greater than a threshold.
  • the apparatus 400 further includes: a times recording module 409 and a state switching module 410 .
  • the times recording module 409 is configured to record the times of adjustment of the execution strategy for the data processing task.
  • the state switching module 410 is configured to switch the execution strategy of the data processing task from an adjustable state to an unadjustable state in response to the execution strategy adjustment times being equal to a threshold value of the times.
  • the apparatus 400 further includes: a node repair module 411 .
  • the node repair module 411 is configured to: send a repair instruction to the exception processing node, where the repair instruction includes repair data for repairing the exception processing node; if a repair completion response from the exception processing node is received, determine that the exception processing node has recovered from the abnormal state to the normal state; send configuration information to the exception processing node, where the configuration information is used to configure the exception processing node as the auxiliary node; and, if a configuration completion response from the exception processing node is received, determine that the exception processing node has been converted into the auxiliary node.
  • the apparatus 400 further includes: an abnormality determination module 412 .
  • the abnormality determination module 412 is configured to obtain a measurement report sent by each processing node in the processing node cluster, where the measurement report includes task processing information and node status information; the task processing information is used to indicate the task processing progress, and the node status information is used to indicate the working status of the processing node. If the measurement report from the target processing node is abnormal, the target processing node is determined to be the abnormal processing node. The abnormality in the measurement report includes at least one of the following: the task processing information contained in the measurement report is abnormal, the node state information contained in the measurement report is abnormal, and the measurement report is not received for a set period of time.
  • When an abnormal processing node is detected in the processing node cluster, the abnormal state information of the abnormal processing node is obtained. If it is determined according to the abnormal state information to enable the auxiliary node, the execution strategy of the data processing task is adjusted, and the data processing subtasks corresponding to the auxiliary node and the remaining processing nodes are re-determined, where the remaining processing nodes are the processing nodes other than the exception processing nodes in the processing node cluster.
  • the abnormal processing node is replaced by the auxiliary node, so that the auxiliary node and the remaining processing nodes can cooperate to perform the data processing task. This avoids failure of the data processing task caused by abnormal processing nodes, improves the reliability of data processing, and ensures that the data processing task can be successfully completed.
  • the method can prevent low data processing efficiency caused by unreasonable allocation of data processing tasks, and is beneficial to improve the data processing efficiency of each node, thereby ensuring the processing efficiency of the entire data processing task.
  • FIG. 6 shows a block diagram of an apparatus for configuring a processing node provided by an embodiment of the present application.
  • the apparatus has the function of implementing the above-mentioned configuration method of the processing node, and the function can be implemented by hardware or by hardware executing corresponding software.
  • the apparatus may be computer equipment, or may be set in computer equipment.
  • the apparatus 600 may include: a task acquisition module 601 , an information determination module 602 and a node configuration module 603 .
  • the task acquisition module 601 is used for acquiring data processing tasks.
  • the information determination module 602 is configured to determine task information corresponding to the data processing task, where the task information refers to information related to data processing conditions during the execution of the data processing task.
  • a node configuration module 603, configured to configure a processing node cluster and auxiliary nodes other than the processing node cluster for the data processing task according to the task information.
  • the processing node cluster includes multiple processing nodes, and the multiple processing nodes are used to perform the data processing task cooperatively; the auxiliary node is used, when there is an abnormal processing node in the processing node cluster, to perform the task in place of the abnormal processing node.
  • the task information includes a parallel computing speedup ratio, a task processing duration, and a data processing amount;
  • the node configuration module 603 is configured to determine the number of processing nodes according to the task processing duration and the data processing amount; when the parallel computing speedup ratio reaches the upper limit value, determine the ratio between the processing nodes and the auxiliary nodes; and determine the number of auxiliary nodes from that ratio and the number of processing nodes.
  • the apparatus 600 further includes: a quantity acquisition module 604 and a quantity determination module 605 .
  • a quantity acquisition module 604, configured to obtain the maximum management quantity of the management nodes, where the maximum management quantity refers to the maximum number of processing nodes that a single management node can manage.
  • a quantity determination module 605, configured to determine the quantity of the management nodes according to the maximum management quantity and the quantity of the processing nodes.
  • when an abnormal processing node is detected in the processing node cluster corresponding to a data processing task, the abnormal state information of the abnormal processing node is obtained. If it is determined from the abnormal state information that the auxiliary node is to be enabled, the execution strategy of the data processing task is adjusted, and the data processing subtasks corresponding to the auxiliary node and the remaining processing nodes are re-determined, where the remaining processing nodes are the processing nodes in the processing node cluster other than the abnormal processing node.
  • the abnormal processing node is replaced by the auxiliary node, so that the auxiliary node and the remaining processing nodes cooperate to perform the data processing task. This avoids failure of the data processing task caused by an abnormal processing node, improves the reliability of data processing, and ensures that the data processing task can be completed.
  • the method also prevents the low data processing efficiency caused by unreasonable allocation of data processing tasks and helps improve the data processing efficiency of each node, thereby safeguarding the processing efficiency of the entire data processing task.
  • FIG. 8 shows a structural block diagram of a computer device provided by an embodiment of the present application.
  • the computer device can be used to implement the above-mentioned management method of the processing node, or to implement the function of the configuration method of the processing node. Specifically:
  • the computer device 800 includes a central processing unit (Central Processing Unit, CPU) 801, a system memory 804 including a random access memory (Random Access Memory, RAM) 802 and a read only memory (Read Only Memory, ROM) 803, and a system bus 805 connecting the system memory 804 and the central processing unit 801.
  • Computer device 800 also includes a basic input/output (I/O) system 806 that facilitates the transfer of information between various devices within the computer, and a mass storage device 807 for storing operating system 813, application programs 814, and other program modules 812.
  • Basic input/output system 806 includes a display 808 for displaying information and input devices 809 such as a mouse, keyboard, etc., for user input of information. Both the display 808 and the input device 809 are connected to the central processing unit 801 through the input output controller 810 connected to the system bus 805 .
  • the basic input/output system 806 may also include an input output controller 810 for receiving and processing input from various other devices such as a keyboard, mouse, or electronic stylus. Similarly, input output controller 810 also provides output to a display screen, printer, or other type of output device.
  • Mass storage device 807 is connected to central processing unit 801 through a mass storage controller (not shown) connected to system bus 805 .
  • Mass storage device 807 and its associated computer-readable media provide non-volatile storage for computer device 800 . That is, the mass storage device 807 may include a computer-readable medium (not shown) such as a hard disk or a CD-ROM (Compact Disc Read-Only Memory) drive.
  • Computer-readable media can include computer storage media and communication media.
  • Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.
  • Computer storage media include RAM, ROM, EPROM (Erasable Programmable Read Only Memory), EEPROM (Electrically Erasable Programmable Read Only Memory), flash memory or other solid-state storage technologies, CD-ROM, DVD (Digital Video Disc) or other optical storage, magnetic cassettes, magnetic tape, disk storage or other magnetic storage devices.
  • the system memory 804 and the mass storage device 807 described above may be collectively referred to as memory.
  • the computer device 800 may also operate through a network connection to a remote computer on a network, such as the Internet. That is, computer device 800 may be connected to network 812 through network interface unit 811 connected to system bus 805, or may use network interface unit 811 to connect to other types of networks or remote computer systems (not shown).
  • the memory also includes a computer program stored in the memory and configured to be executed by one or more processors to implement the above-described method of managing a processing node, or to implement the above-described method of configuring a processing node.
  • in an exemplary embodiment, a computer-readable storage medium is also provided, which stores at least one instruction, at least one piece of program, a code set, or an instruction set; when the at least one instruction, the at least one piece of program, the code set, or the instruction set is executed by a processor, the above-mentioned method for managing a processing node or the above-mentioned method for configuring a processing node is implemented.
  • the computer-readable storage medium may include: ROM (Read Only Memory, read-only memory), RAM (Random Access Memory, random access memory), SSD (Solid State Drives, solid-state hard disk), or an optical disk.
  • the random access memory may include ReRAM (Resistance Random Access Memory, resistive random access memory) and DRAM (Dynamic Random Access Memory, dynamic random access memory).
  • in an exemplary embodiment, a computer program product or computer program is also provided, comprising computer instructions stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the above-mentioned method for managing the processing node or the above-mentioned method for configuring the processing node.
  • "a plurality" as referred to herein means two or more.
  • "And/or" describes the association relationship of the associated objects and means that three kinds of relationships are possible; for example, A and/or B can mean that A exists alone, A and B exist at the same time, or B exists alone.
  • the character "/" generally indicates that the associated objects are in an "or" relationship.
  • the numbering of the steps described in this document only shows, by way of example, one possible execution sequence of the steps. In some other embodiments, the steps may be executed out of numerical order; for example, two steps with different numbers may be performed at the same time, or in the reverse of the order shown in the figure, which is not limited in the embodiments of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Health & Medical Sciences (AREA)
  • Cardiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Environmental & Geological Engineering (AREA)
  • Debugging And Monitoring (AREA)
  • Hardware Redundancy (AREA)

Abstract

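This application discloses a processing node management method, apparatus, device, and storage medium, belonging to the fields of cloud technology and big data. The method includes: when an abnormal processing node is detected in the processing node cluster corresponding to a data processing task, obtaining abnormal state information of the abnormal processing node; if the abnormal state information satisfies a condition, determining to enable an auxiliary node outside the processing node cluster to replace the abnormal processing node; when it is determined to enable the auxiliary node, adjusting the execution strategy of the data processing task; determining, based on the execution strategy, the data processing subtasks respectively corresponding to the auxiliary node and the remaining processing nodes; and sending corresponding task execution instructions to the auxiliary node and the remaining processing nodes. The solution provided in this application can improve the reliability of data processing and ensure that data processing tasks are completed smoothly and efficiently.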

Description

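Processing node management method, configuration method, and related apparatus
This application claims priority to Chinese patent application No. 202010652008.1, entitled "Data processing node management method, apparatus, device, and storage medium", filed with the China Patent Office on July 8, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the fields of cloud technology and big data, and in particular to techniques for managing processing nodes.
Background
Big data refers to data sets that cannot be captured, managed, and processed with conventional software tools within a given time frame. It is a massive, fast-growing, and diverse information asset that requires new processing models to deliver stronger decision-making power, insight, and process optimization capability.
In the related art, computer devices process big data by means of parallel computing, which is the process of using multiple computing resources at the same time to solve a computing problem. The basic idea of parallel computing is to use multiple processors to solve the same problem cooperatively: the problem is decomposed into several parts, and each part is computed by an independent processor. In other words, for a single data processing task, a computer device can have different processing nodes handle different parts of the task, thereby processing big data in parallel.
However, in the related art, if a processing node becomes abnormal while a data processing task is being processed, the task cannot be completed normally, so the reliability of data processing is low.
Summary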
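The embodiments of this application provide a processing node management method, apparatus, device, and storage medium that can improve the reliability of data processing and ensure that data processing tasks can be completed. The technical solution is as follows:
According to one aspect of the embodiments of this application, a processing node management method is provided. The method is executed by a computer device and includes:
when an abnormal processing node is detected in the processing node cluster corresponding to a data processing task, obtaining abnormal state information of the abnormal processing node, where the processing node cluster includes a plurality of processing nodes configured to cooperatively execute the data processing task;
if the abnormal state information satisfies a condition, determining to enable an auxiliary node outside the processing node cluster to replace the abnormal processing node;
when it is determined to enable the auxiliary node, adjusting an execution strategy of the data processing task, where the execution strategy indicates how the data processing task is to be processed;
determining, based on the execution strategy, the data processing subtasks respectively corresponding to the auxiliary node and the remaining processing nodes, where the remaining processing nodes are the processing nodes in the processing node cluster other than the abnormal processing node;
sending corresponding task execution instructions to the auxiliary node and the remaining processing nodes, where the task execution instructions instruct the auxiliary node and the remaining processing nodes to execute the corresponding data processing subtasks.
According to one aspect of the embodiments of this application, a processing node configuration method is provided. The method is executed by a computer device and includes:
obtaining a data processing task;
determining task information corresponding to the data processing task, where the task information is information about how data is processed during execution of the data processing task;
configuring, according to the task information, a processing node cluster for the data processing task and auxiliary nodes outside the processing node cluster;
where the processing node cluster includes a plurality of processing nodes configured to cooperatively execute the data processing task, and the auxiliary nodes are configured to execute tasks in place of an abnormal processing node when one exists in the processing node cluster.
According to one aspect of the embodiments of this application, a processing node management apparatus is provided. The apparatus is deployed on a computer device and includes:
an information acquisition module, configured to obtain, when an abnormal processing node is detected in the processing node cluster corresponding to a data processing task, abnormal state information of the abnormal processing node, where the processing node cluster includes a plurality of processing nodes configured to cooperatively execute the data processing task;
a node enabling module, configured to determine, if the abnormal state information satisfies a condition, to enable an auxiliary node outside the processing node cluster to replace the abnormal processing node;
a strategy adjustment module, configured to adjust, when it is determined to enable the auxiliary node, an execution strategy of the data processing task, where the execution strategy indicates how the data processing task is to be processed;
a task determination module, configured to determine, based on the execution strategy, the data processing subtasks respectively corresponding to the auxiliary node and the remaining processing nodes, where the remaining processing nodes are the processing nodes in the processing node cluster other than the abnormal processing node;
an instruction sending module, configured to send corresponding task execution instructions to the auxiliary node and the remaining processing nodes, where the task execution instructions instruct the auxiliary node and the remaining processing nodes to execute the corresponding data processing subtasks.
According to one aspect of the embodiments of this application, a processing node configuration apparatus is provided. The apparatus is deployed on a computer device and includes:
a task acquisition module, configured to obtain a data processing task;
an information determination module, configured to determine task information corresponding to the data processing task, where the task information is information about how data is processed during execution of the data processing task;
a node configuration module, configured to configure, according to the task information, a processing node cluster for the data processing task and auxiliary nodes outside the processing node cluster;
where the processing node cluster includes a plurality of processing nodes configured to cooperatively execute the data processing task, and the auxiliary nodes are configured to execute tasks in place of an abnormal processing node when one exists in the processing node cluster.
According to one aspect of the embodiments of this application, a computer device is provided. The computer device includes a processor and a memory; the memory stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the above processing node management method or the above processing node configuration method.
According to one aspect of the embodiments of this application, a computer-readable storage medium is provided. The storage medium stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the above processing node management method or the above processing node configuration method.
According to one aspect of the embodiments of this application, a computer program product or computer program is provided, which includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the above processing node management method or the above processing node configuration method.
The technical solution provided by the embodiments of this application can bring the following beneficial effects:
When an abnormal processing node is detected in the processing node cluster corresponding to a data processing task, abnormal state information of the abnormal processing node is obtained. If it is determined from the abnormal state information that an auxiliary node is to be enabled, the execution strategy of the data processing task is adjusted and the data processing subtasks of the auxiliary node and the remaining processing nodes (the processing nodes in the cluster other than the abnormal processing node) are re-determined. The auxiliary node thus replaces the abnormal processing node and executes the data processing task together with the remaining processing nodes, which avoids task failure caused by a node abnormality, improves the reliability of data processing, and ensures that the task can be completed. In addition, the method prevents the low efficiency caused by unreasonable task allocation, helps improve the data processing efficiency of each node, and thereby safeguards the processing efficiency of the whole task.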
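Brief Description of the Drawings
To describe the technical solutions in the embodiments of this application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are merely some embodiments of this application; a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a data processing system provided by an embodiment of this application;
FIG. 2 is a flowchart of a processing node management method provided by an embodiment of this application;
FIG. 3 is a flowchart of a processing node configuration method provided by an embodiment of this application;
FIG. 4 is a block diagram of a processing node management apparatus provided by an embodiment of this application;
FIG. 5 is a block diagram of a processing node management apparatus provided by another embodiment of this application;
FIG. 6 is a block diagram of a processing node configuration apparatus provided by an embodiment of this application;
FIG. 7 is a block diagram of a processing node configuration apparatus provided by another embodiment of this application;
FIG. 8 is a structural block diagram of a computer device provided by an embodiment of this application.
Detailed Description
To make the objectives, technical solutions, and advantages of this application clearer, the implementations of this application are described in further detail below with reference to the drawings.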
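Refer to FIG. 1, which shows a schematic diagram of a data processing system provided by an embodiment of this application. The data processing system may include a policy control node 10, a management node 20, processing nodes 30, and auxiliary nodes 40.
The policy control node 10 determines the numbers of management nodes 20, processing nodes 30, and auxiliary nodes 40. Typically, after obtaining a data processing task, the policy control node 10 analyzes the data that the task needs to process and determines the task information corresponding to the task. The task information is information about how data is processed during execution of the task, such as the data processing amount, the task processing duration, and the parallel computing speedup ratio.
In one possible implementation, the policy control node 10 determines the number of processing nodes 30 from the data processing amount and the task processing duration. It then determines the number of auxiliary nodes 40 from the number of processing nodes 30 and the parallel computing speedup ratio. Finally, it determines the number of management nodes 20 from the capacity of a management node 20 to manage processing nodes 30 and the number of processing nodes 30. In the embodiments of this application, after determining the numbers of management nodes 20, processing nodes 30, and auxiliary nodes 40, the policy control node 10 sends data processing information to the management node 20. The data processing information includes the data processing task and the numbers of management nodes 20, processing nodes 30, and auxiliary nodes 40 corresponding to the task.
The management node 20 manages the processing nodes 30 and the auxiliary nodes 40. It can send task execution instructions to the processing nodes 30 and auxiliary nodes 40 to control them to perform the corresponding operations. In one possible implementation, the management node 20 controls the processing nodes 30 to execute the data processing task. For example, after receiving the data processing information, the management node 20 divides the data processing task according to the number of processing nodes 30, determines the data processing subtask of each processing node 30, and then sends each processing node 30 a task execution instruction that controls it to execute its data processing subtask.
In another possible implementation, the management node 20 controls an auxiliary node 40 to execute the data processing task in place of a processing node 30. For example, when the management node 20 detects that a processing node 30 is in an abnormal state, it obtains the cause of the abnormality and determines from that cause how long the processing node 30 will take to repair. If the repair time exceeds a threshold, it determines to enable an auxiliary node 40 to replace the abnormal processing node 30 and execute the data processing task together with the remaining processing nodes, and sends the auxiliary node 40 a task execution instruction that controls it to execute the corresponding data processing subtask. Note that in the embodiments of this application, the management node 20 may enable multiple auxiliary nodes 40 at the same time.
The processing nodes 30 execute the data processing task. In the embodiments of this application, multiple processing nodes 30 can cooperatively process the same data processing task and can form a processing node cluster. The number of processing nodes 30 is determined by the policy control node 10 and is not limited in the embodiments of this application.
In one possible implementation, after receiving a task execution instruction, a processing node 30 executes the corresponding data processing subtask and periodically sends measurement reports to the management node 20. A measurement report includes task processing information, which indicates the task processing progress of the processing node 30, and node state information, which indicates its working state. Correspondingly, after receiving a measurement report, the management node 20 determines from the node state information in the report whether the processing node 30 is in an abnormal state.
An auxiliary node 40 executes the data processing task in place of a processing node 30 when that processing node is in an abnormal state. In the embodiments of this application, after receiving a task execution instruction, the auxiliary node 40 executes the corresponding data processing subtask. If the auxiliary node 40 has not received a task execution instruction, it periodically sends heartbeat packets to the management node 20 to indicate that it is available for task assignment. If it has received a task execution instruction, it periodically sends the management node measurement reports that include task processing information, which indicates the task processing progress of the auxiliary node 40, and node state information, which indicates its working state.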
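The two kinds of periodic message an auxiliary node sends can be sketched as follows. This is a minimal illustration only: the field names, the `periodic_message` helper, and the dictionary message format are assumptions made for the example and are not specified by the application.

```python
from typing import Optional


def periodic_message(node_id: str, subtask: Optional[str], progress: float) -> dict:
    """Build the message an auxiliary node sends on each reporting period.

    Hypothetical format: an idle node sends a heartbeat saying it can be
    assigned work; a node that has received a task execution instruction
    sends a measurement report with progress and node state instead.
    """
    if subtask is None:
        return {"type": "heartbeat", "node": node_id, "assignable": True}
    return {
        "type": "measurement_report",
        "node": node_id,
        "task_processing": {"subtask": subtask, "progress": progress},
        "node_state": "ok",
    }
```

An idle node would call `periodic_message("aux-1", None, 0.0)`; once a subtask is assigned, the same call site starts producing measurement reports.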
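In the embodiments of this application, the data processing task may be a task that processes big data. Big data refers to data sets that cannot be captured, managed, and processed with conventional software tools within a given time frame; it is a massive, fast-growing, and diverse information asset that requires new processing models to deliver stronger decision-making power, insight, and process optimization capability. With the arrival of the cloud era, big data has attracted growing attention. It requires special techniques to process large amounts of data effectively within a tolerable elapsed time. Technologies applicable to big data include massively parallel processing databases, data mining, distributed file systems, distributed databases, cloud computing platforms, the Internet, and scalable storage systems.
The data processing system shown in FIG. 1 can constitute a big data processing system in cloud technology. The big data processing system may include multiple servers. A server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms. The policy control node 10, management node 20, processing nodes 30, and auxiliary nodes 40 may be deployed on different servers, which is not limited in the embodiments of this application.
Note that the policy control node 10, management node 20, processing nodes 30, and auxiliary nodes 40 communicate with one another over a network.
The technical solution of this application is described in detail below with reference to several embodiments.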
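Refer to FIG. 2, which shows a flowchart of a processing node management method provided by an embodiment of this application. Each step may be executed by a computer device, for example the management node 20 in the data processing system of FIG. 1. The method may include the following steps (steps 201 to 205):
Step 201: when an abnormal processing node is detected in the processing node cluster corresponding to a data processing task, obtain abnormal state information of the abnormal processing node.
A data processing task is a task that processes some portion of data, such as data traversal, data storage, data conversion, or data visualization. The data processing task may process data of any type or may specifically process big data, which is not limited in the embodiments of this application.
A processing node cluster is a node cluster that includes multiple processing nodes. In the embodiments of this application, these processing nodes cooperatively execute the data processing task. Typically, after obtaining the data processing task, the management node divides it according to the processing nodes and determines the data processing subtask of each processing node.
In one possible implementation, the management node divides the data processing task according to the number of processing nodes. For example, after obtaining the data processing task, the management node divides it evenly based on the data processing amount of the task and the number of processing nodes, so that the data processing subtasks of all processing nodes have the same data processing amount and the workload is balanced.
In another possible implementation, the management node divides the data processing task according to the working capacity of the processing nodes. For example, after obtaining the data processing task, the management node divides it based on the data processing amount of the task and the data processing efficiency of each processing node, and determines the data processing subtask of each processing node so that no node is overloaded.
In yet another possible implementation, the management node divides the data processing task according to the work type of the processing nodes. For example, after obtaining the data processing task, the management node divides it based on the data processing type of the task and the work type of each processing node, and determines the data processing subtask of each processing node so that every node works efficiently. Taking a data conversion task as an example, suppose it involves conversion between data A and data B: the conversion from A to B is assigned to a first processing node, and the conversion from B to A is assigned to a second processing node.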
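The first two division schemes (by node count and by node capacity) can be sketched as below. This is a hedged illustration: the application does not give formulas, so the function names, the use of an item count as the "data processing amount", and the rule of handing leftover items to the most efficient nodes are all assumptions.

```python
def split_evenly(total_items: int, node_ids: list[str]) -> dict[str, int]:
    """Divide the work so every node gets the same amount (within one item)."""
    base, extra = divmod(total_items, len(node_ids))
    return {nid: base + (1 if i < extra else 0) for i, nid in enumerate(node_ids)}


def split_by_efficiency(total_items: int, efficiency: dict[str, float]) -> dict[str, int]:
    """Divide the work in proportion to each node's processing efficiency."""
    total_eff = sum(efficiency.values())
    shares = {nid: int(total_items * eff / total_eff) for nid, eff in efficiency.items()}
    # Rounding down can leave a few items unassigned; give them to the
    # most efficient nodes (an assumed tie-breaking rule).
    leftover = total_items - sum(shares.values())
    for nid in sorted(efficiency, key=efficiency.get, reverse=True)[:leftover]:
        shares[nid] += 1
    return shares
```

For example, `split_by_efficiency(100, {"a": 1.0, "b": 3.0})` assigns a quarter of the items to `a` and the rest to `b`, while `split_evenly` ignores capacity and only balances counts.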
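In the embodiments of this application, after determining the data processing subtask of each processing node, the management node may generate a task list and send it to the auxiliary nodes. The task list records the correspondence between processing nodes and data processing subtasks.
An abnormal processing node is a processing node that is not in a normal working state, for example a node that cannot work properly because its memory is damaged. The abnormal state information indicates why a processing node is not in a normal working state. In the embodiments of this application, when the management node detects an abnormal processing node in the processing node cluster corresponding to the data processing task, it obtains the abnormal state information of that node.
In one possible implementation, the management node determines whether a processing node is abnormal by checking whether its measurement report is abnormal. A measurement report is a report that tells the management node the working state and task processing progress of a processing node. On this basis, the following sub-steps precede step 201:
1. Obtain the measurement reports sent by the processing nodes in the processing node cluster.
A measurement report includes task processing information and node state information. The task processing information indicates the task processing progress of the processing node, and the node state information indicates its working state.
In the embodiments of this application, a processing node may periodically send measurement reports to the management node while executing its data processing subtask. Correspondingly, the management node receives the measurement reports and uses them to monitor the working state and task processing progress of the processing node.
2. If the measurement report from a target processing node is abnormal, determine that the target processing node is an abnormal processing node.
If the management node finds that the measurement report from some processing node, referred to as the target processing node, is abnormal, it determines that the target processing node is an abnormal processing node. A measurement report is abnormal in at least one of the following cases: the task processing information in the report is abnormal, the node state information in the report is abnormal, or no measurement report has been received within a set period.
In one possible implementation, after receiving the measurement report of the target processing node, the management node analyzes the task processing information in the report and finds it abnormal, for example because the reported task processing progress is below the expected target value, and therefore determines that the target processing node is an abnormal processing node.
In another possible implementation, after receiving the measurement report of the target processing node, the management node analyzes the node state information in the report and finds it abnormal, for example because it indicates that the target processing node is working at a low rate, and therefore determines that the target processing node is an abnormal processing node.
In yet another possible implementation, the management node has not received a measurement report from the target processing node within the set period, so it determines that the report is abnormal and that the target processing node is an abnormal processing node. In one variant, if the management node receives no measurement report from the target processing node within a first period, it directly determines that the report is abnormal and that the target processing node is an abnormal processing node. In another variant, if the management node receives no measurement report within the first period, it sends the target processing node a report request asking for a measurement report. If the target processing node returns a measurement report within a second period, the management node judges from that report whether the node is abnormal; if the target processing node does not return a measurement report within the second period, the management node determines that it is an abnormal processing node.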
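The three abnormality conditions above can be combined into a single check, sketched here under stated assumptions: the `MeasurementReport` fields, the `"ok"` state label, and the parameter names are hypothetical, and timestamps are plain seconds supplied by the caller.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MeasurementReport:
    node_id: str
    progress: float     # task processing information: fraction completed, 0.0 to 1.0
    state: str          # node state information, e.g. "ok" or "slow" (assumed labels)
    received_at: float  # time the management node received the report, in seconds


def is_abnormal(
    report: Optional[MeasurementReport],
    now: float,
    expected_progress: float,
    timeout_s: float,
) -> bool:
    """Return True if any of the three abnormality conditions holds."""
    # Condition 3: no report received within the set period.
    if report is None or now - report.received_at > timeout_s:
        return True
    # Condition 1: task progress is below the expected target value.
    if report.progress < expected_progress:
        return True
    # Condition 2: the node reports an abnormal working state.
    return report.state != "ok"
```

The two-stage variant described above (first period, then an explicit report request with a second period) would call this check again after the request, with the second period as `timeout_s`.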
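In the embodiments of this application, when the management node determines that there is an abnormal processing node in the processing node cluster corresponding to the data processing task, it obtains the abnormal state information of that node. In one possible implementation, the abnormal processing node sends the abnormal state information together with its measurement report; for example, the abnormal state information is contained in the measurement report. In another possible implementation, after identifying the abnormal processing node, the management node sends it a state information request asking for its abnormal state information, and the abnormal processing node sends the abnormal state information to the management node in response.
Step 202: if the abnormal state information satisfies a condition, determine to enable an auxiliary node outside the processing node cluster to replace the abnormal processing node.
The condition is the criterion used to decide whether to enable an auxiliary node. In the embodiments of this application, after obtaining the abnormal state information of the abnormal processing node, the management node analyzes it, and if it satisfies the condition, determines to enable an auxiliary node outside the processing node cluster to replace the abnormal processing node.
In one possible implementation, after obtaining the abnormal state information, the management node determines the cause of the abnormality from it. The cause may be included in the abnormal state information, in which case the management node reads it directly. The management node then determines the repair cost of the abnormal processing node from the cause. The repair cost is the resources consumed in repairing the node, such as the repair time, the difficulty of the repair operation, and the data needed for the repair. If the repair cost exceeds a threshold, the management node determines that repairing the abnormal processing node is not cost-effective and enables the auxiliary node to replace it.
In another possible implementation, the management node cannot obtain the abnormal state information. If the abnormal processing node is unable to send information, the management node may, without obtaining the abnormal state information, determine that the node cannot be repaired and enable an auxiliary node to replace it. For example, if the abnormal processing node cannot send measurement reports, the management node may assume that it cannot send abnormal state information either and determine to enable an auxiliary node to replace it.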
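The decision in step 202 can be sketched as a threshold test on repair time. The cause names and repair-time values in `REPAIR_SECONDS` are invented for illustration; the application only says that a repair cost is derived from the cause and compared with a threshold, and that an unreachable node is treated as not repairable.

```python
from typing import Optional

# Hypothetical mapping from abnormality cause to estimated repair time (seconds).
REPAIR_SECONDS = {
    "memory_failure": 3600.0,
    "process_hang": 30.0,
}


def should_enable_auxiliary(cause: Optional[str], threshold_s: float) -> bool:
    """Decide whether an auxiliary node should replace the abnormal node."""
    if cause is None:
        # No abnormal state information could be obtained (the node cannot
        # send anything), so the node is treated as not repairable.
        return True
    # Unknown causes are assumed to be too expensive to repair.
    return REPAIR_SECONDS.get(cause, float("inf")) > threshold_s
```

With a threshold of 300 seconds, a hung process would be repaired in place, while a memory failure or an unreachable node would trigger replacement.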
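Note that in the embodiments of this application, after determining to enable an auxiliary node to replace the abnormal processing node, the management node may repair and reconfigure the abnormal processing node to convert it into an auxiliary node. Optionally, the following sub-steps follow step 202:
1. Send a repair instruction to the abnormal processing node.
The repair instruction is used to repair the abnormal processing node. After determining to enable an auxiliary node to replace the abnormal processing node, the management node repairs the abnormal processing node while the auxiliary node executes the data processing task, by sending the abnormal processing node a repair instruction.
In one possible implementation, the repair instruction includes the repair operation for the abnormal processing node. After determining the cause of the abnormality, the management node determines the repair operation from that cause, generates a repair instruction from the repair operation, and sends it to the abnormal processing node. After receiving the repair instruction, the abnormal processing node repairs itself according to the repair operation in the instruction.
In one possible implementation, the repair operation includes repair data. After receiving the repair instruction, the abnormal processing node obtains the repair data from it and applies the operation indicated in the repair operation to that data to repair itself. In another possible implementation, to guard against failed transmission of the repair data, the repair instruction also includes the acquisition address and/or identification information of the repair data. After receiving the repair instruction, the abnormal processing node obtains the acquisition address and/or identification information from it, fetches the repair data accordingly, and then applies the operation indicated in the repair operation to that data to repair itself.
2. If a repair completion response is received from the abnormal processing node, determine that the abnormal processing node has recovered from the abnormal state to the normal state.
The repair completion response indicates that the abnormal processing node has finished repairing itself. If the abnormal processing node successfully repairs itself according to the repair instruction, it sends a repair completion response to the management node. When the management node receives this response, it determines that the abnormal processing node has recovered from the abnormal state to the normal state, that is, a state in which the node can work normally.
3. Send configuration information to the abnormal processing node.
The configuration information is used to configure the abnormal processing node as an auxiliary node. When the management node determines that the abnormal processing node has recovered to the normal state, it sends configuration information to the node. The node configures itself accordingly and is thereby converted into an auxiliary node.
The configuration information may include the full configuration for an auxiliary node, or only the information for the parts of an auxiliary node that need to be reconfigured relative to the abnormal processing node, which is not limited in the embodiments of this application.
4. If a configuration completion response is received from the abnormal processing node, determine that the abnormal processing node has been converted into an auxiliary node.
The configuration completion response indicates that the abnormal processing node has finished configuration. After the abnormal processing node successfully completes configuration according to the configuration information, it may send a configuration completion response to the management node. When the management node receives this response, it determines that the abnormal processing node has been converted into an auxiliary node.
After the abnormal processing node has been converted into an auxiliary node, it may periodically send heartbeat packets to the management node to indicate that it is available for task assignment.
After an auxiliary node is enabled to replace the abnormal processing node, repairing and reconfiguring the abnormal processing node into an auxiliary node replenishes the pool of auxiliary nodes, ensuring that the system always has enough of them.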
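The repair-then-reconfigure sequence is effectively a small state machine on the management node's view of the abnormal node. The sketch below is one way to express it; the state names, the response strings, and the `on_response` helper are assumptions for illustration, not part of the application.

```python
from enum import Enum


class NodeState(Enum):
    ABNORMAL = "abnormal"
    NORMAL = "normal"
    AUXILIARY = "auxiliary"


# Allowed transitions, keyed by (current state, response received).
TRANSITIONS = {
    (NodeState.ABNORMAL, "repair_complete"): NodeState.NORMAL,
    (NodeState.NORMAL, "config_complete"): NodeState.AUXILIARY,
}


def on_response(state: NodeState, response: str) -> NodeState:
    """Advance the tracked state; unexpected responses leave it unchanged."""
    return TRANSITIONS.get((state, response), state)
```

A configuration completion response that arrives before the repair completion response is ignored, which mirrors the ordering of sub-steps 2 and 4 above.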
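Step 203: when it is determined to enable the auxiliary node, adjust the execution strategy of the data processing task.
The execution strategy indicates how the data processing task is to be processed, for example how the task is divided and how it is executed. In this embodiment, the management node can use the execution strategy to determine the data processing subtasks of the auxiliary node and the remaining processing nodes, where the remaining processing nodes are the processing nodes in the processing node cluster other than the abnormal processing node. In the embodiments of this application, when the management node determines to enable the auxiliary node, it adjusts the execution strategy of the data processing task to ensure that the task can be completed.
Note that in the embodiments of this application, a limit may be set on how many times the execution strategy of a data processing task can be adjusted. The execution strategy adjustment count indicates how many times the execution strategy of the data processing task has been adjusted. On this basis, the following sub-steps follow step 203:
1. Record the execution strategy adjustment count for the data processing task;
2. In response to the execution strategy adjustment count being equal to a count threshold, switch the execution strategy of the data processing task from an adjustable state to a non-adjustable state.
The count threshold is the upper limit of the execution strategy adjustment count, that is, the maximum number of adjustments allowed. It may be a value set by the designer based on experience. The adjustable state is the state in which the execution strategy of the data processing task may be adjusted, and the non-adjustable state is the state in which it may not.
In the embodiments of this application, after adjusting the execution strategy of the data processing task, the management node records the execution strategy adjustment count for the task. If the count equals the count threshold, it switches the execution strategy of the task from the adjustable state to the non-adjustable state.
After the execution strategy of the data processing task has been switched to the non-adjustable state, if an abnormal processing node appears among the remaining processing nodes of the cluster, the management node determines that the data processing task has failed and records the reason for the failure.
Limiting the number of strategy adjustments prevents a task that keeps failing for an unrecoverable reason from triggering the adjustment mechanism over and over and incurring unnecessary processing overhead.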
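The adjustment limit can be sketched as a small counter that flips to a non-adjustable state once the threshold is reached. The class and attribute names are hypothetical, and raising an exception on a further adjustment is an assumption that stands in for "the task is determined to have failed".

```python
class ExecutionStrategyGuard:
    """Track strategy adjustments for one data processing task."""

    def __init__(self, max_adjustments: int) -> None:
        self.max_adjustments = max_adjustments  # the count threshold
        self.adjustments = 0
        self.adjustable = True

    def record_adjustment(self) -> None:
        if not self.adjustable:
            # A new abnormality after the limit means the task has failed.
            raise RuntimeError("execution strategy is no longer adjustable")
        self.adjustments += 1
        if self.adjustments == self.max_adjustments:
            self.adjustable = False
```

The management node would call `record_adjustment()` each time step 203 runs and treat the `RuntimeError` as the point at which it records the failure reason.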
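Step 204: based on the execution strategy, determine the data processing subtasks respectively corresponding to the auxiliary node and the remaining processing nodes.
In the embodiments of this application, after determining the execution strategy, the management node uses it to determine the data processing subtasks of the auxiliary node and the remaining processing nodes. In one possible implementation, to ensure that the data processing task can be completed within the specified time, the management node determines the number of auxiliary nodes from the processing progress of the data processing task. On this basis, step 204 includes the following sub-steps:
1. Based on the execution strategy and the processing progress of the data processing task, determine the number m of auxiliary nodes to enable.
The processing progress of the data processing task is the ratio of the completed part of the task to the whole task at the time the abnormal processing node appeared. After determining the execution strategy, the management node obtains the processing progress of the task according to the processing method indicated by the strategy, determines the unprocessed part of the task from that progress, and determines the number m of auxiliary nodes from the data processing efficiency of an auxiliary node, where m is a positive integer.
Determining the number of auxiliary nodes to enable from the execution strategy and the processing progress of the data processing task safeguards the processing efficiency of the task and ensures that it can be completed.
2. Divide the unprocessed part of the data processing task and determine the data processing subtasks respectively corresponding to the m auxiliary nodes and the remaining processing nodes.
In the embodiments of this application, after determining the number of auxiliary nodes, the management node divides the unprocessed part of the data processing task and determines the data processing subtasks of the m auxiliary nodes and the remaining processing nodes.
Note that the unprocessed part of the data processing task is divided in a way similar to that described in step 201, which is not repeated here.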
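The application does not give a formula for m. One plausible reading, sketched here purely as an assumption, is to compute the throughput needed to finish the unprocessed part in the time remaining, subtract what the remaining processing nodes already provide, and cover the shortfall with auxiliary nodes. All parameter names are hypothetical.

```python
import math


def auxiliary_count(
    total_items: float,
    progress: float,        # fraction of the task completed before the abnormality
    time_left_s: float,     # time remaining until the task deadline
    remaining_rate: float,  # combined items/second of the remaining processing nodes
    aux_rate: float,        # items/second of a single auxiliary node
) -> int:
    """Estimate the number m of auxiliary nodes to enable (m >= 1)."""
    unprocessed = total_items * (1.0 - progress)
    required_rate = unprocessed / time_left_s
    shortfall = required_rate - remaining_rate
    # At least one auxiliary node is enabled, since m is a positive integer.
    return max(1, math.ceil(shortfall / aux_rate))
```

Once m is known, the unprocessed part can be divided among the m auxiliary nodes and the remaining processing nodes with the same schemes used in step 201.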
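After determining the data processing subtasks, the management node may generate a new task list and send it to the auxiliary nodes other than those that have been enabled.
Step 205: send corresponding task execution instructions to the auxiliary node and the remaining processing nodes.
The task execution instructions instruct the auxiliary node and the remaining processing nodes to execute the corresponding data processing subtasks. In the embodiments of this application, after determining the data processing subtasks, the management node sends the corresponding task execution instructions to the auxiliary node and the remaining processing nodes, which execute their data processing subtasks on receipt.
After receiving the task execution instructions, the auxiliary node and the remaining processing nodes may also periodically send measurement reports to the management node.
In summary, in the technical solution provided by the embodiments of this application, when an abnormal processing node is detected in the processing node cluster corresponding to a data processing task, abnormal state information of the abnormal processing node is obtained. If it is determined from the abnormal state information that an auxiliary node is to be enabled, the execution strategy of the data processing task is adjusted and the data processing subtasks of the auxiliary node and the remaining processing nodes (the processing nodes in the cluster other than the abnormal processing node) are re-determined. The auxiliary node thus replaces the abnormal processing node and executes the data processing task together with the remaining processing nodes, which avoids task failure caused by a node abnormality, improves the reliability of data processing, and ensures that the task can be completed. In addition, the method prevents the low efficiency caused by unreasonable task allocation, helps improve the data processing efficiency of each node, and thereby safeguards the processing efficiency of the whole task.
The ways of adjusting the execution strategy of the data processing task are described below.
In one possible implementation, the management node determines the execution strategy from the number of abnormal processing nodes. For example, after determining to enable the auxiliary node, the management node obtains the number of abnormal processing nodes. If that number exceeds a quantity threshold, it determines to execute a task re-sharding strategy. In this case the execution strategy includes the task re-sharding strategy, which is a strategy of re-dividing the unprocessed part of the data processing task. After selecting the task re-sharding strategy, the management node re-shards the unprocessed part of the data processing task and determines the data processing subtasks of the auxiliary node and the remaining processing nodes.
In another possible implementation, the management node determines the execution strategy from the task processing progress of the abnormal processing node's data processing subtask. For example, after determining to enable the auxiliary node, the management node sends the abnormal processing node a progress query request asking for its task processing progress. If it receives a data loss response from the abnormal processing node, which indicates that the already-processed data of that node's data processing subtask has been lost, it determines to execute a recomputation strategy. In this case the execution strategy includes the recomputation strategy, under which the auxiliary node re-executes the data processing subtask of the abnormal processing node. After selecting the recomputation strategy, the management node re-shards the unprocessed part of the abnormal processing node's data processing subtask and determines the data processing subtask of the auxiliary node. If the already-processed data of the abnormal processing node's subtask has not been lost, the management node may instead decide that the auxiliary node executes only the unprocessed part of that subtask.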
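The strategy choices described above can be sketched as a single selection function. The application presents the two criteria as separate implementations; checking them in sequence, and the strategy labels themselves, are assumptions made for this illustration.

```python
def choose_strategy(abnormal_count: int, count_threshold: int, data_lost: bool) -> str:
    """Pick an execution strategy after an auxiliary node has been enabled."""
    if abnormal_count > count_threshold:
        # Many nodes failed: re-divide the whole unprocessed part of the task.
        return "reshard"
    if data_lost:
        # The abnormal node lost its processed data: the auxiliary node
        # re-executes that node's subtask from the start.
        return "recompute"
    # Processed data survived: the auxiliary node only finishes the
    # unprocessed part of the abnormal node's subtask.
    return "resume"
```

For instance, a single failed node that kept its processed data leads to `"resume"`, while the same node reporting data loss leads to `"recompute"`.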
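Refer to FIG. 3, which shows a flowchart of a processing node configuration method provided by an embodiment of this application. Each step may be executed by a computer device, for example the policy control node 10 in the data processing system of FIG. 1. The method may include the following steps (steps 301 to 303):
Step 301: obtain a data processing task.
A data processing task is a task that processes some portion of data, such as data traversal, data storage, data conversion, or data visualization. The data processing task may process data of any type or may specifically process big data, which is not limited in the embodiments of this application.
In the embodiments of this application, the policy control node may obtain the data processing task from a storage list of data processing tasks. The storage list holds pending data processing tasks so that the policy control node is not overloaded by receiving too many tasks at once.
Step 302: determine the task information corresponding to the data processing task.
The task information is information about how data is processed during execution of the data processing task. The policy control node can use it to configure the processing nodes and auxiliary nodes for the task. In the embodiments of this application, after obtaining the data processing task, the policy control node analyzes it to determine the corresponding task information.
Step 303: according to the task information, configure a processing node cluster for the data processing task and auxiliary nodes outside the processing node cluster.
A processing node cluster is a node cluster that includes multiple processing nodes, which cooperatively execute the data processing task. The auxiliary nodes execute tasks in place of an abnormal processing node when one exists in the processing node cluster. In other words, the processing node cluster is the set of processing nodes configured for the data processing task; these nodes process the task cooperatively, and each of them executes part of the task (a data processing subtask). Auxiliary nodes are additional nodes configured outside the processing node cluster. They have the same or similar processing capability as the processing nodes and can take over a processing node's work when it becomes abnormal. One or more auxiliary nodes may be configured, which is not limited in this embodiment.
In one possible implementation, the task information may include a parallel computing speedup ratio, a task processing duration, and a data processing amount. The parallel computing speedup ratio characterizes the parallel computing efficiency for the data processing task. The task processing duration is the time needed to complete the data processing task; it may be an expected duration calculated by the policy control node or a duration required by the task, which is not limited in the embodiments of this application. The data processing amount is the amount of data that the task needs to process.
In the embodiments of this application, the policy control node can determine the numbers of processing nodes and auxiliary nodes from the parallel computing speedup ratio, the task processing duration, and the data processing amount. For example, after obtaining the task information, the policy control node determines the number of processing nodes from the task processing duration and the data processing amount, so that the task can be completed within the task processing duration. Then, for the case where the parallel computing speedup ratio reaches its upper limit, it determines the ratio between processing nodes and auxiliary nodes, so that the task is processed at the best efficiency. The upper limit may be the maximum value of the parallel computing speedup ratio or a value set by the designer based on practical experience, which is not limited in the embodiments of this application. The policy control node then determines the number of auxiliary nodes from that ratio and the number of processing nodes.
Note that in the embodiments of this application, the policy control node may also determine the number of management nodes. After determining the number of processing nodes, it obtains the maximum management quantity of a management node, that is, the maximum number of processing nodes a single management node can manage, and then determines the number of management nodes from the maximum management quantity and the number of processing nodes. The resulting numbers are such that a management node can manage each of its processing nodes without any waiting time.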
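The node-count calculation can be sketched as three ceiling divisions. The application names the inputs but not the arithmetic, so this is only one plausible reading: `per_node_rate` (items per second per processing node) and `aux_ratio` (auxiliary nodes per processing node, assumed to have been chosen at the speedup-ratio upper limit) are hypothetical parameters.

```python
import math


def plan_nodes(
    data_amount: float,    # data processing amount, in items
    duration_s: float,     # task processing duration, in seconds
    per_node_rate: float,  # assumed throughput of one processing node
    aux_ratio: float,      # assumed auxiliary-to-processing node ratio
    max_managed: int,      # maximum processing nodes per management node
) -> tuple[int, int, int]:
    """Return (processing, auxiliary, management) node counts."""
    # Enough processing nodes to finish within the task processing duration.
    processing = math.ceil(data_amount / (per_node_rate * duration_s))
    # Auxiliary nodes follow from the ratio; at least one is configured.
    auxiliary = max(1, math.ceil(processing * aux_ratio))
    # Each management node handles at most max_managed processing nodes.
    management = math.ceil(processing / max_managed)
    return processing, auxiliary, management
```

For example, 8000 items in 100 seconds at 10 items per second per node needs 8 processing nodes; with a ratio of 0.25 and at most 3 nodes per manager, that gives 2 auxiliary nodes and 3 management nodes.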
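After determining the numbers of management nodes, processing nodes, and auxiliary nodes, the policy control node sends data processing information to the management node. The data processing information includes the data processing task and the numbers of management nodes, processing nodes, and auxiliary nodes corresponding to the task.
In other possible implementations, the policy control node may also select specific processing nodes or auxiliary nodes according to the data processing type of the task. For example, if the data processing task is data visualization, the policy control node selects processing nodes and auxiliary nodes that are efficient at data visualization and sends their identifiers to the management node.
In summary, in the technical solution provided by the embodiments of this application, the number of each kind of node is determined from the task information of the data processing task, so that different tasks are configured with different numbers of nodes. This avoids tasks taking too long because there are too few nodes and avoids wasting resources because there are too many, so that the reliability of the data processing task is ensured while unnecessary resource waste is reduced.
Note that the above description of each step is only exemplary and explanatory. In practice, the entity that executes each step may differ from what is described in this application. For example, the management node may execute the processing node configuration method described for the policy control node, or another node may perform the step of dividing the data processing task, and so on, which is not limited in the embodiments of this application.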
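The following are apparatus embodiments of this application, which can be used to execute the method embodiments of this application. For details not disclosed in the apparatus embodiments, refer to the method embodiments.
Refer to FIG. 4, which shows a block diagram of a processing node management apparatus provided by an embodiment of this application. The apparatus has the function of implementing the above processing node management method; the function may be implemented by hardware or by hardware executing corresponding software. The apparatus may be a computer device or may be provided in a computer device. The apparatus 400 may include an information acquisition module 401, a node enabling module 402, a strategy adjustment module 403, a task determination module 404, and an instruction sending module 405.
The information acquisition module 401 is configured to obtain, when an abnormal processing node is detected in the processing node cluster corresponding to a data processing task, abnormal state information of the abnormal processing node, where the processing node cluster includes a plurality of processing nodes configured to cooperatively execute the data processing task.
The node enabling module 402 is configured to determine, if the abnormal state information satisfies a condition, to enable an auxiliary node outside the processing node cluster to replace the abnormal processing node.
The strategy adjustment module 403 is configured to adjust, when it is determined to enable the auxiliary node, an execution strategy of the data processing task, where the execution strategy indicates how the data processing task is to be processed.
The task determination module 404 is configured to determine, based on the execution strategy, the data processing subtasks respectively corresponding to the auxiliary node and the remaining processing nodes, where the remaining processing nodes are the processing nodes in the processing node cluster other than the abnormal processing node.
The instruction sending module 405 is configured to send corresponding task execution instructions to the auxiliary node and the remaining processing nodes, where the task execution instructions instruct the auxiliary node and the remaining processing nodes to execute the corresponding data processing subtasks.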
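In an exemplary embodiment, the strategy adjustment module 403 is configured to obtain the number of abnormal processing nodes and, if that number exceeds a quantity threshold, determine to execute a task re-sharding strategy, where the execution strategy includes the task re-sharding strategy, which is a strategy of re-dividing the unprocessed part of the data processing task.
In an exemplary embodiment, the strategy adjustment module 403 is configured to send the abnormal processing node a progress query request asking for its task processing progress and, if a data loss response is received from the abnormal processing node, determine to execute a recomputation strategy, where the data loss response indicates that the already-processed data of the abnormal processing node's data processing subtask has been lost, the execution strategy includes the recomputation strategy, and under the recomputation strategy the auxiliary node re-executes the data processing subtask of the abnormal processing node.
In an exemplary embodiment, the task determination module 404 is configured to determine, based on the execution strategy and the processing progress of the data processing task, the number m of auxiliary nodes to enable, where m is a positive integer, and to divide the unprocessed part of the data processing task and determine the data processing subtasks respectively corresponding to the m auxiliary nodes and the remaining processing nodes.
In an exemplary embodiment, as shown in FIG. 5, the apparatus 400 further includes a cause determination module 406, a duration determination module 407, and a node determination module 408.
The cause determination module 406 is configured to determine the cause of the abnormality of the abnormal processing node from the abnormal state information.
The duration determination module 407 is configured to determine the repair time of the abnormal processing node from the cause of the abnormality.
The node determination module 408 is configured to determine, if the repair time exceeds a threshold, to enable the auxiliary node to replace the abnormal processing node.
In an exemplary embodiment, as shown in FIG. 5, the apparatus 400 further includes a count recording module 409 and a state switching module 410.
The count recording module 409 is configured to record the execution strategy adjustment count for the data processing task.
The state switching module 410 is configured to switch, in response to the execution strategy adjustment count being equal to a count threshold, the execution strategy of the data processing task from an adjustable state to a non-adjustable state.
In an exemplary embodiment, as shown in FIG. 5, the apparatus 400 further includes a node repair module 411.
The node repair module 411 is configured to: send a repair instruction to the abnormal processing node, where the repair instruction includes repair data for repairing the abnormal processing node; if a repair completion response is received from the abnormal processing node, determine that the abnormal processing node has recovered from the abnormal state to the normal state; send configuration information to the abnormal processing node, where the configuration information is used to configure the abnormal processing node as an auxiliary node; and if a configuration completion response is received from the abnormal processing node, determine that the abnormal processing node has been converted into an auxiliary node.
In an exemplary embodiment, as shown in FIG. 5, the apparatus 400 further includes an abnormality determination module 412.
The abnormality determination module 412 is configured to: obtain the measurement reports sent by the processing nodes in the processing node cluster, where a measurement report includes task processing information indicating the task processing progress of the processing node and node state information indicating its working state; and if the measurement report from a target processing node is abnormal, determine that the target processing node is the abnormal processing node, where a measurement report is abnormal in at least one of the following cases: the task processing information in the report is abnormal, the node state information in the report is abnormal, or no measurement report has been received within a set period.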
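In summary, in the technical solution provided by the embodiments of this application, when an abnormal processing node is detected in the processing node cluster corresponding to a data processing task, abnormal state information of the abnormal processing node is obtained. If it is determined from the abnormal state information that an auxiliary node is to be enabled, the execution strategy of the data processing task is adjusted and the data processing subtasks of the auxiliary node and the remaining processing nodes (the processing nodes in the cluster other than the abnormal processing node) are re-determined. The auxiliary node thus replaces the abnormal processing node and executes the data processing task together with the remaining processing nodes, which avoids task failure caused by a node abnormality, improves the reliability of data processing, and ensures that the task can be completed. In addition, the method prevents the low efficiency caused by unreasonable task allocation, helps improve the data processing efficiency of each node, and thereby safeguards the processing efficiency of the whole task.
Refer to FIG. 6, which shows a block diagram of a processing node configuration apparatus provided by an embodiment of this application. The apparatus has the function of implementing the above processing node configuration method; the function may be implemented by hardware or by hardware executing corresponding software. The apparatus may be a computer device or may be provided in a computer device. The apparatus 600 may include a task acquisition module 601, an information determination module 602, and a node configuration module 603.
The task acquisition module 601 is configured to obtain a data processing task.
The information determination module 602 is configured to determine task information corresponding to the data processing task, where the task information is information about how data is processed during execution of the data processing task.
The node configuration module 603 is configured to configure, according to the task information, a processing node cluster for the data processing task and auxiliary nodes outside the processing node cluster.
The processing node cluster includes a plurality of processing nodes configured to cooperatively execute the data processing task, and the auxiliary nodes are configured to execute tasks in place of an abnormal processing node when one exists in the processing node cluster.
In an exemplary embodiment, the task information includes a parallel computing speedup ratio, a task processing duration, and a data processing amount; the node configuration module 603 is configured to determine the number of processing nodes from the task processing duration and the data processing amount, determine the ratio between processing nodes and auxiliary nodes for the case where the parallel computing speedup ratio reaches its upper limit, and determine the number of auxiliary nodes from that ratio and the number of processing nodes.
In an exemplary embodiment, as shown in FIG. 7, the apparatus 600 further includes a quantity acquisition module 604 and a quantity determination module 605.
The quantity acquisition module 604 is configured to obtain the maximum management quantity of a management node, that is, the maximum number of processing nodes a single management node can manage.
The quantity determination module 605 is configured to determine the number of management nodes from the maximum management quantity and the number of processing nodes.
In summary, in the technical solution provided by the embodiments of this application, when an abnormal processing node is detected in the processing node cluster corresponding to a data processing task, abnormal state information of the abnormal processing node is obtained. If it is determined from the abnormal state information that an auxiliary node is to be enabled, the execution strategy of the data processing task is adjusted and the data processing subtasks of the auxiliary node and the remaining processing nodes (the processing nodes in the cluster other than the abnormal processing node) are re-determined. The auxiliary node thus replaces the abnormal processing node and executes the data processing task together with the remaining processing nodes, which avoids task failure caused by a node abnormality, improves the reliability of data processing, and ensures that the task can be completed. In addition, the method prevents the low efficiency caused by unreasonable task allocation, helps improve the data processing efficiency of each node, and thereby safeguards the processing efficiency of the whole task.
Note that the division into functional modules described above is only an example of how the apparatus of the above embodiments implements its functions. In practice, the functions may be assigned to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to perform all or part of the functions described above. In addition, the apparatus embodiments and the method embodiments belong to the same concept; for the specific implementation process, see the method embodiments, which are not repeated here.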
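Refer to FIG. 8, which shows a structural block diagram of a computer device provided by an embodiment of this application. The computer device can be used to implement the above processing node management method or the above processing node configuration method. Specifically:
The computer device 800 includes a central processing unit (CPU) 801, a system memory 804 that includes a random access memory (RAM) 802 and a read-only memory (ROM) 803, and a system bus 805 that connects the system memory 804 and the central processing unit 801. The computer device 800 also includes a basic input/output (I/O) system 806 that helps transfer information between the components of the computer, and a mass storage device 807 for storing an operating system 813, application programs 814, and other program modules 812.
The basic input/output system 806 includes a display 808 for displaying information and an input device 809, such as a mouse or keyboard, for the user to enter information. The display 808 and the input device 809 are both connected to the central processing unit 801 through an input/output controller 810 connected to the system bus 805. The basic input/output system 806 may also include the input/output controller 810 for receiving and processing input from other devices such as a keyboard, mouse, or electronic stylus. Similarly, the input/output controller 810 also provides output to a display screen, printer, or other type of output device.
The mass storage device 807 is connected to the central processing unit 801 through a mass storage controller (not shown) connected to the system bus 805. The mass storage device 807 and its associated computer-readable media provide non-volatile storage for the computer device 800. That is, the mass storage device 807 may include a computer-readable medium (not shown) such as a hard disk or a CD-ROM (Compact Disc Read-Only Memory) drive.
Without loss of generality, computer-readable media may include computer storage media and communication media. Computer storage media include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include RAM, ROM, EPROM (Erasable Programmable Read Only Memory), EEPROM (Electrically Erasable Programmable Read Only Memory), flash memory or other solid-state storage technologies, CD-ROM, DVD (Digital Video Disc) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage, or other magnetic storage devices. A person skilled in the art will appreciate that computer storage media are not limited to those listed above. The system memory 804 and the mass storage device 807 may be collectively referred to as memory.
According to various embodiments of this application, the computer device 800 may also run by connecting to a remote computer on a network such as the Internet. That is, the computer device 800 may connect to the network 812 through a network interface unit 811 connected to the system bus 805, or may use the network interface unit 811 to connect to other types of networks or remote computer systems (not shown).
The memory also includes a computer program that is stored in the memory and configured to be executed by one or more processors to implement the above processing node management method or the above processing node configuration method.
In an exemplary embodiment, a computer-readable storage medium is also provided. The storage medium stores at least one instruction, at least one program, a code set, or an instruction set which, when executed by a processor, implements the above processing node management method or the above processing node configuration method.
Optionally, the computer-readable storage medium may include a ROM (Read Only Memory), a RAM (Random Access Memory), an SSD (Solid State Drive), an optical disc, or the like. The random access memory may include ReRAM (Resistance Random Access Memory) and DRAM (Dynamic Random Access Memory).
In an exemplary embodiment, a computer program product or computer program is also provided, which includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the above processing node management method or the above processing node configuration method.
It should be understood that "a plurality" as used herein means two or more. "And/or" describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" can mean A alone, both A and B, or B alone. The character "/" generally indicates an "or" relationship between the objects before and after it. In addition, the step numbers used herein only show one possible execution order of the steps by way of example. In some other embodiments, the steps may be executed out of numerical order; for example, two steps with different numbers may be executed at the same time or in the reverse of the order shown, which is not limited in the embodiments of this application.
The above are merely exemplary embodiments of this application and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of this application shall fall within its scope of protection.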

Claims (16)

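  1. A processing node management method, executed by a computer device, the method comprising:
    when an abnormal processing node is detected in the processing node cluster corresponding to a data processing task, obtaining abnormal state information of the abnormal processing node, wherein the processing node cluster comprises a plurality of processing nodes configured to cooperatively execute the data processing task;
    if the abnormal state information satisfies a condition, determining to enable an auxiliary node outside the processing node cluster to replace the abnormal processing node;
    when it is determined to enable the auxiliary node, adjusting an execution strategy of the data processing task, wherein the execution strategy indicates how the data processing task is to be processed;
    determining, based on the execution strategy, the data processing subtasks respectively corresponding to the auxiliary node and the remaining processing nodes, wherein the remaining processing nodes are the processing nodes in the processing node cluster other than the abnormal processing node;
    sending corresponding task execution instructions to the auxiliary node and the remaining processing nodes, wherein the task execution instructions instruct the auxiliary node and the remaining processing nodes to execute the corresponding data processing subtasks.
  2. The method according to claim 1, wherein adjusting the execution strategy of the data processing task comprises:
    obtaining the number of abnormal processing nodes;
    if the number of abnormal processing nodes is greater than a quantity threshold, determining to execute a task re-sharding strategy;
    wherein the execution strategy comprises the task re-sharding strategy, which is a strategy of re-dividing the unprocessed part of the data processing task.
  3. The method according to claim 1, wherein adjusting the execution strategy of the data processing task comprises:
    sending a progress query request to the abnormal processing node, wherein the progress query request requests the task processing progress of the abnormal processing node;
    if a data loss response is received from the abnormal processing node, determining to execute a recomputation strategy, wherein the data loss response indicates that the already-processed data of the data processing subtask corresponding to the abnormal processing node has been lost;
    wherein the execution strategy comprises the recomputation strategy, under which the auxiliary node re-executes the data processing subtask corresponding to the abnormal processing node.
  4. The method according to claim 1, wherein determining, based on the execution strategy, the data processing subtasks respectively corresponding to the auxiliary node and the remaining processing nodes comprises:
    determining, based on the execution strategy and the processing progress of the data processing task, the number m of auxiliary nodes to enable, wherein m is a positive integer;
    dividing the unprocessed part of the data processing task and determining the data processing subtasks respectively corresponding to the m auxiliary nodes and the remaining processing nodes.
  5. The method according to claim 1, wherein determining, if the abnormal state information satisfies a condition, to enable an auxiliary node outside the processing node cluster to replace the abnormal processing node comprises:
    determining the cause of the abnormality of the abnormal processing node from the abnormal state information;
    determining the repair time of the abnormal processing node from the cause of the abnormality;
    if the repair time is greater than a threshold, determining to enable the auxiliary node to replace the abnormal processing node.
  6. The method according to claim 1, wherein after adjusting the execution strategy of the data processing task, the method further comprises:
    recording the execution strategy adjustment count for the data processing task;
    in response to the execution strategy adjustment count being equal to a count threshold, switching the execution strategy of the data processing task from an adjustable state to a non-adjustable state.
  7. The method according to claim 1, wherein after determining to enable an auxiliary node outside the processing node cluster to replace the abnormal processing node, the method further comprises:
    sending a repair instruction to the abnormal processing node, wherein the repair instruction comprises repair data for repairing the abnormal processing node;
    if a repair completion response is received from the abnormal processing node, determining that the abnormal processing node has recovered from the abnormal state to the normal state;
    sending configuration information to the abnormal processing node, wherein the configuration information is used to configure the abnormal processing node as the auxiliary node;
    if a configuration completion response is received from the abnormal processing node, determining that the abnormal processing node has been converted into the auxiliary node.
  8. The method according to any one of claims 1 to 7, wherein before obtaining the abnormal state information of the abnormal processing node, the method further comprises:
    obtaining the measurement reports sent by the processing nodes in the processing node cluster, wherein a measurement report comprises task processing information and node state information, the task processing information indicates the task processing progress of the processing node, and the node state information indicates the working state of the processing node;
    if the measurement report from a target processing node is abnormal, determining that the target processing node is the abnormal processing node;
    wherein the measurement report is abnormal in at least one of the following cases: the task processing information in the measurement report is abnormal, the node state information in the measurement report is abnormal, or the measurement report has not been received within a set period.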
  9. 一种处理节点的配置方法,所述方法由计算机设备执行,所述方法包括:
    获取数据处理任务;
    确定所述数据处理任务对应的任务信息,所述任务信息是指所述数据处理任务在执行过程中针对于数据的处理情况的相关信息;
    根据所述任务信息为所述数据处理任务配置处理节点集群,以及除所述处理节点集群之外的辅助节点;
    其中,所述处理节点集群中包括多个处理节点,所述多个处理节点用于协同执行所述数据处理任务;所述辅助节点用于在所述处理节点集群中存在异常处理节点的情况下,代替所述异常处理节点执行任务。
  10. 根据权利要求9所述的方法,所述任务信息包括并行计算加速度比、任务处理时长和数据处理量;
    所述根据所述任务信息为所述数据处理任务配置处理节点集群,以及除所述处理节点集群之外的辅助节点,包括:
    根据所述任务处理时长和所述数据处理量,确定所述处理节点的数量;
    在所述并行计算加速度比达到上限值的情况下,确定所述处理节点和所述辅助节点之间的比例;
    依据所述处理节点和所述辅助节点之间的比例,根据所述处理节点的数量,确定所述辅助节点的数量。
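  11. The method according to claim 9, further comprising:
    obtaining the maximum management quantity of a management node, the maximum management quantity being the maximum number of processing nodes that a single management node can manage;
    determining the number of management nodes according to the maximum management quantity and the number of processing nodes.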
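One natural reading of claim 11 is a ceiling division, sketched below. The claim only says the count is determined from the two quantities, so the rounding rule and the function name are assumptions.

```python
def management_node_count(processing_nodes: int, max_managed: int) -> int:
    """Smallest number of management nodes that covers every processing node."""
    if processing_nodes < 0 or max_managed <= 0:
        raise ValueError("processing_nodes must be >= 0 and max_managed > 0")
    # Integer ceiling division avoids floating-point rounding.
    return (processing_nodes + max_managed - 1) // max_managed
```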
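  12. A processing node management apparatus, deployed on a computer device, the apparatus comprising:
    an information obtaining module, configured to obtain abnormal state information of an abnormal processing node when it is detected that a processing node cluster corresponding to a data processing task contains the abnormal processing node, the processing node cluster comprising a plurality of processing nodes used to cooperatively execute the data processing task;
    a node enabling module, configured to determine, if the abnormal state information satisfies a condition, to enable an auxiliary node outside the processing node cluster to replace the abnormal processing node;
    a policy adjustment module, configured to adjust the execution policy of the data processing task when it is determined to enable the auxiliary node, the execution policy indicating how the data processing task is to be processed;
    a task determining module, configured to determine, based on the execution policy, the data processing subtasks respectively corresponding to the auxiliary node and the remaining processing nodes, the remaining processing nodes being the processing nodes in the processing node cluster other than the abnormal processing node;
    an instruction sending module, configured to send corresponding task execution instructions to the auxiliary node and the remaining processing nodes, the task execution instructions being used to instruct the auxiliary node and the remaining processing nodes to execute the corresponding data processing subtasks.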
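  13. A processing node configuration apparatus, deployed on a computer device, the apparatus comprising:
    a task obtaining module, configured to obtain a data processing task;
    an information determining module, configured to determine task information corresponding to the data processing task, the task information being information about how data is processed during execution of the data processing task;
    a node configuration module, configured to configure, according to the task information, a processing node cluster for the data processing task, and an auxiliary node outside the processing node cluster;
    wherein the processing node cluster comprises a plurality of processing nodes used to cooperatively execute the data processing task, and the auxiliary node is used to execute tasks in place of an abnormal processing node when the processing node cluster contains an abnormal processing node.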
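  14. A computer device, comprising a processor and a memory, the memory storing at least one instruction, at least one program, a code set, or an instruction set, wherein the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the processing node management method according to any one of claims 1 to 8, or the processing node configuration method according to any one of claims 9 to 11.
  15. A computer-readable storage medium, storing at least one instruction, at least one program, a code set, or an instruction set, wherein the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the processing node management method according to any one of claims 1 to 8, or the processing node configuration method according to any one of claims 9 to 11.
  16. A computer program product which, when executed, performs the processing node management method according to any one of claims 1 to 8, or implements the processing node configuration method according to any one of claims 9 to 11.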
PCT/CN2021/097956 2020-07-08 2021-06-02 Processing node management method, configuration method, and related apparatus WO2022007552A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/743,837 US20220269564A1 (en) 2020-07-08 2022-05-13 Processing node management method, configuration method, and related apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010652008.1A CN111818159B (zh) 2020-07-08 2020-07-08 Data processing node management method, apparatus, device, and storage medium
CN202010652008.1 2020-07-08

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/743,837 Continuation US20220269564A1 (en) 2020-07-08 2022-05-13 Processing node management method, configuration method, and related apparatus

Publications (1)

Publication Number Publication Date
WO2022007552A1 (zh) 2022-01-13

Family

ID=72842940

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/097956 WO2022007552A1 (zh) 2020-07-08 2021-06-02 Processing node management method, configuration method, and related apparatus

Country Status (3)

Country Link
US (1) US20220269564A1 (zh)
CN (1) CN111818159B (zh)
WO (1) WO2022007552A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114638548A (zh) * 2022-05-09 2022-06-17 浙江国利网安科技有限公司 Risk control method and apparatus for an industrial control system, and electronic device

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
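CN111818159B (zh) * 2020-07-08 2024-04-05 腾讯科技(深圳)有限公司 Data processing node management method, apparatus, device, and storage medium
CN112202687B (zh) * 2020-12-03 2021-05-25 苏州浪潮智能科技有限公司 Node synchronization method, apparatus, device, and storage medium
CN112965791B (zh) * 2021-03-29 2022-06-07 北京三快在线科技有限公司 Scheduled task detection method, apparatus, device, and storage medium
CN113687834B (zh) * 2021-10-27 2022-02-18 深圳华锐金融技术股份有限公司 Distributed system node deployment method, apparatus, device, and medium
CN114327817A (zh) * 2021-12-22 2022-04-12 马上消费金融股份有限公司 Task sharding method and apparatus, and electronic device
CN114567471B (zh) * 2022-02-22 2022-10-28 珠海市鸿瑞信息技术股份有限公司 5G-based power communication network security detection system and method
CN115103001B (zh) * 2022-05-10 2024-03-08 航天国政信息技术(北京)有限公司 Communication method and apparatus, and electronic device
CN115118473B (zh) * 2022-06-20 2023-07-14 中国联合网络通信集团有限公司 Data processing method, apparatus, device, and storage medium
CN116743791B (zh) * 2022-09-30 2024-06-18 腾讯云计算(北京)有限责任公司 Cloud-edge synchronization method, apparatus, device, and storage medium for a metro cloud platform
CN116074387A (zh) * 2023-03-16 2023-05-05 中国工商银行股份有限公司 Service request processing method and apparatus, and computer device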

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
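CN108768729A (zh) * 2018-05-31 2018-11-06 郑州云海信息技术有限公司 Method and apparatus for transferring storage nodes based on an HDFS cluster
CN111381972A (zh) * 2018-12-27 2020-07-07 北京奇虎科技有限公司 Distributed task scheduling method, apparatus, and system
CN111459642A (zh) * 2020-04-08 2020-07-28 广州欢聊网络科技有限公司 Fault handling and task processing method and apparatus in a distributed system
CN111818159A (zh) * 2020-07-08 2020-10-23 腾讯科技(深圳)有限公司 Data processing node management method, apparatus, device, and storage medium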

Family Cites Families (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050022202A1 (en) * 2003-07-09 2005-01-27 Sun Microsystems, Inc. Request failover mechanism for a load balancing system
WO2008113986A2 (en) * 2007-03-16 2008-09-25 British Telecommunications Public Limited Company Data transmission scheduler
CA2706119A1 (en) * 2007-11-08 2009-05-14 Antoine Blondeau Distributed network for performing complex algorithms
US8055933B2 (en) * 2009-07-21 2011-11-08 International Business Machines Corporation Dynamic updating of failover policies for increased application availability
US9588994B2 (en) * 2012-03-02 2017-03-07 International Business Machines Corporation Transferring task execution in a distributed storage and task network
US9223626B2 (en) * 2012-08-30 2015-12-29 International Business Machines Corporation Task execution and management in a clustered computing environment
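CN102999385B (zh) * 2012-11-06 2016-05-25 国网山东省电力公司枣庄供电公司 Multi-processor cooperative processing method in a computing device
CN103324539B (zh) * 2013-06-24 2017-05-24 浪潮电子信息产业股份有限公司 Job scheduling management system and method
CN103617086B (zh) * 2013-11-20 2017-02-08 东软集团股份有限公司 Parallel computing method and system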
US10284487B2 (en) * 2014-04-25 2019-05-07 Paypal, Inc. Software load balancer to maximize utilization
US9690675B2 (en) * 2014-07-17 2017-06-27 Cohesity, Inc. Dynamically changing members of a consensus group in a distributed self-healing coordination service
CN104461752B (zh) * 2014-11-21 2018-09-18 浙江宇视科技有限公司 Multimedia distributed task processing method with two-level fault tolerance
US9888063B2 (en) * 2014-12-10 2018-02-06 International Business Machines Corporation Combining application and data tiers on different platforms to create workload distribution recommendations
US10089197B2 (en) * 2014-12-16 2018-10-02 Intel Corporation Leverage offload programming model for local checkpoints
US9785480B2 (en) * 2015-02-12 2017-10-10 Netapp, Inc. Load balancing and fault tolerant service in a distributed data system
CN106155770B (zh) * 2015-03-30 2019-11-26 联想(北京)有限公司 Task scheduling method and electronic device
CN105335251B (zh) * 2015-09-23 2018-11-02 浪潮(北京)电子信息产业有限公司 Fault recovery method and system
US10719353B2 (en) * 2016-09-23 2020-07-21 Sap Se Handling failovers at one or more nodes in a distributed database system
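CN106713944A (zh) * 2016-12-30 2017-05-24 北京奇虎科技有限公司 Method and apparatus for processing a streaming data task
CN107092522B (zh) * 2017-03-30 2020-07-21 阿里巴巴集团控股有限公司 Real-time data computing method and apparatus
CN107105032B (zh) * 2017-04-20 2019-08-06 腾讯科技(深圳)有限公司 Node device operating method and node device
CN109976883A (zh) * 2017-12-27 2019-07-05 深圳市优必选科技有限公司 Task processing method and system thereof
CN108304255A (zh) * 2017-12-29 2018-07-20 北京城市网邻信息技术有限公司 Distributed task scheduling method and apparatus, electronic device, and readable storage medium
CN109343939B (zh) * 2018-07-31 2022-01-07 国家电网有限公司 Distributed cluster and parallel computing task scheduling method
CN111090502B (zh) * 2018-10-24 2024-05-17 阿里巴巴集团控股有限公司 Streaming data task scheduling method and apparatus
CN110012062B (zh) * 2019-02-22 2022-02-08 北京奇艺世纪科技有限公司 Multi-datacenter task scheduling method, apparatus, and storage medium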
US10990464B1 (en) * 2019-09-04 2021-04-27 Amazon Technologies, Inc. Block-storage service supporting multi-attach and health check failover mechanism
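CN110677282B (zh) * 2019-09-23 2022-05-17 天津津航计算技术研究所 Hot backup method for a distributed system, and distributed system
CN110716827B (zh) * 2019-09-23 2023-04-28 天津津航计算技术研究所 Hot backup method applicable to a distributed system, and distributed system
CN110727508A (zh) * 2019-10-24 2020-01-24 无锡京和信息技术有限公司 Task scheduling system and scheduling method
CN111181774A (zh) * 2019-12-13 2020-05-19 苏州浪潮智能科技有限公司 High-availability method, system, terminal, and storage medium for MapReduce tasks
CN111160810A (zh) * 2020-01-09 2020-05-15 中国地质大学(武汉) Workflow-based high-performance distributed spatial analysis task scheduling method and system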

Also Published As

Publication number Publication date
CN111818159B (zh) 2024-04-05
CN111818159A (zh) 2020-10-23
US20220269564A1 (en) 2022-08-25

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 15/06/2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21837497

Country of ref document: EP

Kind code of ref document: A1