CN111181774A - High-availability method, system, terminal and storage medium for MapReduce task - Google Patents

Publication number
CN111181774A
Authority
CN
China
Prior art keywords
node
execution
standby
task
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201911283083.9A
Other languages
Chinese (zh)
Inventor
道玉明
张东东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd filed Critical Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN201911283083.9A priority Critical patent/CN111181774A/en
Publication of CN111181774A publication Critical patent/CN111181774A/en
Withdrawn legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0663 Performing the actions predefined by failover planning, e.g. switching to standby network elements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3409 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network

Abstract

The invention provides a high-availability method, system, terminal and storage medium for a MapReduce task. The method comprises the following steps: selecting a standby node from a cluster and storing the standby node; monitoring the state of an execution node of the MapReduce task; and, if the state of the execution node is found to be abnormal, forwarding the task of the abnormal execution node to a standby node. The invention allows the task to continue executing without interruption, which saves human resources and helps ensure product quality; the entire process is recorded in logs, which makes subsequent review faster and more convenient.

Description

High-availability method, system, terminal and storage medium for MapReduce task
Technical Field
The invention relates to the technical field of big data Insight platforms, and in particular to a high-availability method, system, terminal and storage medium for a MapReduce task.
Background
In the big data Insight platform, MapReduce is one of the core components. An Insight distribution needs to comprise two parts, a distributed file system (HDFS) and a distributed computing framework (MapReduce), and neither can be omitted. The Insight components rely on MapReduce to perform tasks, so MapReduce is a key factor for a big data platform. Currently, a MapReduce task is executed on a data node, but its execution is affected by conditions such as data node downtime, network unavailability and poor performance, any of which can cause the execution to fail. There is also a risk that logs cannot be inspected when the data node is down, so that the cause of the failure cannot be traced, which wastes labor, time and cluster resources.
Disclosure of Invention
In view of the above deficiencies of the prior art, the present invention provides a high-availability method, system, terminal and storage medium for a MapReduce task, so as to solve the above technical problems.
In a first aspect, the present invention provides a high-availability method for a MapReduce task, including:
selecting a standby node from a cluster and storing the standby node;
monitoring the state of an execution node of the MapReduce task;
and if the state of the execution node is monitored to be abnormal, forwarding the task of the abnormal execution node to a standby node.
Further, the selecting a standby node from the cluster includes:
collecting performance parameters of all nodes of a cluster;
selecting a plurality of idle nodes with optimal performance parameters as standby nodes, wherein the number of the standby nodes is not less than the number of the execution nodes.
Further, the monitoring of the state of the execution node of the MapReduce task includes:
acquiring a performance parameter of an execution node, wherein the performance parameter is a weighted sum of the I/O, Job, disk, CPU, network, memory, power supply and running time indexes;
judging whether the performance parameter of the execution node exceeds a preset threshold value;
and if so, judging that the state of the execution node is abnormal.
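The threshold test in this step can be written compactly as follows. The symbols are notation introduced here for illustration and do not appear in the source: x_i is the measured value of the i-th index (I/O, Job, disk, CPU, network, memory, power supply, running time), w_i is its weight, S is the resulting performance parameter and T is the preset threshold.

```latex
S = \sum_{i=1}^{8} w_i \, x_i ,
\qquad
\text{the execution node is judged abnormal} \iff S > T
```

The source does not say how the index values are scaled or what values the weights and the threshold take.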
Further, the method further comprises:
saving the log storage path of the execution node;
and monitoring and storing the task execution progress of the execution node in real time.
Further, the forwarding the task of the abnormal execution node to the standby node includes:
forwarding the task execution progress and the task data of the abnormal execution node to a standby node;
and setting the log storage path of the abnormal execution node as the log storage path of the standby node.
In a second aspect, the present invention provides a high-availability system for a MapReduce task, including:
the standby selecting unit is configured to select a standby node from the cluster and store the standby node;
the state monitoring unit is configured to monitor the state of the execution node of the MapReduce task;
and the task forwarding unit is configured to forward the task of the abnormal execution node to the standby node if the abnormal state of the execution node is monitored.
Further, the standby selecting unit includes:
the parameter acquisition module is configured for acquiring performance parameters of all nodes of the cluster;
and the node screening module is configured to select a plurality of idle nodes with optimal performance parameters as standby nodes, wherein the number of the standby nodes is not less than the number of the execution nodes.
Further, the state monitoring unit includes:
the parameter calculation module is configured to acquire performance parameters of the execution node, wherein the performance parameters are weighted summation of I/O, Job, a disk, a CPU, a network, a memory, a power supply and running time;
the parameter judgment module is configured to judge whether the performance parameter of the execution node exceeds a preset threshold value;
and the abnormity determining module is configured for determining that the state of the execution node is abnormal if the performance parameter of the execution node exceeds a preset threshold value.
In a third aspect, a terminal is provided, including:
a processor, a memory, wherein,
the memory is used for storing a computer program which,
the processor is used for calling and running the computer program from the memory, so that the terminal executes the method described above.
In a fourth aspect, a computer storage medium is provided having stored therein instructions that, when executed on a computer, cause the computer to perform the method of the above aspects.
The beneficial effects of the invention are as follows.
According to the high-availability method, system, terminal and storage medium for a MapReduce task provided by the invention, a standby node is selected from the cluster, the state of the execution node of the MapReduce task is monitored in real time, and the task of an abnormal execution node is transferred to the standby node as soon as the state of the execution node becomes abnormal, thereby achieving high availability of the MapReduce task. The invention allows the task to continue executing without interruption, which saves human resources and helps ensure product quality; the entire process is recorded in logs, which makes subsequent review faster and more convenient.
In addition, the invention has reliable design principle, simple structure and very wide application prospect.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that those skilled in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1 is a schematic flow diagram of a method of one embodiment of the invention.
FIG. 2 is a schematic block diagram of a system of one embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a terminal according to an embodiment of the present invention.
Detailed Description
In order to make those skilled in the art better understand the technical solution of the present invention, the technical solution in the embodiment of the present invention will be clearly and completely described below with reference to the drawings in the embodiment of the present invention, and it is obvious that the described embodiment is only a part of the embodiment of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
FIG. 1 is a schematic flow diagram of a method of one embodiment of the invention. The method of FIG. 1 may be executed by a high-availability system for a MapReduce task.
As shown in fig. 1, the method 100 includes:
step 110, selecting a standby node from the cluster and storing the standby node;
step 120, monitoring the state of the execution node of the MapReduce task;
step 130, if the execution node state is monitored to be abnormal, forwarding the task of the abnormal execution node to the standby node.
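The three steps of method 100 can be sketched as a single monitoring pass. This is a minimal illustration under stated assumptions, not the patented implementation: the function name run_high_availability_pass and the three callbacks (select_standby, is_abnormal, forward) are hypothetical stand-ins for the selection, monitoring and forwarding logic that the source describes only in prose.

```python
from typing import Callable, Dict, List


def run_high_availability_pass(
    cluster: List[str],
    executing: Dict[str, str],  # execution node name -> task id
    select_standby: Callable[[List[str], int], List[str]],
    is_abnormal: Callable[[str], bool],
    forward: Callable[[str, str, str], None],
) -> Dict[str, str]:
    """Run steps 110-130 once and return the updated node -> task map."""
    # Step 110: select standby nodes from the nodes that are not executing
    # a task, asking for at least one standby per execution node.
    candidates = [node for node in cluster if node not in executing]
    standby = select_standby(candidates, len(executing))

    updated = dict(executing)
    for node, task in executing.items():
        # Step 120: check the state of each execution node.
        if is_abnormal(node) and standby:
            # Step 130: forward the task of the abnormal node to a standby node.
            target = standby.pop(0)
            forward(task, node, target)
            del updated[node]
            updated[target] = task
    return updated
```

In a real deployment this pass would presumably be repeated on a timer, since the source describes the monitoring as real-time.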
To facilitate understanding of the present invention, the high-availability method for a MapReduce task is further described below, based on its principle and in combination with the process of scheduling and managing MapReduce tasks in the embodiment.
Specifically, the high-availability method for a MapReduce task comprises the following steps:
S1: Select a standby node from the cluster and store the standby node.
The data nodes recommended by the MapReduce task data node recommendation module are saved as standby nodes; the recommended nodes are the nodes with optimal performance. The number of selected standby nodes is not less than the total number of nodes currently executing MapReduce tasks.
The number of standby nodes set in this embodiment is the preferred implementation; in other implementations, the number of standby nodes may be set as needed.
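Step S1 can be sketched as a ranking over idle nodes. The NodeInfo type, its score field and the rule that a lower score means better performance are assumptions made for this illustration; the source states only that the best-performing idle nodes are chosen and that there must be at least as many standby nodes as execution nodes.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class NodeInfo:
    name: str
    idle: bool
    score: float  # hypothetical composite load score; lower is assumed better


def select_standby_nodes(nodes: List[NodeInfo], executing_count: int) -> List[str]:
    """Return the names of the best idle nodes, one per execution node."""
    # Only idle nodes are eligible; rank them from best to worst.
    idle = sorted((n for n in nodes if n.idle), key=lambda n: n.score)
    if len(idle) < executing_count:
        raise ValueError("not enough idle nodes to cover every execution node")
    # The minimum count the source allows: as many standby nodes as execution nodes.
    return [n.name for n in idle[:executing_count]]
```

The sketch returns exactly the minimum number of standby nodes; a larger pool is equally consistent with the text.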
S2: Monitor the state of the execution node of the MapReduce task.
The I/O, Job, disk, CPU, network, memory, power supply and running time indexes of the node currently executing the MapReduce task are monitored, and a comprehensive parameter of the execution node is calculated from the collected index values. The comprehensive parameter is the weighted sum of the index values, where the weight of each index is set according to the performance requirement that the task executed by the server places on that index. When the comprehensive parameter of the execution node exceeds the set threshold, the state of the execution node is judged to be abnormal.
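The weighted-sum check in step S2 can be sketched as follows. The weight values, the threshold and the convention that every index is normalized to the range 0 to 1 with higher meaning more loaded are all assumptions for this example; the source specifies none of them.

```python
from typing import Dict

# Hypothetical weights, one per monitored index; the source says only that
# weights reflect what the task requires of each index.
WEIGHTS: Dict[str, float] = {
    "io": 0.15, "job": 0.10, "disk": 0.10, "cpu": 0.20,
    "network": 0.15, "memory": 0.15, "power": 0.05, "runtime": 0.10,
}
THRESHOLD = 0.8  # hypothetical preset threshold


def composite_parameter(metrics: Dict[str, float],
                        weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of the index values of one execution node."""
    missing = set(weights) - set(metrics)
    if missing:
        raise KeyError(f"missing index values: {sorted(missing)}")
    return sum(weights[name] * metrics[name] for name in weights)


def is_abnormal(metrics: Dict[str, float],
                threshold: float = THRESHOLD,
                weights: Dict[str, float] = WEIGHTS) -> bool:
    """A node is abnormal when its composite parameter exceeds the threshold."""
    return composite_parameter(metrics, weights) > threshold
```

With these example weights, which sum to 1, a node whose indexes all read 0.5 is treated as normal, and one whose indexes all read 0.9 is treated as abnormal.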
S3: If the state of the execution node is found to be abnormal, forward the task of the abnormal execution node to a standby node.
The storage paths of the execution logs of all current MapReduce tasks are saved, so that after the MapReduce task forwarding module is triggered and a MapReduce task switches data nodes, its logs can continue to be written to the same path.
The current task execution progress is monitored, so that after the MapReduce task forwarding module runs, the task continues from where it stopped instead of being executed from the beginning.
When an execution node in an abnormal state is detected in step S2, a standby node is randomly selected, the task execution progress and task data of the abnormal execution node are forwarded to the selected standby node, and the log storage path of the abnormal execution node is set as the log storage path of the standby node. The standby node is then controlled to continue executing the task of the abnormal execution node.
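The forwarding in step S3 can be sketched as moving a saved task record to a randomly chosen standby node. The TaskState type, its field names and the forward_task function are hypothetical; removing the chosen node from the standby pool is also an assumption, since the source does not say what happens to the pool after a switch.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TaskState:
    task_id: str
    progress: float  # fraction of the task already completed, 0.0 to 1.0
    data: bytes      # task data handed to the new node
    log_path: str    # path where the execution log is written


def forward_task(state: TaskState,
                 failed_node: str,
                 standby: List[str],
                 assignments: Dict[str, TaskState],
                 rng: Optional[random.Random] = None) -> str:
    """Move a task from an abnormal node to a randomly selected standby node."""
    if not standby:
        raise RuntimeError("no standby node available")
    chooser = rng if rng is not None else random.Random()
    target = chooser.choice(standby)  # random selection, as in the source
    standby.remove(target)            # assumption: a used standby leaves the pool
    assignments.pop(failed_node, None)
    # The same record moves over, so progress, data and log path are preserved
    # and the standby node resumes rather than restarting the task.
    assignments[target] = state
    return target
```

Because the log path travels with the task record, the standby node keeps writing to the path the abnormal node used, which matches the log continuity described above.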
As shown in fig. 2, the system 200 includes:
a standby selecting unit 210 configured to select a standby node from the cluster and store the standby node;
the state monitoring unit 220 is configured to monitor the state of the execution node of the MapReduce task;
and the task forwarding unit 230 is configured to forward the task of the abnormal execution node to the standby node if it is monitored that the state of the execution node is abnormal.
Optionally, as an embodiment of the present invention, the standby selecting unit includes:
the parameter acquisition module is configured for acquiring performance parameters of all nodes of the cluster;
and the node screening module is configured to select a plurality of idle nodes with optimal performance parameters as standby nodes, wherein the number of the standby nodes is not less than the number of the execution nodes.
Optionally, as an embodiment of the present invention, the state monitoring unit includes:
the parameter calculation module is configured to acquire performance parameters of the execution node, wherein the performance parameters are weighted summation of I/O, Job, a disk, a CPU, a network, a memory, a power supply and running time;
the parameter judgment module is configured to judge whether the performance parameter of the execution node exceeds a preset threshold value;
and the abnormity determining module is configured for determining that the state of the execution node is abnormal if the performance parameter of the execution node exceeds a preset threshold value.
Fig. 3 is a schematic structural diagram of a terminal system 300 according to an embodiment of the present invention, where the terminal system 300 may be used to execute a high availability method for a MapReduce task according to the embodiment of the present invention.
The terminal system 300 may include a processor 310, a memory 320 and a communication unit 330. These components communicate via one or more buses. Those skilled in the art will appreciate that the structure shown in the figure is not limiting: it may be a bus architecture or a star architecture, and may include more or fewer components than shown, combine certain components, or arrange the components differently.
The memory 320 may be used for storing instructions executed by the processor 310, and may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk or optical disk. The executable instructions in the memory 320, when executed by the processor 310, enable the terminal 300 to perform some or all of the steps in the method embodiments described above.
The processor 310 is the control center of the terminal. It connects the various parts of the entire terminal using various interfaces and lines, and performs the functions of the terminal and/or processes data by running or executing software programs and/or modules stored in the memory 320 and calling data stored in the memory. The processor may be composed of integrated circuits (ICs), for example a single packaged IC, or a plurality of packaged ICs with the same or different functions connected together. For example, the processor 310 may include only a central processing unit (CPU). In the embodiment of the present invention, the CPU may have a single operation core or multiple operation cores.
The communication unit 330 is configured to establish a communication channel so that the terminal can communicate with other terminals, receiving user data sent by other terminals or sending user data to other terminals.
The present invention also provides a computer storage medium. The computer storage medium may store a program which, when executed, may carry out some or all of the steps in the embodiments provided by the present invention. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM) or a random access memory (RAM).
Therefore, in the present invention a standby node is selected from the cluster, the state of the execution node of the MapReduce task is monitored in real time, and the task of an abnormal execution node is transferred to the standby node as soon as the state of the execution node becomes abnormal, thereby achieving high availability of the MapReduce task. The invention allows the task to continue executing without interruption, which saves human resources and helps ensure product quality; the entire process is recorded in logs, which makes subsequent review faster and more convenient. For the technical effects that this embodiment can achieve, refer to the description above; details are not repeated here.
Those skilled in the art will readily appreciate that the techniques of the embodiments of the present invention may be implemented by means of software plus the necessary general-purpose hardware platform. Based on this understanding, the technical solutions in the embodiments of the present invention may be embodied in the form of a software product. The computer software product is stored in a storage medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, and includes instructions for enabling a computer terminal (which may be a personal computer, a server, a second terminal, a network terminal or the like) to perform all or part of the steps of the method in the embodiments of the present invention.
The same and similar parts in the various embodiments in this specification may be referred to each other. Especially, for the terminal embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and the relevant points can be referred to the description in the method embodiment.
In the embodiments provided in the present invention, it should be understood that the disclosed system and method can be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the units is only one logical functional division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, systems or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
Although the present invention has been described in detail with reference to the drawings and in connection with the preferred embodiments, the present invention is not limited thereto. Those skilled in the art can make various equivalent modifications or substitutions to the embodiments of the present invention without departing from its spirit and scope, and such modifications or substitutions, as well as any changes or substitutions that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention, fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A high-availability method for a MapReduce task is characterized by comprising the following steps:
selecting a standby node from a cluster and storing the standby node;
monitoring the state of an execution node of the MapReduce task;
and if the state of the execution node is monitored to be abnormal, forwarding the task of the abnormal execution node to a standby node.
2. The method of claim 1, wherein selecting the standby node from the cluster comprises:
collecting performance parameters of all nodes of a cluster;
selecting a plurality of idle nodes with optimal performance parameters as standby nodes, wherein the number of the standby nodes is not less than the number of the execution nodes.
3. The method according to claim 1, wherein the monitoring of the state of the executing node of the MapReduce task comprises:
acquiring a performance parameter of an execution node, wherein the performance parameter is a weighted sum of the I/O, Job, disk, CPU, network, memory, power supply and running time indexes;
judging whether the performance parameter of the execution node exceeds a preset threshold value;
and if so, judging that the state of the execution node is abnormal.
4. The method of claim 1, further comprising:
saving the log storage path of the execution node;
and monitoring and storing the task execution progress of the execution node in real time.
5. The method of claim 4, wherein forwarding the task of the abnormal execution node to the standby node comprises:
forwarding the task execution progress and the task data of the abnormal execution node to a standby node;
and setting the log storage path of the abnormal execution node as the log storage path of the standby node.
6. A high availability system for a MapReduce task, comprising:
the standby selecting unit is configured to select a standby node from the cluster and store the standby node;
the state monitoring unit is configured to monitor the state of the execution node of the MapReduce task;
and the task forwarding unit is configured to forward the task of the abnormal execution node to the standby node if the abnormal state of the execution node is monitored.
7. The system of claim 6, wherein the standby selecting unit comprises:
the parameter acquisition module is configured for acquiring performance parameters of all nodes of the cluster;
and the node screening module is configured to select a plurality of idle nodes with optimal performance parameters as standby nodes, wherein the number of the standby nodes is not less than the number of the execution nodes.
8. The system of claim 6, wherein the state monitoring unit comprises:
the parameter calculation module is configured to acquire performance parameters of the execution node, wherein the performance parameters are weighted summation of I/O, Job, a disk, a CPU, a network, a memory, a power supply and running time;
the parameter judgment module is configured to judge whether the performance parameter of the execution node exceeds a preset threshold value;
and the abnormity determining module is configured for determining that the state of the execution node is abnormal if the performance parameter of the execution node exceeds a preset threshold value.
9. A terminal, comprising:
a processor;
a memory for storing instructions for execution by the processor;
wherein the processor is configured to perform the method of any one of claims 1-5.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-5.
CN201911283083.9A 2019-12-13 2019-12-13 High-availability method, system, terminal and storage medium for MapReduce task Withdrawn CN111181774A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911283083.9A CN111181774A (en) 2019-12-13 2019-12-13 High-availability method, system, terminal and storage medium for MapReduce task

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911283083.9A CN111181774A (en) 2019-12-13 2019-12-13 High-availability method, system, terminal and storage medium for MapReduce task

Publications (1)

Publication Number Publication Date
CN111181774A true CN111181774A (en) 2020-05-19

Family

ID=70648855

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911283083.9A Withdrawn CN111181774A (en) 2019-12-13 2019-12-13 High-availability method, system, terminal and storage medium for MapReduce task

Country Status (1)

Country Link
CN (1) CN111181774A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111818159A (en) * 2020-07-08 2020-10-23 腾讯科技(深圳)有限公司 Data processing node management method, device, equipment and storage medium
CN111813565A (en) * 2020-09-15 2020-10-23 北京东方通科技股份有限公司 Method and system for balancing workload in a grid computing environment
CN113127310A (en) * 2021-04-30 2021-07-16 北京奇艺世纪科技有限公司 Task processing method and device, electronic equipment and storage medium
CN114039836A (en) * 2021-11-05 2022-02-11 光大科技有限公司 Fault processing method and device for Exporter collector

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111818159A (en) * 2020-07-08 2020-10-23 腾讯科技(深圳)有限公司 Data processing node management method, device, equipment and storage medium
CN111818159B (en) * 2020-07-08 2024-04-05 腾讯科技(深圳)有限公司 Management method, device, equipment and storage medium of data processing node
CN111813565A (en) * 2020-09-15 2020-10-23 北京东方通科技股份有限公司 Method and system for balancing workload in a grid computing environment
CN113127310A (en) * 2021-04-30 2021-07-16 北京奇艺世纪科技有限公司 Task processing method and device, electronic equipment and storage medium
CN113127310B (en) * 2021-04-30 2023-09-01 北京奇艺世纪科技有限公司 Task processing method and device, electronic equipment and storage medium
CN114039836A (en) * 2021-11-05 2022-02-11 光大科技有限公司 Fault processing method and device for Exporter collector

Similar Documents

Publication Publication Date Title
CN111181774A (en) High-availability method, system, terminal and storage medium for MapReduce task
CN113014634B (en) Cluster election processing method, device, equipment and storage medium
US20150115711A1 (en) Multi-level data center consolidated power control
CN104065741A (en) Data collection system and method
EP3201717B1 (en) Monitoring of shared server set power supply units
CN107451147A (en) A kind of method and apparatus of kafka clusters switching at runtime
CN110851320A (en) Server downtime supervision method, system, terminal and storage medium
CN110727556A (en) BMC health state monitoring method, system, terminal and storage medium
CN113656168A (en) Method, system, medium and equipment for automatic disaster recovery and scheduling of traffic
CN111181780A (en) HA cluster-based host pool switching method, system, terminal and storage medium
CN112737800A (en) Service node fault positioning method, call chain generation method and server
CN115145769A (en) Intelligent network card and power supply method, device and medium thereof
CN103634167B (en) Security configuration check method and system for target hosts in cloud environment
CN108376110A (en) A kind of automatic testing method, system and terminal device
US10169138B2 (en) System and method for self-healing a database server in a cluster
CN112732408A (en) Method for computing node resource optimization
CN112492011A (en) Distributed storage system fault switching method, system, terminal and storage medium
CN111062503A (en) Power grid monitoring alarm processing method, system, terminal and storage medium
Devi et al. Multi level fault tolerance in cloud environment
CN112363826B (en) Project resource comprehensive management system, method, terminal and storage medium
CN110703988B (en) Storage pool creating method, system, terminal and storage medium for distributed storage
CN113242302A (en) Data access request processing method and device, computer equipment and medium
CN112131077A (en) Fault node positioning method and device and database cluster system
CN111949216A (en) Method, system, terminal and storage medium for automatically expanding storage volume of cloud platform
CN113254245A (en) Fault detection method and system for storage cluster

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20200519