CN111930493A

CN111930493A - NodeManager state management method and device in cluster and computing equipment

Info

Publication number: CN111930493A
Application number: CN201910394996.1A
Authority: CN
Inventors: 李瑶; 许佳
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Group Hubei Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Group Hubei Co Ltd
Priority date: 2019-05-13
Filing date: 2019-05-13
Publication date: 2020-11-13
Anticipated expiration: 2039-05-13
Also published as: CN111930493B

Abstract

The embodiment of the invention relates to the technical field of distributed resource management and scheduling systems, and discloses a method, a device and a computing device for managing NodeManager states in a cluster. The method comprises the following steps: collecting network load information of a cluster, and evaluating the hardware state of the cluster according to the network load information; determining the health state of the nodes in the cluster according to the evaluation result; and when the state of a node is unhealthy, taking the NodeManager offline. In this way, the embodiment of the invention predicts NodeManager faults and takes the NodeManager offline automatically before a fault occurs, ensures stable operation of the system, and avoids task failures caused by Container allocation failing when the node host is preempted by multiple applications.

Description

NodeManager state management method and device in cluster and computing equipment

Technical Field

The embodiment of the invention relates to the technical field of distributed resource management and scheduling systems, in particular to a method, a device and computing equipment for managing NodeManager states in a cluster.

Background

With the development of computer technology, various computing frameworks for data-intensive applications have emerged, such as MapReduce, Spark, S4, Storm, etc. When adopting computing frameworks, users generally weigh factors such as resource utilization, operation and maintenance cost, and data sharing, and usually want to deploy all the frameworks on a common cluster so that cluster resources are shared and used uniformly. Unified resource management and scheduling platforms were created for this purpose, a typical example being YARN (Yet Another Resource Negotiator).

YARN is divided into two roles, the ResourceManager (global resource manager, RM) and the NodeManager (node manager, NM). The ResourceManager is mainly responsible for global allocation and management, and the NodeManager is responsible for resource allocation and management of an individual node. After receiving a task, the NodeManager can allocate the Application Master and Containers; when the host resource is not exclusive to YARN, a resource application made to the ResourceManager may fail.

In the prior art, YARN resource allocation treats only CPU and memory as computing resources, and these are divided in advance through the yarn-site.xml configuration when the cluster starts. The ResourceManager and the NodeManager maintain their connection through heartbeats and cannot take the state of the network into account when allocating resources. In addition, Impala, which uses an MPP architecture, is also deployed on the hosts of the Hadoop cluster, but its resource allocation is not managed by YARN. When an MPP aggregation query is executed, a large amount of data accumulates in memory; if resources continue to be requested at that time according to the memory and CPU in the configuration, Container allocation fails and the task fails as a result. An ad hoc query occupies a large amount of memory, but only for a short time, so reserving all of that memory in advance would waste YARN resources. Therefore, this approach cannot accommodate the situation where the node host is preempted by multiple applications.

Disclosure of Invention

In view of the foregoing problems, embodiments of the present invention provide a method, an apparatus, and a computing device for managing NodeManager states in a cluster, which overcome the foregoing problems or at least partially solve them.

According to an aspect of an embodiment of the present invention, a method for managing NodeManager states in a cluster is provided, the method including:

collecting network load information of a cluster, and evaluating the hardware state of the cluster according to the network load information;

determining the health state of the nodes in the cluster according to the evaluation result;

and when the state of the node is unhealthy, performing offline operation on the node manager.

In an optional manner, the collecting network load information of a cluster, and evaluating a hardware state of the cluster according to the network load information, further includes:

collecting network load information of a cluster;

and evaluating the network delay of the cluster according to the network load information, and evaluating the disk state of the cluster.

In an optional manner, when the host resource is not exclusive to YARN, the method further comprises:

evaluating the CPU utilization rate and the memory utilization rate;

the determining the health status of the nodes in the cluster according to the result of the evaluation further comprises:

and determining the health state of the nodes in the cluster according to the evaluation results of the network delay, the disk state, the CPU utilization rate and the memory utilization rate.

In an optional manner, when the host resource is YARN exclusive, the method further comprises:

and when the network delay exceeds a preset value, evaluating the network delay of the cluster by combining the historical network delay and the health state record of the corresponding node.

In an optional manner, the method further comprises:

reconfiguring CPU resources and memory resources;

when the state health of the nodes in the cluster is determined according to the evaluation of the hardware state of the cluster, modifying the parameters of the NodeManager configuration file into the reconfigured values;

and carrying out online operation on the NodeManager.

In an optional manner, the evaluating the network delay of the cluster according to the network load information further includes:

acquiring request queuing time and processing time of an RPC queue through a JMX interface monitored by JMX in Hadoop;

summing the request queuing times of all the nodes, then averaging to obtain a reference queue time, and taking the processing time of the first host as a reference processing time;

judging whether the network delay of the first host is greater than the reference queue time or not, or whether the network delay of the second host is greater than the reference processing time or not;

and when the network delay of the first host is larger than the reference queue time or the network delay of the second host is larger than the reference processing time, determining that the state of the node is unhealthy.

In an optional manner, the evaluating the disk state of the cluster further includes:

checking the running state of the disk through a script;

judging whether the magnetic disk reports errors or not;

and when a certain disk in the disks of the cluster reports an error, determining that the state of the node is unhealthy.

In an optional manner, the evaluating the CPU usage further includes:

calculating the total core number N of the current CPU through a script, and determining the utilization rate p of the CPU used by the current non-YARN and the core number M of the CPU distributed by the NodeManager;

subtracting the product of N and (1-p) from M to obtain the evaluated value of the CPU utilization rate;

and when the evaluated value of the CPU utilization rate exceeds a preset CPU utilization rate threshold value, determining that the state of the node is unhealthy.

In an optional manner, the evaluating the memory usage further includes:

acquiring the total memory, the total memory allocated in the NodeManager and the use amount of the system process through the script;

judging whether the difference value between the total memory amount and the system process usage amount is larger than the total memory amount distributed in the NodeManager;

and when the difference value between the total memory amount and the system process usage amount is not greater than the total memory amount distributed in the NodeManager, determining that the state of the node is unhealthy.

According to another aspect of the embodiments of the present invention, there is provided a node manager state management apparatus in a cluster, the apparatus including:

the evaluation module is used for collecting network load information of a cluster and evaluating the hardware state of the cluster according to the network load information;

the determining module is used for determining the health state of the nodes in the cluster according to the evaluation result;

and the management module is used for performing offline operation on the NodeManager when the state of the node is unhealthy.

According to another aspect of embodiments of the present invention, there is provided a computing device including: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete mutual communication through the communication bus;

the memory is used for storing at least one executable instruction, and the executable instruction enables the processor to execute the operation of the NodeManager state management method in the cluster.

According to another aspect of the embodiments of the present invention, there is provided a computer storage medium, in which at least one executable instruction is stored, and the executable instruction causes a processor to execute the method for managing node manager states in a cluster as described above.

The embodiment of the invention automatically collects and evaluates the hardware state of the cluster, determines the health state of the nodes in the cluster according to the evaluation result, and carries out offline operation on the NodeManager when the state of the nodes is unhealthy, thereby realizing the prejudgment and automatic offline of the NodeManager before the fault and ensuring the stable operation of the system; meanwhile, the embodiment of the invention does not only evaluate the health state of the node according to the states of the memory and the CPU in the configuration, thereby avoiding the condition that the task fails due to the failure of Container allocation when a node host is preempted by a plurality of application programs.

The foregoing description is only an overview of the technical solutions of the embodiments of the present invention. So that the technical means of the embodiments can be understood more clearly and implemented according to the content of the description, and so that the foregoing and other objects, features, and advantages of the embodiments are easier to understand, specific embodiments of the present invention are described below.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

Fig. 1 shows a flowchart of a NodeManager state management method in a cluster according to an embodiment of the present invention;

Fig. 2 is a flowchart illustrating a NodeManager state management method in a cluster according to another embodiment of the present invention;

Fig. 3 is a flowchart illustrating a NodeManager state management method in a cluster according to another embodiment of the present invention;

Fig. 4 is a flowchart illustrating a NodeManager state management method in a cluster according to yet another embodiment of the present invention;

Fig. 5 is a flowchart illustrating a method for managing NodeManager states in a cluster according to a specific application example in an embodiment of the present invention;

Fig. 6 is a schematic structural diagram of a NodeManager state management apparatus in a cluster according to an embodiment of the present invention;

Fig. 7 is a schematic structural diagram of a computing device provided by an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.

Fig. 1 shows a flowchart of a NodeManager status management method in a cluster, which is provided in an embodiment of the present invention and is applied to a computing device, for example, a server in a communication network, a management computer in a resource unified management and scheduling platform of a cluster, and the like. As shown in fig. 1, the method comprises the steps of:

step 110: collecting network load information of a cluster, and evaluating the hardware state of the cluster according to the network load information.

In this step, the hardware state includes network latency, disk state, and the like. Generally, when the host resource is exclusive to YARN, the network state is judged from automatically collected network load information, and the disk state can be judged at the same time, so as to evaluate the hardware state of the cluster. This step further comprises the following steps:

step A1: collecting network load information of a cluster;

step A2: and evaluating the network delay of the cluster according to the network load information, and evaluating the disk state of the cluster.

Step 120: and determining the health state of the nodes in the cluster according to the evaluation result.

And judging whether the downtime risk exists according to the evaluation result, and if so, determining that the health state of the nodes in the cluster is unhealthy and needing further processing. The evaluation result may be whether the hardware state of the cluster meets a preset condition, and when the hardware state of the cluster meets the preset condition, the state of the node in the cluster is determined to be unhealthy. Or, the evaluation result may also be a score, and when the evaluated score is greater than or less than a preset threshold, the state of the node in the cluster is judged to be unhealthy. It is understood that, in step 110, the hardware status includes one or more hardware statuses, and in this case, if the result of the evaluation of a certain hardware status meets a preset condition or the score of the evaluation is greater than or less than a preset threshold, the node status of the cluster is determined to be unhealthy, without determining the health status of the node according to the evaluation result of the whole hardware.

Step 130: and when the state of the node is unhealthy, performing offline operation on the node manager.

The NodeManager is taken offline according to the condition of the current node without affecting the service, so that stable operation of the system is guaranteed. It will be appreciated that when conditions recover, the NodeManager may also be modified to appropriate parameters and brought back online, as will be described in detail later.

Fig. 2 shows a flowchart of a method for managing node manager states in a cluster according to another embodiment of the present invention. This embodiment is the case where the host resource is not exclusive to YARN. As shown in fig. 2, the method comprises the steps of:

Step 210: when the host resource is not exclusive to YARN, the CPU usage and memory usage are evaluated.

In this step, it is first judged whether the host resource is exclusive to the YARN software process.

Step 120: and determining the health state of the nodes in the cluster according to the evaluation results of the network delay, the disk state, the CPU utilization rate and the memory utilization rate.

At this time, the evaluated items include a plurality of items, and when the evaluation result of one item meets a preset condition or the evaluation score is greater than or less than a preset threshold, the node status of the cluster can be determined to be unhealthy without determining the health status of the node according to the evaluation results of all the items. For example, the node status in the cluster may be determined to be unhealthy only if the network delay is greater than a preset threshold.

Step 110, step 120 and step 130 are the same as those in the foregoing embodiments, and reference may be made to the detailed description of the foregoing embodiments, which are not repeated herein.

In this embodiment, when the host resource is not monopolized by YARN, the utilization rates of the current CPU and memory resources are analyzed, and the priorities of other applications are fully considered, so that the status of the current node is reasonably evaluated, and the node is offline and recovered without affecting the service according to the status of the current node.

Fig. 3 is a flowchart illustrating a method for managing NodeManager states in a cluster according to another embodiment of the present invention. This embodiment covers the case where the host resource is exclusive to YARN and the network latency is too large. As shown in fig. 3, the method comprises the steps of:

Wherein the hardware state includes a network delay.

Step 310: when the host resource is exclusive to the YARN and the network delay exceeds a preset value, the network delay of the cluster is evaluated by combining the historical network delay and the health state record of the corresponding node.

In this step, when the host resource is exclusive to YARN and the current network traffic is too large, other services may simply be occupying bandwidth rather than the node being unhealthy. If the node were judged unhealthy and the NodeManager taken offline in that case, the offlining would be unnecessary and would reduce system operating efficiency. Historical information can therefore be consulted, including records of past network delays and whether the node was healthy at the time. If the network delay exceeds a certain preset value, the historical network delays and the corresponding node health records are used to assist in evaluating the network delay of the cluster. If a certain percentage (e.g., 80%) of the historical records under similar network delay conditions show the node as healthy, the network delay may be determined to be normal.
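The history-assisted check described in this step can be sketched as follows. This is only an illustration under stated assumptions: the `DelayRecord` structure, the `tolerance_s` window that decides which past delays count as "similar", and the behaviour when no similar record exists are all hypothetical choices that the description does not specify; only the 80% example ratio comes from the text.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class DelayRecord:
    """One historical observation (hypothetical structure)."""
    delay_s: float   # network delay observed at the time, in seconds
    healthy: bool    # whether the node was recorded as healthy at the time


def delay_is_normal(current_delay_s: float,
                    history: Iterable[DelayRecord],
                    tolerance_s: float = 0.1,
                    healthy_ratio: float = 0.8) -> bool:
    """Return True if history suggests the current delay is normal.

    Records whose delay lies within tolerance_s of the current delay are
    treated as "similar network delay conditions" (an assumption).
    """
    similar = [r for r in history
               if abs(r.delay_s - current_delay_s) <= tolerance_s]
    if not similar:
        # Assumption: with no comparable history, do not override the
        # threshold-based judgment.
        return False
    healthy = sum(1 for r in similar if r.healthy)
    return healthy / len(similar) >= healthy_ratio
```

With four healthy records and one unhealthy record at a similar delay, the healthy share is 80%, so the delay would be treated as normal and the NodeManager would not be taken offline on that basis alone.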

Step 120: determining a health status of nodes in the cluster according to a result of the evaluation of the network delay.

In this embodiment, when the host resource is exclusive to YARN and the current network traffic is too large, other services may be occupying bandwidth. In that case, whether to issue the command to take the NodeManager offline is decided comprehensively with reference to historical traffic peaks, so that erroneous offlining is avoided.

Fig. 4 is a flowchart illustrating a node manager state management method in a cluster according to yet another embodiment of the present invention. In this embodiment, after the NodeManager is offline, after the condition is recovered, the NodeManager is modified into a suitable parameter and the online condition is recovered. As shown in fig. 4, the method comprises the steps of:

Step 440: and reconfiguring the CPU resource and the memory resource.

This step can achieve dynamic allocation and utilization of resources by program modification of values of yarn.

The CPU resources may be reconfigured by:

1. obtaining the idle time of each CPU core through the system stat command, and evaluating the overall CPU utilization;

2. calculating the CPU time with the time used by the NodeManager removed;

3. obtaining the idle CPU ratio of each core, and combining it with the number of physical CPU cores N to obtain the number of cores that should be allocated. The idle CPU ratio of a core is the proportion of idle time in the sum of that core's user, nice, system and idle time. The number of cores to allocate is calculated as $Pf_1 + Pf_2 + Pf_3 + \dots + Pf_n$, where $Pf_1$ is the idle ratio of CPU core 1, ..., and $Pf_n$ is the idle ratio of CPU core n.
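As a rough sketch of item 3, the per-core idle ratios can be summed from text in the `/proc/stat` format. The function name `cores_to_allocate` is hypothetical, and the adjustment in item 2 (removing the CPU time used by the NodeManager itself) is deliberately not modelled here, because the description does not say how that time is obtained.

```python
def cores_to_allocate(proc_stat_text: str) -> float:
    """Sum the idle ratio Pf_i of every CPU core listed in /proc/stat text.

    Each per-core line has the form "cpuK user nice system idle ...".
    Pf_i = idle / (user + nice + system + idle), as in the description.
    """
    total = 0.0
    for line in proc_stat_text.splitlines():
        fields = line.split()
        # Keep per-core lines ("cpu0", "cpu1", ...); skip the aggregate
        # "cpu" line and unrelated lines such as "intr" or "ctxt".
        if len(fields) < 5 or fields[0] == "cpu" or not fields[0].startswith("cpu"):
            continue
        user, nice, system, idle = (int(v) for v in fields[1:5])
        denominator = user + nice + system + idle
        if denominator:
            total += idle / denominator
    return total


sample = (
    "cpu  300 0 100 400 0 0 0 0 0 0\n"
    "cpu0 100 0 100 200 0 0 0 0 0 0\n"
    "cpu1 200 0 0 200 0 0 0 0 0 0\n"
    "intr 12345\n"
)
print(cores_to_allocate(sample))
```

In the sample, each of the two cores is idle half of the time, so the sum of idle ratios corresponds to one core's worth of capacity. On a real host the text would be read from `/proc/stat`; how the fractional result is rounded to a whole number of cores is left open.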

The memory resources may be reconfigured by:

1. obtaining the -Xmx value $M_x$ preset when the NodeManager Java process starts, where -Xmx is the maximum heap memory that the started Java process may occupy;

2. obtaining the total memory of the current system $M_t$ and the total memory currently occupied by the system $M_u$;

3. calculating the total memory to be allocated $M_s$ with the formula: $M_s = M_t - (M_x + M_u)$
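The memory calculation above, M_s = M_t - (M_x + M_u), can be illustrated with a short sketch. The helper names `parse_xmx` and `memory_to_allocate` are hypothetical, and clamping the result at zero is an added assumption rather than something the description states.

```python
_UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_xmx(option: str) -> int:
    """Convert a JVM heap option such as '-Xmx4g' into bytes (M_x)."""
    if not option.startswith("-Xmx") or len(option) == 4:
        raise ValueError(f"not an -Xmx option: {option!r}")
    value = option[4:]
    suffix = value[-1].lower()
    if suffix in _UNITS:
        return int(value[:-1]) * _UNITS[suffix]
    return int(value)  # a bare number is a byte count


def memory_to_allocate(total_bytes: int, used_bytes: int, xmx_bytes: int) -> int:
    """M_s = M_t - (M_x + M_u), clamped at zero (the clamp is an assumption)."""
    return max(total_bytes - (xmx_bytes + used_bytes), 0)


GIB = 1024 ** 3
m_x = parse_xmx("-Xmx4g")
print(memory_to_allocate(64 * GIB, 20 * GIB, m_x) // GIB)
```

For a host with 64 GiB in total, 20 GiB in use and a 4 GiB NodeManager heap limit, this leaves 40 GiB that could be written back into the NodeManager configuration.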

Step 450: and when the state health of the nodes in the cluster is determined according to the evaluation of the hardware state of the cluster, modifying the parameters of the NodeManager configuration file into the reconfigured values.

Step 460: and carrying out online operation on the NodeManager.

Xml configuration of the node can be modified according to the current running state of the node after offline, and when the node is online again, the memory and the CPU are allocated in a more flexible manner.

In the following, how to evaluate the network latency of the cluster according to the network load information, how to evaluate the disk state of the cluster, how to evaluate the CPU utilization, and how to evaluate the memory utilization will be further described in detail in the above embodiments.

In some embodiments, in step A2, evaluating the network delay of the cluster according to the network load information includes the following steps:

step A21: acquiring request queuing time and processing time of an RPC queue through a JMX interface monitored by JMX in Hadoop;

step A22: summing the request queuing times of all the nodes, then averaging to obtain a reference queue time, and taking the processing time of the first host as a reference processing time;

step A23: judging whether the network delay of the first host is greater than the reference queue time or not, or whether the network delay of the second host is greater than the reference processing time or not;

at this time, the determining the health status of the nodes in the cluster according to the evaluation result further includes:

Specifically, the network delay score of the first host and the network delay score of the second host can be calculated. The network delay score of the first host is the difference between the network delay of the first host and the reference queue time when the network delay of the first host is greater than the reference queue time; otherwise it is 0. The network delay score of the second host is the difference between the network delay of the second host and the reference processing time when the network delay of the second host is greater than the reference processing time; otherwise it is 0. The two scores are added to obtain the network delay score of the cluster. When the network delay score of the cluster is not 0, the state of the node is determined to be unhealthy.

In this embodiment, RpcQueueTimeAvgTime and RpcProcessingTimeAvgTime are collected through the JMX monitoring interface in Hadoop, so as to obtain the request queuing time and processing time of the RPC queue and the average times during normal operation of the cluster. The specific calculation may refer to the following formulas:

Reference queue time: $Tq = (Tq_1 + Tq_2 + Tq_3 + \dots + Tq_N)/N$

Reference processing time: $Tp = (Tp_1 + Tp_1 + Tp_1 + \dots + Tp_1)/N$

The current network delay time is judged as: $((t_1 - Tq) > 0\ ?\ (t_1 - Tq) : 0) + ((t_2 - Tp) > 0\ ?\ (t_2 - Tp) : 0)$

where N is the number of nodes, $t_1$ is the network latency of host one, and $t_2$ is the network latency of host two.

When the network delay time exceeds a preset value (for example, 0.8s), the node state is determined to be unhealthy.
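The scoring rule for network delay can be written out as a minimal sketch. The function names are hypothetical, and the values are assumed to be in seconds and already collected (for example from the Hadoop JMX metrics); no JMX access is shown. The `threshold_s` parameter follows the 0.8 s example, and passing 0.0 instead reproduces the stricter rule that any non-zero score is unhealthy.

```python
from typing import Sequence


def reference_queue_time(queue_times_s: Sequence[float]) -> float:
    """Tq: average of the request queuing times of all N nodes."""
    return sum(queue_times_s) / len(queue_times_s)


def network_delay_score(t1_s: float, t2_s: float, tq_s: float, tp_s: float) -> float:
    """(t1 - Tq if positive else 0) + (t2 - Tp if positive else 0)."""
    return max(t1_s - tq_s, 0.0) + max(t2_s - tp_s, 0.0)


def network_unhealthy(score_s: float, threshold_s: float = 0.8) -> bool:
    """The node is judged unhealthy when the score exceeds the preset value."""
    return score_s > threshold_s


tq = reference_queue_time([0.25, 0.5, 0.75])   # average queuing time
tp = 0.25                                      # processing time of the first host
score = network_delay_score(1.0, 1.0, tq, tp)
print(score, network_unhealthy(score))
```

Here the reference processing time is passed in directly, following step A22, which takes the first host's processing time as the reference.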

In some embodiments, in step A2, evaluating the disk state of the cluster includes the following steps:

step A21': checking the running state of the disk through a script;

step A22': judging whether the disk reports an error;

Specifically, when a disk reports no error, its disk state score is 100; when a disk reports an error, its disk state score is 0. When the score of any disk is 0, the evaluated disk state score of the cluster is 0, and when the scores of all disks are 100, the evaluated disk state score of the cluster is 100. When the evaluated disk state score of the cluster is 0, the state of the node is determined to be unhealthy.

In this embodiment, the script may execute smartctl -H sdaN (a disk check tool available on Linux) to check the running state of each disk. A disk that reports no error scores 100 and a disk that reports an error scores 0, and the final score of n disks is: $0 \in (D_1, D_2, D_3, \dots, D_n)\ ?\ 0 : 100$, where $D_1$ is the score of disk 1, ..., and $D_n$ is the score of disk n. When the disk score is 0, the node state is determined to be unhealthy.
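The disk scoring rule reduces to a simple aggregation, sketched below. The function names are hypothetical, and the per-disk error flags are assumed to have been produced already by the checking script; running the disk check itself is not shown. Treating an empty disk list as a score of 100 is also an assumption.

```python
from typing import Iterable


def disk_score(reports_error: bool) -> int:
    """A disk that reports an error scores 0; otherwise it scores 100."""
    return 0 if reports_error else 100


def cluster_disk_score(error_flags: Iterable[bool]) -> int:
    """0 if any disk scores 0, else 100 (the rule '0 in (D_1..D_n) ? 0 : 100')."""
    scores = [disk_score(flag) for flag in error_flags]
    return 0 if 0 in scores else 100


print(cluster_disk_score([False, False, True]))   # one failing disk
print(cluster_disk_score([False, False, False]))  # all disks clean
```

A single failing disk is enough to drive the cluster disk score to 0, which is the condition under which the node is marked unhealthy.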

In some embodiments, in step A2, when the host resource is not exclusive to YARN, evaluating the CPU utilization includes the following steps:

step B1: calculating the total core number N of the current CPU through a script, and determining the utilization rate p of the CPU used by the current non-YARN and the core number M of the CPU distributed by the NodeManager;

step B2: subtracting the product of N and (1-p) from M to obtain the score of the CPU usage rate evaluation.

At this time, determining the health status of the nodes in the cluster according to the result of the evaluation of the CPU utilization, further comprising:

In this embodiment, the total number of CPU cores N can be calculated by reading /proc/stat (a file through which the Linux system exposes current system usage), the CPU utilization p of non-YARN processes and the number of cores M allocated to the NodeManager are determined, and the CPU score is $M - N(1 - p)$. When the CPU utilization score exceeds a preset value (e.g., 75% or 80%), the node state is determined to be unhealthy.
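A worked example of the CPU score M - N(1 - p) may help. The function names below are hypothetical, and the threshold is left as a required parameter because the description gives only example values for it.

```python
def cpu_score(total_cores: int, non_yarn_usage: float, nm_cores: int) -> float:
    """M - N * (1 - p): cores allocated to the NodeManager minus the cores
    left free by non-YARN processes."""
    if not 0.0 <= non_yarn_usage <= 1.0:
        raise ValueError("non_yarn_usage must be a fraction between 0 and 1")
    return nm_cores - total_cores * (1.0 - non_yarn_usage)


def cpu_unhealthy(score: float, threshold: float) -> bool:
    """The node is judged unhealthy when the score exceeds the preset value."""
    return score > threshold


# N = 32 cores, non-YARN processes use 50%, NodeManager is allocated M = 20.
print(cpu_score(32, 0.5, 20))    # 20 - 16
# Same host with non-YARN usage at 25%: 24 cores are free, more than M.
print(cpu_score(32, 0.25, 20))   # 20 - 24
```

A positive score means the NodeManager has been promised more cores than non-YARN processes currently leave free, while a negative score means there is headroom.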

In some embodiments, in step A2, when the host resource is not exclusive to YARN, evaluating the memory usage includes the following steps:

step C1: acquiring the total memory, the total memory allocated in the NodeManager and the use amount of the system process through the script;

step C2: and judging whether the difference value between the total memory amount and the system process usage amount is larger than the total memory amount distributed in the NodeManager.

At this time, determining the health status of the nodes in the cluster according to the result of the evaluation of the memory usage rate, further comprising:

Specifically, when the difference between the total memory amount and the system process usage amount is greater than the total memory amount allocated in the NodeManager, the score of the memory usage rate evaluation is 100, otherwise, the score of the memory usage rate evaluation is 0; and when the evaluated value of the memory utilization rate is 0, determining that the state of the node is unhealthy.

In this embodiment, the total memory amem, the total memory nmem allocated in the NodeManager, and the system process usage smem are obtained from the /proc/meminfo file (a file through which the Linux system exposes current memory usage). The memory score is then: amem - smem > nmem ? 100 : 0. When the memory score is 0, the node state is determined to be unhealthy.
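The memory check can be sketched as below. The parser reads text in the `/proc/meminfo` format to obtain amem from the `MemTotal` field; how smem and nmem are derived is not detailed in the description, so they are simply passed in as arguments here. All function names are hypothetical.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value}; sizes are in kB."""
    values: Dict[str, int] = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            values[key.strip()] = int(parts[0])
    return values


def memory_score(amem_kb: int, smem_kb: int, nmem_kb: int) -> int:
    """amem - smem > nmem ? 100 : 0"""
    return 100 if amem_kb - smem_kb > nmem_kb else 0


meminfo_sample = "MemTotal:       16000000 kB\nMemFree:         2000000 kB\n"
amem = parse_meminfo(meminfo_sample)["MemTotal"]
print(memory_score(amem, 4_000_000, 8_000_000))    # enough memory remains
print(memory_score(amem, 4_000_000, 12_000_000))   # allocation cannot be met
```

In the second call the memory left after system processes equals the NodeManager allocation exactly, which is not greater than it, so the score is 0 and the node would be marked unhealthy.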

The following describes the embodiment of the present invention in further detail by using a specific application example, and fig. 5 shows a flowchart of a method for managing node manager states in a cluster provided by a specific application example in the embodiment of the present invention. As shown in fig. 5, the method comprises the steps of:

step 510: and evaluating the network delay and the bad disk block of the cluster to obtain a network score and a disk score.

Step 520: judging whether the host resource is exclusive to YARN; if yes, go to step 540; otherwise, go to step 530;

step 530: and evaluating the CPU utilization rate and the memory utilization rate of the cluster to obtain a CPU score and a memory score.

Step 540: judging whether any one of the scores meets respective preset conditions; if yes, go to step 550; otherwise, step 560 is performed.

Step 550: the NodeManager is offline and the configuration is modified.

Before this step is performed, the node status is set to unhealthy.

Step 560: and continuing to operate.

Before this step is performed, the node status is set to healthy.
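The decision flow of steps 510 to 560 can be summarised in a short sketch. It assumes the four scores have already been computed as described earlier; the function names, the returned action strings, and the default threshold values are hypothetical.

```python
from typing import Optional


def node_is_unhealthy(network_score: float,
                      disk_score: int,
                      yarn_exclusive: bool,
                      cpu_score: Optional[float] = None,
                      memory_score: Optional[int] = None,
                      network_threshold: float = 0.8,
                      cpu_threshold: float = 0.0) -> bool:
    """Steps 510-540: any single score meeting its condition is enough."""
    checks = [network_score > network_threshold, disk_score == 0]
    if not yarn_exclusive:
        # Step 530 applies only when the host is shared with non-YARN software.
        if cpu_score is None or memory_score is None:
            raise ValueError("CPU and memory scores are required on shared hosts")
        checks.append(cpu_score > cpu_threshold)
        checks.append(memory_score == 0)
    return any(checks)


def next_action(unhealthy: bool) -> str:
    """Step 550 (take the NodeManager offline) or step 560 (keep running)."""
    return "offline" if unhealthy else "continue"


print(next_action(node_is_unhealthy(0.0, 100, yarn_exclusive=True)))
print(next_action(node_is_unhealthy(0.0, 100, yarn_exclusive=False,
                                    cpu_score=-4.0, memory_score=0)))
```

On a YARN-exclusive host the CPU and memory scores are not consulted at all, which mirrors the branch from step 520 directly to step 540.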

In this embodiment, the real-time states of the host's network, disk, CPU, and memory are judged comprehensively and each is scored. When any score meets its preset condition, the NodeManager role of the host is taken offline. When conditions recover (for example, the disk is repaired, memory occupancy falls, or network delay returns to normal) and the online condition is satisfied, that is, when the nodes in the cluster are determined to be healthy according to the evaluation of the hardware state of the cluster, the NodeManager is modified to appropriate parameters and brought back online.

Fig. 6 shows a schematic structural diagram of a node manager state management apparatus in a cluster according to an embodiment of the present invention. As shown in fig. 6, the apparatus 600 includes: an evaluation module 610, a determination module 620, and a management module 630.

The evaluation module 610 is configured to collect network load information of a cluster, and evaluate a hardware state of the cluster according to the network load information; the determining module 620 is configured to determine the health status of the nodes in the cluster according to the evaluation result; the management module 630 is used to perform offline operation on the NodeManager when the state of the node is unhealthy.

In an optional manner, the evaluation module 610 is further configured to:

collecting network load information of a cluster;

In an alternative manner, when the host resource is not exclusively YARN, the evaluation module 610 is further configured to:

evaluating the CPU utilization rate and the memory utilization rate;

the determining module 620 is further configured to:

In an alternative manner, when the host resource is YARN exclusive, the evaluation module 610 is further configured to:

In an optional manner, the apparatus further comprises:

a configuration module 640, configured to reconfigure CPU resources and memory resources;

a modifying module 650, configured to modify the parameters of the NodeManager configuration file to the reconfigured values when the state of the node in the cluster is determined to be healthy according to the evaluation of the hardware state of the cluster;

the management module 630 is further configured to perform an online operation on the NodeManager.
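The modification of the NodeManager configuration file could look like the following sketch. The text above does not name the parameters; the example assumes the standard YARN properties `yarn.nodemanager.resource.memory-mb` and `yarn.nodemanager.resource.cpu-vcores` in a Hadoop-style XML configuration, and the helper name `set_properties` and the sample values are hypothetical.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment in the Hadoop <configuration>/<property> layout.
SAMPLE_CONFIG = """<configuration>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>16384</value>
  </property>
</configuration>"""


def set_properties(xml_text: str, updates: dict) -> str:
    """Return the configuration XML with the given properties updated.

    Existing properties are overwritten; missing ones are appended.
    """
    root = ET.fromstring(xml_text)
    remaining = dict(updates)
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name in remaining:
            value = prop.find("value")
            if value is None:
                value = ET.SubElement(prop, "value")
            value.text = str(remaining.pop(name))
    for name, new_value in remaining.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(new_value)
    return ET.tostring(root, encoding="unicode")


# Reconfigured values (assumed for illustration).
updated = set_properties(SAMPLE_CONFIG, {
    "yarn.nodemanager.resource.memory-mb": 8192,
    "yarn.nodemanager.resource.cpu-vcores": 4,
})
```

After the file is rewritten with the reconfigured values, the NodeManager would be started again so that the new limits take effect; that step depends on the deployment and is not shown.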

In an optional manner, the evaluation module 610 is further configured to:

the determining module 620 is further configured to:

In an optional manner, the evaluation module 610 is further configured to:

checking the running state of the disk through a script;

determining whether the disk reports errors;

the determining module 620 is further configured to:
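As an illustration of the script-based disk check in this optional manner, the sketch below runs a check command and decides whether the disk reports errors. The function name, the rule used (a non-zero exit code or the word "error" in the output), and the stand-in commands are assumptions for the example; the actual script is not described in this passage.

```python
import subprocess
import sys
from typing import List


def disk_reports_errors(command: List[str], timeout_s: float = 30.0) -> bool:
    """Run a disk-check command and report whether it signals an error.

    Assumed convention: a non-zero exit code, or the word "error" in the
    output, means the disk is reporting errors.
    """
    result = subprocess.run(
        command,
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    output = (result.stdout + result.stderr).lower()
    return result.returncode != 0 or "error" in output


# Stand-in "scripts" so the sketch runs anywhere; a real deployment would
# pass the path of its own disk-check script instead.
healthy_check = [sys.executable, "-c", "print('disk ok')"]
failing_check = [sys.executable, "-c",
                 "import sys; print('I/O error on device'); sys.exit(1)"]
```

The result of such a check would then feed the health determination for the node.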

In an optional manner, the evaluation module 610 is further configured to:

the determining module 620 is further configured to:

In an optional manner, the evaluation module 610 is further configured to:

the determining module 620 is further configured to:

An embodiment of the present invention provides a computer storage medium, where at least one executable instruction is stored in the storage medium, and the executable instruction enables a processor to execute the NodeManager state management method in a cluster in any of the above method embodiments.

An embodiment of the present invention provides a computer program product, where the computer program product includes a computer program stored on a computer storage medium, where the computer program includes program instructions, and when the program instructions are executed by a computer, the computer is caused to execute the NodeManager state management method in a cluster in any of the above method embodiments.

Fig. 7 is a schematic structural diagram of a computing device according to an embodiment of the present invention. The specific embodiments of the present invention do not limit the specific implementation of the computing device.

As shown in Fig. 7, the computing device may include: a processor 702, a communication interface 704, a memory 706, and a communication bus 708.

The processor 702, the communication interface 704, and the memory 706 communicate with each other via the communication bus 708. The communication interface 704 is used for communicating with network elements of other devices, such as clients or other servers. The processor 702 is configured to execute the program 710, and may specifically execute the NodeManager state management method in the cluster in any of the method embodiments described above.

In particular, the program 710 may include program code that includes computer operating instructions.

The processor 702 may be a central processing unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement an embodiment of the present invention. The computing device includes one or more processors, which may be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs and one or more ASICs.

The memory 706 stores a program 710. The memory 706 may comprise high-speed RAM, and may also include non-volatile memory, such as at least one disk memory.

The algorithms or displays presented herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. In addition, embodiments of the present invention are not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.

In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the embodiments of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the invention and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.

Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.

Furthermore, those skilled in the art will appreciate that while some embodiments herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.

It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any ordering. These words may be interpreted as names. The steps in the above embodiments should not be construed as limiting the order of execution unless specified otherwise.

Claims

1. A NodeManager state management method in a cluster is characterized by comprising the following steps:

2. The method of claim 1, wherein collecting network load information for a cluster, evaluating a hardware state of the cluster based on the network load information, further comprises:

collecting network load information of a cluster;

3. The method of claim 2, wherein when the host resource is not exclusive to YARN, the method further comprises:

evaluating the CPU utilization rate and the memory utilization rate;

4. The method of claim 2, wherein when the host resource is YARN exclusive, the method further comprises:

5. The method according to any one of claims 1-4, further comprising:

reconfiguring CPU resources and memory resources;

and carrying out online operation on the NodeManager.

6. The method of claim 2, wherein the evaluating network latency of the cluster based on the network load information further comprises:

7. The method of claim 2, wherein the evaluating disk status of the cluster further comprises:

checking the running state of the disk through a script;

determining whether the disk reports errors;

8. The method of claim 3, wherein the evaluating CPU usage further comprises:

9. The method of claim 3, wherein the evaluating memory usage further comprises:

10. An apparatus for managing NodeManager status in a cluster, the apparatus comprising:

11. A computing device, comprising: a processor, a memory, a communication interface, and a communication bus, wherein the processor, the memory, and the communication interface communicate with one another through the communication bus;

the memory is configured to store at least one executable instruction, which causes the processor to perform the operations of the NodeManager state management method in a cluster according to any of claims 1 to 9.

12. A computer storage medium having stored therein at least one executable instruction for causing a processor to perform the method of NodeManager state management in a cluster according to any of claims 1-9.