CN113535517B

CN113535517B - Controller cluster node management method and device

Info

Publication number: CN113535517B
Application number: CN202110833833.6A
Authority: CN
Inventors: 蔡多多; 胡志远
Original assignee: Fiberhome Telecommunication Technologies Co Ltd
Current assignee: Fiberhome Telecommunication Technologies Co Ltd
Priority date: 2021-07-23
Filing date: 2021-07-23
Publication date: 2022-04-12
Anticipated expiration: 2041-07-23
Also published as: CN113535517A

Abstract

The invention relates to the technical field of networks, and provides a method and a device for managing controller cluster nodes, wherein the method comprises the following steps: the controller cluster management system samples the monitoring indexes of each controller and judges whether any one of the collected monitoring indexes of the controllers does not reach the standard; if yes, judging that the corresponding controller is abnormal, and expanding a new controller to be added into the controller cluster; if not, continuing to sample the monitoring index of the corresponding controller; the monitoring indexes comprise one or more of the utilization rate of a CPU (Central processing Unit), the utilization rate of a memory, the number of equipment connections, the number of tunnel services, the process state, the cluster state and the network state; according to the method, various monitoring indexes of the controller are monitored, and once any one of the monitoring indexes is abnormal, the number of controllers in the controller cluster is increased as required; the controller cluster node management method and device provided by the invention realize intelligent operation and maintenance of cluster deployment, and improve stability and maintainability.

Description

Controller cluster node management method and device

Technical Field

The present invention relates to the field of network technologies, and in particular, to a method and an apparatus for managing controller cluster nodes.

Background

The controller can be deployed in a cluster mode in engineering application, and with the development of network technologies such as virtualization, cloud computing, platform universalization and SDN/NFV, the cluster deployment and management of the controller are applied more and more. The universal server resources are adopted, a resource pool is built, and users apply and use the resources according to needs, so that the network construction cost is greatly reduced, and the network resource utilization rate is also greatly improved.

The controller is deployed in a cluster, data are stored in each node, and each node can obtain full service data; and the controller cluster management system monitors the state and the change of the cluster nodes in real time, sets a threshold value for detection, and dynamically changes the cluster scale.

A server resource pool is deployed; controller nodes are created on demand when a need for expansion is detected, and node resources are deleted and reclaimed promptly when the expansion is no longer needed. The controller cluster management system dynamically requests and deletes controller nodes according to real-time detection results, which gives the controller cluster automatic scale-out and scale-in capability, improves the stability and reliability of the cluster, simplifies operation and maintenance management, and improves customer experience.

In the prior art, the number of controllers in a controller cluster must be defined when the cluster is created. For example, if one controller can control 50 switches and an SDN controller cluster must control 300 switches, the cluster must include at least 6 controllers, so a controller cluster of 6 controllers is created. However, because the number of controllers is predefined, it cannot be dynamically increased or decreased when the traffic increases or decreases, for example when the number of switches in the network grows to 400, and the dynamic demand of the traffic therefore cannot be satisfied.

In view of the above, overcoming the drawbacks of the prior art is an urgent problem in the art.

Disclosure of Invention

The technical problem to be solved by the invention is as follows:

when the CPU utilization rate of a controller in the controller cluster exceeds a set utilization rate threshold, the memory utilization rate exceeds a set utilization rate threshold, the number of device connections is overloaded, the number of tunnel services is overloaded, the process state is abnormal, the cluster state is abnormal, or the network state is abnormal, the number of controllers cannot be increased as needed.

In order to achieve the purpose, the invention adopts the following technical scheme:

in a first aspect, the present invention provides a method for managing nodes of a controller cluster, including: the controller cluster management system samples the monitoring indexes of each controller and judges whether any one of the collected monitoring indexes of the controllers does not reach the standard; if yes, judging that the corresponding controller is abnormal, and expanding a new controller to be added into the controller cluster; if not, continuing to sample the monitoring index of the corresponding controller;

the monitoring index comprises one or more of the utilization rate of a CPU, the utilization rate of a memory, the number of equipment connections, the number of tunnel services, the process state, the cluster state and the network state.

Preferably, if the collected CPU utilization rate or memory utilization rate of the controller is greater than a set utilization rate threshold, it is determined that the collected CPU utilization rate or memory utilization rate of the controller does not reach the standard; the set utilization rate threshold is associated with the collected device connection number and tunnel service number of the corresponding controller.

Preferably, the manner of determining whether the collected CPU usage rate or memory usage rate of the controller is greater than the set usage rate threshold includes:

when the utilization rates of the CPUs of the controllers acquired for N times are all larger than the set utilization rate threshold value, judging that the acquired utilization rates of the CPUs of the controllers are larger than the set utilization rate threshold value;

when the utilization rates of the memories of the controllers collected for N times are all larger than the set utilization rate threshold value, judging that the utilization rate of the collected memories of the controllers is larger than the set utilization rate threshold value;

wherein N is a natural number greater than or equal to 2.

Preferably, any one of the device connection number, the tunnel service number, the process state, the cluster state, or the network state not reaching the standard includes:

the collected device connection number is larger than the set connection number, or,

the number of the collected tunnel services is larger than the set number of the services, or,

the collected process state is abnormal, or,

the collected cluster state is abnormal, or,

and the collected network state is abnormal.

Preferably, when all the monitoring indexes of an abnormal controller reach the standard in one acquisition, the corresponding controller is judged to have returned to normal, and the controller with the lowest comprehensive utilization rate is reclaimed, wherein the comprehensive utilization rate is associated with one or more of the collected CPU utilization rate, memory utilization rate, device connection number and tunnel service number.

Preferably, one controller is selected from the controller cluster as a master node, and the remaining controllers serve as slave nodes; if the abnormal controller is a slave node, the master node stops sending data messages to the corresponding slave node until that slave node returns to normal.

Preferably, one controller is selected from the controller cluster as a master node, and the remaining controllers serve as slave nodes; if the abnormal controller is a slave node, the master node gradually lengthens the period at which it sends heartbeat messages to the corresponding slave node, starting from the initial period and increasing by an incremental step size until the set maximum period is reached, and thereafter sends heartbeat messages to that slave node at the set maximum period.

Preferably, the incremental step size is positively correlated with the set usage threshold.

Preferably, if the abnormal slave node returns to normal, the master node returns to send the heartbeat message to the corresponding slave node in the initial period.

In a second aspect, the present invention provides a controller cluster node management apparatus for implementing the controller cluster node management method in the first aspect, where the controller cluster node management apparatus includes:

at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the processor for performing the controller cluster node management method of the first aspect.

Compared with the prior art, the invention has the beneficial effects that:

monitoring indexes of the controller cluster controller comprise one or more of the utilization rate of a CPU (Central processing Unit), the utilization rate of a memory, the connection number of equipment, the number of tunnel services, the process state, the cluster state and the network state, and once any one of the monitoring indexes is abnormal, the number of controllers in the controller cluster is increased as required; the controller cluster node management method provided by the invention realizes automatic intelligent operation and maintenance of cluster deployment, and improves the stability, reliability and maintainability of a controller cluster system.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments of the present invention will be briefly described below. It is obvious that the drawings described below are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.

Fig. 1 is a schematic flowchart of a method for managing nodes of a controller cluster according to an embodiment of the present invention;

fig. 2 is a schematic flowchart of a method for managing nodes of a controller cluster according to an embodiment of the present invention;

fig. 3 is a schematic structural diagram of a controller cluster node management apparatus according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

In the description of the present invention, the terms "inner", "outer", "longitudinal", "lateral", "upper", "lower", "top", "bottom", and the like indicate orientations or positional relationships based on those shown in the drawings, and are for convenience only to describe the present invention without requiring the present invention to be necessarily constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention.

In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

Example 1:

an embodiment of the present invention provides a method for managing a controller cluster node, as shown in fig. 1, including:

in step 201, the controller cluster management system samples the monitoring index of each controller.

The controllers comprise a master node and slave nodes; a controller cluster has only one master node at any given time, and the remaining controllers are slave nodes.

If any one of the collected monitoring indexes of the master node does not reach the standard, the master node is judged to be abnormal and a new controller is expanded and added to the controller cluster as a slave node; all the slave nodes in the controller cluster then elect one slave node as the new master node, and the connections between the new master node and the other slave nodes are re-established.

If any one of the collected monitoring indexes of a slave node does not reach the standard, the slave node is judged to be abnormal and a new controller is expanded and added to the controller cluster as a slave node, and the connection between the master node in the controller cluster and the newly added slave node is established.

In step 202, it is determined whether any of the collected monitoring indicators of the controller does not meet the standard.

In step 203, if yes, it is determined that the corresponding controller is abnormal, and a new controller is expanded to join the controller cluster; if not, returning to step 201, and continuing to sample the monitoring index of the corresponding controller.
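Steps 201 to 203 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Metrics` and `Limits` classes, their field names, and the function names are hypothetical, and every monitoring index is evaluated from a single acquisition.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    """One sample of a controller's monitoring indexes (hypothetical field names)."""
    cpu_usage: float          # percent
    mem_usage: float          # percent
    device_connections: int
    tunnel_services: int
    process_ok: bool
    cluster_ok: bool
    network_ok: bool


@dataclass
class Limits:
    """Configured limits (hypothetical container)."""
    usage_threshold: float    # set utilization rate threshold, percent
    max_connections: int      # set connection number
    max_services: int         # set number of services


def failed_indicators(m: Metrics, lim: Limits) -> list[str]:
    """Step 202: list the monitoring indexes that do not reach the standard."""
    checks = {
        "cpu_usage": m.cpu_usage > lim.usage_threshold,
        "mem_usage": m.mem_usage > lim.usage_threshold,
        "device_connections": m.device_connections > lim.max_connections,
        "tunnel_services": m.tunnel_services > lim.max_services,
        "process_state": not m.process_ok,
        "cluster_state": not m.cluster_ok,
        "network_state": not m.network_ok,
    }
    return [name for name, failed in checks.items() if failed]


def abnormal_controllers(samples: dict[str, Metrics], lim: Limits) -> list[str]:
    """Step 203: a controller is abnormal if any one index fails."""
    return [name for name, m in samples.items() if failed_indicators(m, lim)]


# Step 201: one round of sampled values (made-up numbers for illustration).
samples = {
    "c1": Metrics(40.0, 35.0, 2000, 500, True, True, True),
    "c2": Metrics(45.0, 50.0, 2500, 600, False, True, True),
}
limits = Limits(usage_threshold=80.0, max_connections=10000, max_services=3000)
needs_expansion = abnormal_controllers(samples, limits)
```

In this sketch a non-empty `needs_expansion` list would trigger the expansion of a new controller, and an empty list means sampling simply continues.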

And the controller cluster management system redistributes the equipment connection number and the tunnel service number which cannot be controlled by the abnormal controller to the new controller and the controller without the abnormality in a load sharing manner, and continuously monitors the monitoring index after redistribution.

If, while the abnormality exists, a user of the controller cluster management system adds new device connections and tunnel services, the controller cluster management system redistributes, in a load sharing manner, both the newly added device connections and tunnel services and those that the abnormal controller cannot control to the new controller and the controllers that are not abnormal, and continues to monitor the monitoring indexes after redistribution.

And when the monitoring indexes only comprise the utilization rate of the CPU and the utilization rate of the memory, the controller cluster management system samples the utilization rate of the CPU and the utilization rate of the memory of each controller, if the collected utilization rate of the CPU of the controller is greater than a set utilization rate threshold value and the collected utilization rate of the memory is greater than a set utilization rate threshold value, the corresponding controller is judged to be abnormal, and a new controller is expanded and added into the controller cluster.

When the monitoring index only comprises the equipment connection number and the tunnel service number, the controller cluster management system samples the equipment connection number and the tunnel service number of each controller, if the collected equipment connection number of the controller is larger than the set connection number and the tunnel service number is larger than the set service number, the corresponding controller is judged to be abnormal, and a new controller is expanded to be added into the controller cluster.

And when the monitoring index only comprises a process state, a cluster state and a network state, the controller cluster management system samples the process state, the cluster state and the network state of each controller, if the process state, the cluster state and the network state of the acquired controller are not all normal, the corresponding controller is judged to be abnormal, and a new controller is expanded to be added into the controller cluster.

The method for judging whether the specified item in the collected monitoring index of the controller does not reach the standard includes two modes:

the first mode is as follows: if the specified item in the monitoring index of the controller is abnormal in one acquisition, judging that the specified item in the acquired monitoring index of the controller does not reach the standard;

the second way is: if the specified item in the monitoring index of the controller is abnormal in the continuous N-time collection, judging that the specified item in the collected monitoring index of the controller does not reach the standard; wherein N is a natural number greater than or equal to 2.

In the embodiment of the invention, if the collected CPU utilization rate or memory utilization rate of the controller is greater than the set utilization rate threshold, the collected CPU utilization rate or memory utilization rate of the controller is judged not to reach the standard; the set utilization rate threshold is associated with the collected device connection number and tunnel service number of the corresponding controller.

In the embodiment of the invention, the set utilization rate threshold is dynamically associated with the acquired equipment connection number and the tunnel service number of the corresponding controller, so that the use condition of resources can be fed back more accurately and effectively, and the threshold is set by combining the actual equipment connection number and the tunnel service number; compared with the existing fixed threshold, the resource utilization rate of the feedback system can be more reasonable, the practical application is better met, and once the utilization rate of a CPU or a memory of the system is too high, the fault can be detected more quickly.

If the monitoring index includes the device connection number and the tunnel service number in addition to the CPU utilization and the memory utilization, the set utilization threshold is associated with the collected device connection number and the collected tunnel service number of the corresponding controller, where the tunnel service number has a larger influence on the set utilization threshold than the device connection number.

The relationship between the set utilization threshold and the number of device connections and the number of tunnel services may be represented as:

$$P = P_{init} + (P_{max} - P_{init}) \cdot \left[ \left( \frac{B_t}{B} \right)^2 + \left( 1 - \frac{B_t}{B} \right) \cdot \frac{D_t}{D} \right] \tag{1}$$

wherein $P$ is the set utilization rate threshold, $P_{init}$ is the average of the initial CPU utilization rate and the initial memory utilization rate, $P_{max}$ is the average of the maximum CPU utilization rate and the maximum memory utilization rate, $B_t$ is the collected number of tunnel services, $B$ is the set number of services, $D_t$ is the collected device connection number, and $D$ is the set connection number.

The number of device connections in a network is generally planned in advance, so the probability of adding devices later is relatively small, whereas the probability of adding tunnel services is large. Therefore, in the controller cluster node management method provided in the embodiment of the present invention, the influence of the tunnel service number grows as the number of tunnel services increases, and once the number of tunnel services exceeds 50% of the full specification, the weight of the tunnel service number in the set utilization rate threshold is increased.

For example:

in the prior art, the threshold is typically set at 80%.

In the embodiment of the invention, the average value $P_{init}$ of the initial CPU utilization rate and the initial memory utilization rate is set to 30%, the average value $P_{max}$ of the maximum CPU utilization rate and the maximum memory utilization rate is set to 80%, the set number of services $B$ is 3000, and the set number of connections $D$ is 10000.

In the first case: the controller cluster system has no services and no device connections, i.e. $B_t$ is 0 and $D_t$ is also 0; the set utilization rate threshold calculated by formula (1) is 30%, that is, the system can be judged to be abnormal as soon as the average of the collected CPU utilization rate and the collected memory utilization rate is higher than 30%; in the prior art, however, the system is judged to be abnormal only when that average is higher than 80%; therefore, when there are no services and no device connections, the set utilization rate threshold dynamically associated with the tunnel service number and the device connection number in the controller cluster node management method provided in the embodiment of the present invention can detect a system abnormality more quickly.

In the second case: the tunnel service number $B_t$ of the controller cluster system is 1500 and the device connection number $D_t$ is 5000, that is, the number of tunnel services is only half of the full-specification set number of services and the number of device connections is only half of the full-specification set number of connections; the set utilization rate threshold calculated by formula (1) is 55%, that is, the system can be judged to be abnormal as soon as the average of the collected CPU utilization rate and the collected memory utilization rate is higher than 55%; in the prior art, however, the system is judged to be abnormal only when that average is higher than 80%; therefore, when the number of tunnel services and the number of device connections are each only half of the full specification, the set utilization rate threshold dynamically associated with the tunnel service number and the device connection number in the controller cluster node management method provided by the embodiment of the present invention can detect a system abnormality more quickly.

In addition, in actual use, when the number of tunnel services is only half of the full-specification set number of services and the number of device connections is only half of the full-specification set number of connections, the average of the collected CPU utilization rate and the collected memory utilization rate generally does not exceed 55%; under this condition, whether the system is abnormal is judged by continuing to monitor the average of the collected CPU utilization rate and the collected memory utilization rate; therefore, the monitoring indexes in the controller cluster node management method provided in the embodiment of the present invention further include the process state, the cluster state and the network state, so that once any one of the collected process state, cluster state and network state is abnormal, the system can be judged to be abnormal without waiting until the average of the collected CPU utilization rate and the collected memory utilization rate exceeds 55%.

In the third case: the tunnel service number $B_t$ of the controller cluster system is 3000 and the device connection number $D_t$ is 10000, that is, the number of tunnel services and the number of device connections are both at the full-load specification; the set utilization rate threshold calculated by formula (1) is 80%, which is the same as the threshold set in the prior art; therefore, when the number of tunnel services and the number of device connections are both at the full-load specification and the monitoring indexes only include the CPU utilization rate, the memory utilization rate, the tunnel service number and the device connection number, the set utilization rate threshold dynamically associated with the tunnel service number and the device connection number in the controller cluster node management method provided in the embodiment of the present invention is equivalent to the fixed threshold in the prior art.
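The three cases above can be reproduced with a short Python sketch of formula (1). The function name and parameter names are illustrative assumptions; the numeric inputs are the example values given in the text (initial average 30%, maximum average 80%, 3000 set services, 10000 set connections).

```python
def dynamic_threshold(p_init: float, p_max: float,
                      b_t: int, b: int, d_t: int, d: int) -> float:
    """Evaluate formula (1): the set utilization rate threshold, in percent.

    p_init, p_max: initial and maximum average utilization rates (percent)
    b_t, b: collected and set number of tunnel services
    d_t, d: collected and set number of device connections
    """
    tunnel_ratio = b_t / b
    device_ratio = d_t / d
    # The tunnel term grows quadratically, while the device term is scaled
    # down as the tunnel load approaches the full specification.
    weight = tunnel_ratio ** 2 + (1 - tunnel_ratio) * device_ratio
    return p_init + (p_max - p_init) * weight


case_1 = dynamic_threshold(30, 80, 0, 3000, 0, 10000)          # no load
case_2 = dynamic_threshold(30, 80, 1500, 3000, 5000, 10000)    # half load
case_3 = dynamic_threshold(30, 80, 3000, 3000, 10000, 10000)   # full load
print(case_1, case_2, case_3)  # 30.0 55.0 80.0
```

The half-load case works out as 0.5² + (1 − 0.5) × 0.5 = 0.5, so the threshold is 30 + 50 × 0.5 = 55.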

In the controller cluster node management method provided by the embodiment of the present invention, the set utilization rate threshold dynamically associated with the number of tunnel services and the number of device connections is compared with a fixed threshold in the prior art, and compared with a technical method in which a monitoring index only includes the utilization rate of a CPU and the utilization rate of a memory in the prior art, the method can feed back the utilization rate of system resources more reasonably, better meet practical application, detect a fault more quickly, and have a better detection strategy.

In the embodiment of the present invention, the manner of determining whether the collected CPU utilization or memory utilization of the controller is greater than the set utilization threshold includes:

and when the CPU utilization rates of the controllers acquired for N times are all larger than the set utilization rate threshold value, judging that the acquired CPU utilization rates of the controllers are larger than the set utilization rate threshold value.

And when the utilization rates of the memories of the controllers acquired for N times are all larger than the set utilization rate threshold value, judging that the acquired utilization rates of the memories of the controllers are larger than the set utilization rate threshold value.

Wherein N is a natural number greater than or equal to 2.

Taking N as 3 for example, when the utilization rates of the CPUs of the controllers acquired for 3 consecutive times are all greater than the set utilization rate threshold, it is determined that the acquired utilization rate of the CPUs of the controllers is greater than the set utilization rate threshold.

Consecutive means that the count is accumulated over the 1st, 2nd and 3rd immediately adjacent acquisitions; it is not simply the total number of acquisitions that exceed the threshold.

For example: the CPU utilization rate of the controller collected in the 1st acquisition is greater than the set utilization rate threshold, so the abnormal count is incremented by 1, giving an abnormal count of 1; the CPU utilization rate collected in the immediately following 2nd acquisition is also greater than the set utilization rate threshold, so the abnormal count is incremented by 1, giving an abnormal count of 2; the CPU utilization rate collected in the 3rd acquisition is still greater than the set utilization rate threshold, so the abnormal count is incremented by 1, giving an abnormal count of 3; it is then determined that the collected CPU utilization rate of the corresponding controller is greater than the set utilization rate threshold.

And if the utilization rate of the CPU of the controller acquired in one of the three immediately adjacent times is less than or equal to the set utilization rate threshold value, determining that the utilization rate of the CPU of the corresponding controller is not greater than the set utilization rate threshold value.

For example: the CPU utilization rate of the controller collected in the 1st acquisition is greater than the set utilization rate threshold, so the abnormal count is incremented by 1, giving an abnormal count of 1; the CPU utilization rate collected in the immediately following 2nd acquisition is smaller than the set utilization rate threshold, so the abnormal count is reset to 0; the CPU utilization rate collected in the 3rd acquisition is greater than the set utilization rate threshold, so the abnormal count is incremented by 1, giving an abnormal count of 1; it is then determined that the collected CPU utilization rate of the corresponding controller is not greater than the set utilization rate threshold.
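The consecutive-count rule illustrated by the two examples above can be sketched as a small Python function. The function name and the sample values are assumptions made for illustration; the reset-to-zero behaviour follows the text.

```python
def exceeds_n_consecutive(samples: list[float], threshold: float, n: int = 3) -> bool:
    """Return True if n immediately adjacent samples all exceed the threshold."""
    if n < 2:
        raise ValueError("N must be a natural number greater than or equal to 2")
    abnormal_count = 0
    for value in samples:
        if value > threshold:
            abnormal_count += 1        # another adjacent abnormal acquisition
            if abnormal_count >= n:
                return True
        else:
            abnormal_count = 0         # a normal acquisition resets the count
    return False


# Three adjacent acquisitions above an 80% threshold: judged abnormal.
all_high = exceeds_n_consecutive([85.0, 90.0, 88.0], threshold=80.0, n=3)
# The 2nd acquisition drops below the threshold: the count restarts.
interrupted = exceeds_n_consecutive([85.0, 70.0, 88.0], threshold=80.0, n=3)
print(all_high, interrupted)  # True False
```

Requiring adjacent acquisitions rather than a running total keeps a single short spike from triggering an expansion.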

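In the embodiment of the present invention, any one of the device connection number, the tunnel service number, the process state, the cluster state, or the network state not reaching the standard includes:

the collected device connection number is larger than the set connection number, or,

the number of the collected tunnel services is larger than the set number of the services, or,

the collected process state is abnormal, or,

the collected cluster state is abnormal, or,

the collected network state is abnormal.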

The manner of determining whether the collected device connection number is greater than the set connection number may be determined according to whether the device connection number collected once is greater than the set connection number, or may be determined according to whether the device connection numbers collected N consecutive times are all greater than the set connection number.

The manner of determining whether the number of the acquired tunnel services is greater than the set number of the services may be determined according to whether the number of the tunnel services acquired at one time is greater than the set number of the services, or may be determined according to whether the number of the tunnel services acquired N times is greater than the set number of the services.

The manner of determining whether the acquired process state is abnormal may be determining according to whether the process state acquired once is abnormal, or determining according to whether all the process states acquired N times are abnormal.

The manner of determining whether the collected cluster state is abnormal may be determined according to whether the cluster state collected once is abnormal, or may be determined according to whether all the cluster states collected N times are abnormal.

The manner of determining whether the acquired network state is abnormal may be determining according to whether the network state acquired once is abnormal, or determining according to whether the network states acquired N times are all abnormal.

Wherein N is a natural number greater than or equal to 2.

In the embodiment of the invention, when all the monitoring indexes of an abnormal controller reach the standard in one acquisition, the corresponding controller is judged to have returned to normal, and the controller with the lowest comprehensive utilization rate is reclaimed, wherein the comprehensive utilization rate is associated with one or more of the collected CPU utilization rate, the collected memory utilization rate, the collected device connection number and the collected tunnel service number.

If the monitoring index only comprises the utilization rate of the CPU, the comprehensive utilization rate is the collected utilization rate of the CPU; if the monitoring indexes comprise the utilization rate of the CPU, the utilization rate of the memory, the equipment connection number and the tunnel service number, the comprehensive utilization rate is obtained by weighting and calculating the collected utilization rate of the CPU, the collected utilization rate of the memory, the equipment load rate and the tunnel load rate; the equipment load rate is obtained by dividing the collected equipment connection number by the set connection number; the tunnel load rate is obtained by dividing the collected tunnel service number by the set service number.

Reclaiming the controller with the lowest comprehensive utilization rate specifically means reclaiming the slave node with the lowest comprehensive utilization rate; if the controller with the lowest comprehensive utilization rate is the master node, the master node is skipped, and the slave node with the lowest comprehensive utilization rate among the remaining slave nodes is selected for reclaiming.
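The weighted calculation and the master-skipping selection described above might look like the following Python sketch. The equal weights are purely an assumption, since the text does not state the weighting, and the function names and sample scores are hypothetical. Utilization rates are expressed here as fractions between 0 and 1 so that they are on the same scale as the load rates.

```python
def comprehensive_utilization(cpu: float, mem: float,
                              device_conns: int, tunnel_svcs: int,
                              max_conns: int, max_svcs: int,
                              weights=(0.25, 0.25, 0.25, 0.25)) -> float:
    """Weighted combination of CPU, memory, device load and tunnel load.

    The default equal weights are an assumption for illustration only.
    """
    device_load = device_conns / max_conns   # collected / set connection number
    tunnel_load = tunnel_svcs / max_svcs     # collected / set number of services
    w_cpu, w_mem, w_dev, w_tun = weights
    return w_cpu * cpu + w_mem * mem + w_dev * device_load + w_tun * tunnel_load


def pick_node_to_reclaim(scores: dict[str, float], master: str):
    """Pick the slave node with the lowest score; the master is never chosen."""
    slaves = {name: score for name, score in scores.items() if name != master}
    if not slaves:
        return None                          # nothing can be reclaimed
    return min(slaves, key=slaves.get)


score = comprehensive_utilization(0.4, 0.6, 5000, 1500, 10000, 3000)
# "c1" has the lowest score but is the master, so "c3" is selected instead.
chosen = pick_node_to_reclaim({"c1": 0.1, "c2": 0.3, "c3": 0.2}, master="c1")
```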

For example, when the monitoring indexes comprise the CPU utilization rate, the memory utilization rate, the device connection number, the tunnel service number, the process state, the cluster state and the network state, an abnormal controller is judged to have returned to normal when, in a single acquisition, all of the following hold:

the collected CPU utilization rate is less than or equal to the set utilization rate threshold, and,

the collected memory utilization rate is less than or equal to the set utilization rate threshold, and,

the collected device connection number is less than or equal to the set connection number, and,

the collected tunnel service number is less than or equal to the set service number, and,

the collected process state is normal, and,

the collected cluster state is normal, and,

the collected network state is normal.
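The seven conditions listed above amount to a single conjunction over one acquisition. A minimal sketch follows, in which the `Sample` type, its field names and the function name are assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    cpu_usage: float         # collected CPU utilization rate
    mem_usage: float         # collected memory utilization rate
    device_connections: int  # collected device connection number
    tunnel_services: int     # collected tunnel service number
    process_ok: bool         # True when the process state is normal
    cluster_ok: bool         # True when the cluster state is normal
    network_ok: bool         # True when the network state is normal


def has_recovered(
    sample: Sample,
    usage_threshold: float,
    set_connections: int,
    set_services: int,
) -> bool:
    """True only if every monitoring index in this one acquisition reaches the standard."""
    return (
        sample.cpu_usage <= usage_threshold
        and sample.mem_usage <= usage_threshold
        and sample.device_connections <= set_connections
        and sample.tunnel_services <= set_services
        and sample.process_ok
        and sample.cluster_ok
        and sample.network_ok
    )
```

Values exactly equal to a threshold count as reaching the standard, matching the "less than or equal to" wording of the conditions.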

In the embodiment of the invention, one controller is elected from the controller cluster as the master node, and the remaining controllers serve as slave nodes; if the abnormal controller is a slave node, the master node stops sending data messages to that slave node until it returns to normal; after the slave node returns to normal, the master node resumes sending data messages to it at the initial period for sending data messages.

To reduce the service burden on the controller system, if the abnormal controller is a slave node, the master node may, while it has stopped sending data messages to that slave node, gradually increase the period at which it sends heartbeat messages to that slave node from the initial period to the set maximum period by an incremental step, and thereafter send heartbeat messages at the maximum period; the initial period for sending data messages to the slave node may be the same as or different from the initial period for sending heartbeat messages.

After the slave node returns to normal, the master node resumes sending data messages to it at the initial period for sending data messages, and sends heartbeat messages to it at the initial period for sending heartbeat messages.
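The master node's handling of one slave node, as described in the preceding paragraphs, can be pictured as a small piece of per-slave state. The sketch below is only one possible arrangement: the class name, method names and millisecond units are assumptions, and actual message transmission is left out.

```python
from dataclasses import dataclass, field


@dataclass
class SlaveLink:
    """State the master keeps for one slave node (hypothetical structure)."""

    data_initial_ms: float       # initial period for data messages
    heartbeat_initial_ms: float  # initial period for heartbeat messages
    heartbeat_max_ms: float      # set maximum heartbeat period
    step_ms: float               # incremental step
    abnormal: bool = field(default=False, init=False)
    heartbeat_period_ms: float = field(default=0.0, init=False)

    def __post_init__(self) -> None:
        self.heartbeat_period_ms = self.heartbeat_initial_ms

    @property
    def sending_data(self) -> bool:
        # Data messages are suspended for as long as the slave is abnormal.
        return not self.abnormal

    def mark_abnormal(self) -> None:
        self.abnormal = True

    def mark_recovered(self) -> None:
        # Data messages resume and the heartbeat returns to its initial period.
        self.abnormal = False
        self.heartbeat_period_ms = self.heartbeat_initial_ms

    def next_heartbeat_period(self) -> float:
        """Return the period for the next heartbeat, then lengthen it if abnormal."""
        current = self.heartbeat_period_ms
        if self.abnormal:
            self.heartbeat_period_ms = min(current + self.step_ms, self.heartbeat_max_ms)
        return current
```

With an initial heartbeat period of 100 ms, a maximum of 500 ms and a step of 200 ms (illustrative numbers only), an abnormal slave would be probed at 100, 300, 500, 500, ... ms until `mark_recovered()` is called.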

If the abnormal controller is the master node, a new controller is expanded and joins the controller cluster as a slave node, all the slave nodes in the controller cluster elect one slave node as the new master node, and the new master node sends data messages to the other slave nodes.

In the embodiment of the invention, in order to confirm abnormal controllers quickly, one controller is elected from the controller cluster as the master node, and the remaining controllers serve as slave nodes; if the abnormal controller is a slave node, the master node gradually increases the period at which it sends heartbeat messages to that slave node from the initial period to the set maximum period by an incremental step, and thereafter sends heartbeat messages at the maximum period.

If the abnormal controller is the master node, a new controller is expanded and joins the controller cluster as a slave node, all the slave nodes in the controller cluster elect one slave node as the new master node, and the new master node sends heartbeat messages to the other slave nodes at the initial period.

The gradual increase may be continuous, with the period lengthened at every heartbeat, or spaced, with the period lengthened at intervals of k heartbeats, where k is a natural number greater than 0.

Taking continuous increase as an example, the gradual increase is represented as:

initial period; initial period + incremental step; initial period + 2 × incremental step; initial period + 3 × incremental step; ……

For example, before a slave node becomes abnormal, the master node sends heartbeat messages to it at an initial period of 100 ms; after the slave node becomes abnormal, the period at which the master node sends heartbeat messages to it is gradually increased from the initial period of 100 ms to the set maximum period of 500 ms by an incremental step of P × 80 ms, after which heartbeat messages are sent at the set maximum period of 500 ms.

Wherein, P is a set utilization rate threshold value.

After the slave node becomes abnormal, the periods at which the master node sends heartbeat messages to that slave node are:

100 ms; 100 ms + P × 80 ms; 100 ms + 2 × P × 80 ms; 100 ms + 3 × P × 80 ms; ……; 500 ms; 500 ms; ……

After the slave node becomes abnormal, a new controller is expanded and joins the controller cluster as a slave node to carry services.

Taking k = 1, i.e. increasing at intervals of 1, the gradual increase is represented as:

initial period, initial period; initial period + incremental step, initial period + incremental step; initial period + 2 × incremental step, initial period + 2 × incremental step; ……
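The period sequences above can be generated with a few lines of code. The sketch below is written under two assumptions that are not stated in the text: P is taken to be a fraction (for instance 0.5, giving a step of 40 ms), and "increasing at intervals of k" is read as using each period value k + 1 times before adding the step, so that `repeats = k + 1`. The function and parameter names are hypothetical.

```python
from typing import List


def heartbeat_periods(
    initial_ms: float,
    max_ms: float,
    step_ms: float,
    count: int,
    repeats: int = 1,
) -> List[float]:
    """First `count` heartbeat periods after a slave node becomes abnormal.

    Each period value is used `repeats` times before the step is added
    (repeats=1 is the continuous case). Values are capped at max_ms.
    """
    if step_ms <= 0 or repeats < 1:
        raise ValueError("step_ms must be positive and repeats at least 1")
    periods = []
    for i in range(count):
        increments = i // repeats  # number of steps applied so far
        periods.append(min(initial_ms + increments * step_ms, max_ms))
    return periods


# Continuous case with the 100 ms / 500 ms figures from the text and an assumed P = 0.5.
step = 0.5 * 80  # incremental step P x 80 ms
continuous = heartbeat_periods(100, 500, step, count=12)

# Assumed reading of k = 1: every period value is sent twice.
spaced = heartbeat_periods(100, 500, step, count=6, repeats=2)
```

Under these assumptions `continuous` climbs from 100 ms in 40 ms steps, reaches 500 ms at the eleventh heartbeat and stays there, while `spaced` begins 100, 100, 140, 140, 180, 180.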

In the embodiment of the present invention, the incremental step size is positively correlated with the set utilization rate threshold, specifically: the incremental step size may be in a positive linear correlation with the set usage threshold, or the incremental step size may be in a positive non-linear correlation with the set usage threshold.

When the set usage threshold is associated with the collected device connection number and tunnel traffic number of the corresponding controller, the incremental step size is also associated with the collected device connection number and tunnel traffic number of the corresponding controller.

Making the incremental step depend on the set utilization rate threshold ties the period at which the master node sends heartbeat messages to the corresponding slave node dynamically to that threshold; when the set utilization rate threshold is associated with the collected device connection number and tunnel service number of the corresponding controller, the heartbeat period is likewise associated with them, which avoids wasting resources and allows an abnormal controller to be confirmed quickly.

In the embodiment of the invention, if the abnormal slave node returns to normal, the master node resumes sending heartbeat messages to that slave node at the initial period.

If the abnormal master node returns to normal, it joins the controller cluster as a new slave node, and the new master node sends heartbeat messages to it at the initial period.

Example 2:

This embodiment describes a controller cluster node management method with reference to fig. 2, taking a single controller as an example; the monitoring indexes comprise the CPU utilization rate, the memory utilization rate, the device connection number, the tunnel service number, the process state, the cluster state and the network state; whether a specified item in the collected monitoring indexes of the controller fails to reach the standard is judged in the second mode: if the specified item is abnormal in N consecutive acquisitions, it is judged not to reach the standard, where N is a natural number greater than or equal to 2; in this embodiment, every item in the monitoring indexes of the single controller is judged over N consecutive acquisitions.

As shown in fig. 2, the deployment scale of the controller cluster is M controllers, where M is a natural number greater than 1; one controller is elected from the M controllers as the master node, and the remaining controllers serve as slave nodes; the specification data of the M controllers are identical and include:

the set connection number: $D$;

the set service number: $B$;

the average of the initial CPU utilization rate and the initial memory utilization rate: $P_{init}$;

the average of the maximum CPU utilization rate and the maximum memory utilization rate: $P_{max}$;

the set utilization rate threshold: $P = P_{init} + (P_{max} - P_{init}) \cdot [(B_t/B)^2 + (1 - B_t/B) \cdot D_t/D]$.
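The threshold formula can be evaluated directly. The sketch below uses hypothetical function and parameter names and assumes utilization rates are expressed as fractions; it applies the formula exactly as written, without clamping.

```python
def usage_threshold(
    p_init: float,
    p_max: float,
    tunnel_services: int,     # B_t, collected tunnel service number
    set_services: int,        # B, set service number
    device_connections: int,  # D_t, collected device connection number
    set_connections: int,     # D, set connection number
) -> float:
    """P = P_init + (P_max - P_init) * [(B_t/B)^2 + (1 - B_t/B) * D_t/D]."""
    tunnel_ratio = tunnel_services / set_services
    device_ratio = device_connections / set_connections
    load_factor = tunnel_ratio ** 2 + (1 - tunnel_ratio) * device_ratio
    return p_init + (p_max - p_init) * load_factor
```

With no tunnel services and no device connections the load factor is 0 and the threshold equals P_init; when the collected tunnel service number equals the set service number the factor is 1 and the threshold equals P_max. If a collected value exceeds its set value the factor can exceed 1, so the computed threshold can rise above P_max.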

Each time the monitoring indexes of a single controller are sampled, the data collected at the same time include: the collected device connection number $D_t$, the collected tunnel service number $B_t$, the collected CPU utilization rate $P_{t1}$, the collected memory utilization rate $P_{t2}$, the collected process state $S_{t1}$, the collected cluster state $S_{t2}$ and the collected network state $S_{t3}$.

After the monitoring indexes of a single controller have been sampled, the collected data are processed and analyzed; the set utilization rate threshold is dynamically associated with the device connection number $D_t$ and the tunnel service number $B_t$ collected each time, i.e. the set utilization rate threshold is not a fixed value.

The analysis of the simultaneously acquired data proceeds as follows:

Compare $D_t$ with $D$: if $D_t \le D$, the abnormal count $N_{t1}$ is set to 0; if $D_t > D$, the abnormal count is incremented by 1 to obtain $N_{t1}$.

Compare $B_t$ with $B$: if $B_t \le B$, the abnormal count $N_{t2}$ is set to 0; if $B_t > B$, the abnormal count is incremented by 1 to obtain $N_{t2}$.

Compare $P_{t1}$ with $P$: if $P_{t1} \le P$, the abnormal count $N_{t3}$ is set to 0; if $P_{t1} > P$, the abnormal count is incremented by 1 to obtain $N_{t3}$.

Compare $P_{t2}$ with $P$: if $P_{t2} \le P$, the abnormal count $N_{t4}$ is set to 0; if $P_{t2} > P$, the abnormal count is incremented by 1 to obtain $N_{t4}$.

Judge whether $S_{t1}$ is abnormal: if $S_{t1}$ is normal, the abnormal count $N_{t5}$ is set to 0; if $S_{t1}$ is abnormal, the abnormal count is incremented by 1 to obtain $N_{t5}$.

Judge whether $S_{t2}$ is abnormal: if $S_{t2}$ is normal, the abnormal count $N_{t6}$ is set to 0; if $S_{t2}$ is abnormal, the abnormal count is incremented by 1 to obtain $N_{t6}$.

Judge whether $S_{t3}$ is abnormal: if $S_{t3}$ is normal, the abnormal count $N_{t7}$ is set to 0; if $S_{t3}$ is abnormal, the abnormal count is incremented by 1 to obtain $N_{t7}$.

Next, taking N = 3, if a specified item in the monitoring indexes of the controller is abnormal in all 3 consecutive acquisitions, the specified item in the collected monitoring indexes of the controller is judged not to reach the standard.

Then, as of the A-th acquisition, the number of times the collected device connection number has exceeded the set connection number is $A_{t1}$, the number of times the collected tunnel service number has exceeded the set service number is $A_{t2}$, the number of times the collected CPU utilization rate has exceeded the set utilization rate threshold is $A_{t3}$, the number of times the collected memory utilization rate has exceeded the set utilization rate threshold is $A_{t4}$, the abnormal count of the collected process state is $A_{t5}$, the abnormal count of the collected cluster state is $A_{t6}$, and the abnormal count of the collected network state is $A_{t7}$.

If any one of $A_{t1}$, $A_{t2}$, $A_{t3}$, $A_{t4}$, $A_{t5}$, $A_{t6}$ and $A_{t7}$ equals 3, the corresponding controller is judged to be abnormal, $A_{t1}$, $A_{t2}$, $A_{t3}$, $A_{t4}$, $A_{t5}$, $A_{t6}$ and $A_{t7}$ are reset to 0, and a new controller is expanded and joins the controller cluster as a slave node to carry services.
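The counting scheme above (reset a counter to 0 on a normal reading, add 1 on an abnormal one, and declare the controller abnormal once any counter reaches N) can be sketched as follows. The class name, the index names and the dictionary-based interface are assumptions made for illustration; triggering the scale-out is left to the caller.

```python
from typing import Dict

# One counter per monitoring index; the names are hypothetical labels.
INDEXES = ("device", "tunnel", "cpu", "memory", "process", "cluster", "network")


class ConsecutiveAbnormalCounter:
    def __init__(self, n: int = 3) -> None:
        self.n = n  # consecutive abnormal acquisitions required
        self.counts: Dict[str, int] = {name: 0 for name in INDEXES}

    def update(self, abnormal: Dict[str, bool]) -> bool:
        """Feed one acquisition; return True if the controller is judged abnormal.

        `abnormal` maps an index name to True when that index failed the
        standard in this acquisition; missing names are treated as normal.
        """
        for name in INDEXES:
            if abnormal.get(name, False):
                self.counts[name] += 1
            else:
                self.counts[name] = 0
        if any(count >= self.n for count in self.counts.values()):
            # All counters are reset once the controller is judged abnormal.
            for name in INDEXES:
                self.counts[name] = 0
            return True
        return False
```

A caller would invoke `update()` once per sampling round and expand a new controller whenever it returns True.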

If the abnormal controller is a slave node, the master node stops sending data messages to that slave node until it returns to normal, and gradually increases the period at which it sends heartbeat messages to that slave node from the initial period to the set maximum period by the incremental step, after which it sends heartbeat messages at the set maximum period; during this process, the controller cluster management system continues to sample the slave node, and when, in one acquisition, all the monitoring indexes of the slave node reach the standard, the slave node is judged to have returned to normal, and the master node resumes sending data messages to it at the initial period for sending data messages and heartbeat messages at the initial period for sending heartbeat messages; the controller cluster management system then redistributes the device connection number and the tunnel service number of each node by load sharing, and, once the redistribution is complete, reclaims the slave node with the lowest comprehensive utilization rate.

The comprehensive utilization rate is associated with $D_t$, $B_t$, $P_{t1}$ and $P_{t2}$; specifically, $D_t/D$, $B_t/B$, $P_{t1}$ and $P_{t2}$ are assigned different weights and combined by weighted calculation, for example: comprehensive utilization rate $= (D_t/D) \cdot 20\% + (B_t/B) \cdot 20\% + P_{t1} \cdot 30\% + P_{t2} \cdot 30\%$.
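As a worked illustration of the example weighting, consider two slave nodes with made-up readings; the symbols $U_1$ and $U_2$ and all numeric values are assumptions chosen only to show the arithmetic. Slave 1 has $D_t/D = 0.5$, $B_t/B = 0.25$, $P_{t1} = 0.6$, $P_{t2} = 0.4$; slave 2 has $D_t/D = 0.8$, $B_t/B = 0.5$, $P_{t1} = 0.5$, $P_{t2} = 0.5$.

```latex
\begin{aligned}
U_1 &= 0.5 \cdot 0.2 + 0.25 \cdot 0.2 + 0.6 \cdot 0.3 + 0.4 \cdot 0.3
     = 0.10 + 0.05 + 0.18 + 0.12 = 0.45 \\
U_2 &= 0.8 \cdot 0.2 + 0.5 \cdot 0.2 + 0.5 \cdot 0.3 + 0.5 \cdot 0.3
     = 0.16 + 0.10 + 0.15 + 0.15 = 0.56
\end{aligned}
```

Since $U_1 < U_2$, slave 1 would be the node reclaimed in this hypothetical case.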

If all of $A_{t1}$, $A_{t2}$, $A_{t3}$, $A_{t4}$, $A_{t5}$, $A_{t6}$ and $A_{t7}$ are less than 3, the monitoring indexes of the corresponding controller continue to be sampled; $A_{t1}$, $A_{t2}$, $A_{t3}$, $A_{t4}$, $A_{t5}$, $A_{t6}$ and $A_{t7}$ are updated after the next sampling and analysis, and it is again judged whether any one of the updated values equals 3.

Example 3:

an embodiment of the present invention provides a controller cluster node management apparatus, as shown in fig. 3, including one or more processors 21 and a memory 22. In fig. 3, one processor 21 is taken as an example.

The processor 21 and the memory 22 may be connected by a bus or other means, such as the bus connection in fig. 3.

The memory 22, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs and non-volatile computer-executable programs, such as the controller cluster node management method in embodiment 1. The processor 21 executes the controller cluster node management method by executing non-volatile software programs and instructions stored in the memory 22.

The memory 22 may include high speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the memory 22 may optionally include memory located remotely from the processor 21, and these remote memories may be connected to the processor 21 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The program instructions/modules are stored in the memory 22, and when executed by the one or more processors 21, perform the controller cluster node management methods of embodiments 1 and 2 described above, for example, perform the steps shown in fig. 1 and 2 described above.

It should be noted that the information interaction and execution processes between the modules and units in the apparatus and system are based on the same concept as the method embodiments of the present invention; for details, refer to the description of the method embodiments, which is not repeated here.

Those of ordinary skill in the art will appreciate that all or part of the steps of the methods of the embodiments may be implemented by a program instructing associated hardware; the program may be stored in a computer-readable storage medium, which may include: read-only memory (ROM), random access memory (RAM), magnetic disks, optical disks, and the like.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims

1. A method for managing controller cluster nodes is characterized by comprising the following steps: the controller cluster management system samples the monitoring indexes of each controller and judges whether any one of the collected monitoring indexes of the controllers does not reach the standard; if yes, judging that the corresponding controller is abnormal, and expanding a new controller to be added into the controller cluster; if not, continuing to sample the monitoring index of the corresponding controller;

the monitoring indexes comprise one or more of the utilization rate of a CPU (Central processing Unit), the utilization rate of a memory, the number of equipment connections, the number of tunnel services, the process state, the cluster state and the network state;

if the collected CPU utilization rate or the memory utilization rate of the controller is larger than a set utilization rate threshold, judging that the collected CPU utilization rate or the memory utilization rate of the controller does not reach the standard; the set utilization rate threshold is associated with the collected device connection number and the tunnel service number of the corresponding controller; the relationship between the set utilization rate threshold and the device connection number and the tunnel service number is represented as follows: $P = P_{init} + (P_{max} - P_{init}) \cdot [(B_t/B)^2 + (1 - B_t/B) \cdot D_t/D]$, wherein $P$ is the set utilization rate threshold, $P_{init}$ is the average of the initial CPU usage and the initial memory usage, $P_{max}$ is the average of the maximum CPU usage and the maximum memory usage, $B_t$ is the collected tunnel service number, $B$ is the set service number, $D_t$ is the collected device connection number, and $D$ is the set connection number.

2. The method according to claim 1, wherein the determining whether the collected CPU utilization or memory utilization of the controller is greater than the set utilization threshold includes:

wherein N is a natural number greater than or equal to 2.

3. The controller cluster node management method of claim 1, wherein any one of a device connection number, a tunnel traffic number, a process state, a cluster state, or a network state does not meet a standard, comprising:

the collected process state is abnormal, or,

the collected cluster state is abnormal, or,

and the collected network state is abnormal.

4. The controller cluster node management method according to claim 1, wherein in one acquisition, when all items in the monitoring indexes of the abnormal controller reach the standard, it is determined that the corresponding controller has returned to normal, and the controller with the lowest comprehensive utilization rate is reclaimed, wherein the comprehensive utilization rate is associated with one or more of the collected CPU utilization rate, memory utilization rate, device connection number and tunnel service number; if the monitoring indexes comprise only the CPU utilization rate, the comprehensive utilization rate is the collected CPU utilization rate; if the monitoring indexes comprise the CPU utilization rate, the memory utilization rate, the device connection number and the tunnel service number, the comprehensive utilization rate is obtained by weighted calculation of the collected CPU utilization rate, the collected memory utilization rate, the device load rate and the tunnel load rate; the device load rate is obtained by dividing the collected device connection number by the set connection number; the tunnel load rate is obtained by dividing the collected tunnel service number by the set service number.

5. The controller cluster node management method of claim 1, wherein one controller is elected from the controller cluster as a master node and the remaining controllers are taken as slave nodes; and if the abnormal controller is the slave node, the master node stops sending the data message to the corresponding slave node until the corresponding slave node returns to normal.

6. The controller cluster node management method of claim 1, wherein one controller is elected from the controller cluster as a master node and the remaining controllers are taken as slave nodes; if the abnormal controller is the slave node, the master node gradually increases the period of sending the heartbeat message to the corresponding slave node from the initial period by the incremental step size to the set maximum period and then sends the heartbeat message to the corresponding slave node by the set maximum period.

7. The controller cluster node management method of claim 6, wherein the incremental step size is positively correlated to the set usage threshold.

8. The controller cluster node management method of claim 6, wherein if the abnormal slave node returns to normal, the master node returns to sending the heartbeat packet to the corresponding slave node at an initial period.

9. An apparatus for controller cluster node management, the apparatus comprising:

at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor for performing the controller cluster node management method of any of claims 1-8.