CN113535517B - Controller cluster node management method and device - Google Patents

Publication number
CN113535517B
Authority
CN
China
Prior art keywords
controller
utilization rate
collected
abnormal
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110833833.6A
Other languages
Chinese (zh)
Other versions
CN113535517A (en)
Inventor
蔡多多
胡志远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fiberhome Telecommunication Technologies Co Ltd
Original Assignee
Fiberhome Telecommunication Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fiberhome Telecommunication Technologies Co Ltd
Priority to CN202110833833.6A
Publication of CN113535517A
Application granted
Publication of CN113535517B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0709 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793 Remedial or corrective actions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3055 Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04Q SELECTING
    • H04Q9/00 Arrangements in telecontrol or telemetry systems for selectively calling a substation from a main station, in which substation desired apparatus is selected for applying a control signal thereto or for obtaining measured values therefrom
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04Q SELECTING
    • H04Q2209/00 Arrangements in telecontrol or telemetry systems
    • H04Q2209/70 Arrangements in the main station, i.e. central controller

Abstract

The invention relates to the field of network technology and provides a method and a device for managing controller cluster nodes. In the method, the controller cluster management system samples the monitoring indexes of each controller and judges whether any of a controller's collected monitoring indexes fails to reach the standard. If so, the corresponding controller is judged to be abnormal and a new controller is expanded and added to the controller cluster; if not, the system continues to sample the monitoring indexes of that controller. The monitoring indexes comprise one or more of CPU utilization, memory utilization, number of device connections, number of tunnel services, process state, cluster state and network state. By monitoring multiple indexes of each controller and increasing the number of controllers in the cluster as needed once any index becomes abnormal, the method and device realize intelligent operation and maintenance of cluster deployments and improve stability and maintainability.

Description

Controller cluster node management method and device
Technical Field
The present invention relates to the field of network technologies, and in particular, to a method and an apparatus for managing controller cluster nodes.
Background
Controllers can be deployed as a cluster in engineering applications. With the development of network technologies such as virtualization, cloud computing, general-purpose platforms and SDN/NFV, cluster deployment and management of controllers is increasingly common. General-purpose server resources are pooled, and users request and use resources on demand, which greatly reduces network construction cost and greatly improves network resource utilization.
When controllers are deployed as a cluster, data is stored on every node and every node can obtain the full service data. The controller cluster management system monitors the state and changes of the cluster nodes in real time, checks them against set thresholds, and dynamically changes the cluster scale.
A server resource pool is deployed; controller nodes are created on demand when a need for expansion is detected, and node resources are deleted and reclaimed promptly when the extra capacity is no longer required. The controller cluster management system creates and deletes controller nodes dynamically according to real-time detection results, which gives the controller cluster automatic scale-out and scale-in, improves the stability and reliability of the cluster, simplifies operation and maintenance, and improves customer experience.
In the prior art, the number of controllers must be defined when a controller cluster is created. For example, if one controller can control 50 switches and an SDN controller cluster must control 300 switches, the cluster must contain at least 6 controllers, so a cluster of 6 controllers is created. However, when the number of switches in the network increases, for example to 400, the number of controllers in the cluster is already fixed; it cannot be increased or decreased dynamically as traffic grows or shrinks, so the dynamic demands of the service cannot be met.
In view of the above, overcoming the drawbacks of the prior art is an urgent problem in the art.
Disclosure of Invention
The technical problem to be solved by the invention is as follows:
when a controller in the controller cluster has CPU utilization above the set utilization threshold, memory utilization above the set utilization threshold, an overloaded number of device connections, an overloaded number of tunnel services, an abnormal process state, an abnormal cluster state or an abnormal network state, the number of controllers cannot be increased as needed.
To solve this problem, the invention adopts the following technical solution:
In a first aspect, the present invention provides a controller cluster node management method, including: the controller cluster management system samples the monitoring indexes of each controller and judges whether any of the collected monitoring indexes of a controller fails to reach the standard; if so, the corresponding controller is judged to be abnormal, and a new controller is expanded and added to the controller cluster; if not, sampling of the monitoring indexes of the corresponding controller continues;
the monitoring indexes comprise one or more of CPU utilization, memory utilization, number of device connections, number of tunnel services, process state, cluster state and network state.
Preferably, if the collected CPU utilization or memory utilization of a controller is greater than a set utilization threshold, the collected CPU utilization or memory utilization of that controller is judged not to reach the standard; the set utilization threshold is associated with the collected number of device connections and number of tunnel services of the corresponding controller.
Preferably, judging whether the collected CPU utilization or memory utilization of a controller is greater than the set utilization threshold includes:
when the CPU utilization of the controller is greater than the set utilization threshold in all N acquisitions, judging that the collected CPU utilization of the controller is greater than the set utilization threshold;
when the memory utilization of the controller is greater than the set utilization threshold in all N acquisitions, judging that the collected memory utilization of the controller is greater than the set utilization threshold;
wherein N is a natural number greater than or equal to 2.
Preferably, any one of the number of device connections, the number of tunnel services, the process state, the cluster state or the network state failing to reach the standard includes:
the collected number of device connections is greater than the set number of connections, or,
the collected number of tunnel services is greater than the set number of services, or,
the collected process state is abnormal, or,
the collected cluster state is abnormal, or,
the collected network state is abnormal.
Preferably, when all the monitoring indexes of an abnormal controller reach the standard in one acquisition, the corresponding controller is judged to have returned to normal, and the controller with the lowest comprehensive utilization rate is reclaimed, wherein the comprehensive utilization rate is associated with one or more of the collected CPU utilization, memory utilization, number of device connections and number of tunnel services.
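The reclamation step can be illustrated with a minimal Python sketch. The source states only that the comprehensive utilization rate is associated with the four collected quantities and gives no formula, so the equal-weight mean below, the `ControllerLoad` type, the function names and the normalizing capacities (10000 connections, 3000 tunnel services, borrowed from the worked example in the detailed description) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ControllerLoad:
    name: str
    cpu: float          # CPU utilization, 0.0 to 1.0
    mem: float          # memory utilization, 0.0 to 1.0
    connections: int    # collected number of device connections
    tunnels: int        # collected number of tunnel services


def comprehensive_utilization(c, max_connections=10000, max_tunnels=3000):
    # Assumption: equal-weight mean of four normalized indexes.
    # The patent text does not specify how the indexes are combined.
    parts = [
        c.cpu,
        c.mem,
        c.connections / max_connections,
        c.tunnels / max_tunnels,
    ]
    return sum(parts) / len(parts)


def pick_controller_to_reclaim(controllers):
    # Return the controller with the lowest comprehensive utilization.
    return min(controllers, key=comprehensive_utilization)


cluster = [
    ControllerLoad("ctrl-a", cpu=0.5, mem=0.5, connections=5000, tunnels=1500),
    ControllerLoad("ctrl-b", cpu=0.2, mem=0.3, connections=1000, tunnels=300),
]
candidate = pick_controller_to_reclaim(cluster)
```

Any monotone combination of the indexes would fit the wording of the source equally well; the mean is used here only because it is the simplest choice.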
Preferably, one controller is selected from the controller cluster as the master node, and the remaining controllers serve as slave nodes; if the abnormal controller is a slave node, the master node stops sending data messages to that slave node until it returns to normal.
Preferably, one controller is selected from the controller cluster as the master node, and the remaining controllers serve as slave nodes; if the abnormal controller is a slave node, the master node gradually increases the period at which it sends heartbeat messages to that slave node, starting from the initial period and increasing by the incremental step size until the set maximum period is reached, after which it sends heartbeat messages at the set maximum period.
Preferably, the incremental step size is positively correlated with the set utilization threshold.
Preferably, if the abnormal slave node returns to normal, the master node resumes sending heartbeat messages to that slave node at the initial period.
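The heartbeat back-off described above can be sketched as a generator that yields the successive sending periods. The function name, the units (seconds) and the sample values are assumptions; the source specifies only an initial period, an incremental step size and a set maximum period.

```python
from itertools import islice


def heartbeat_periods(initial, step, maximum):
    """Yield the heartbeat period used for each successive message sent
    to an abnormal slave node: start at `initial`, grow by `step`,
    and stay at `maximum` once it is reached."""
    period = initial
    while True:
        yield period
        period = min(period + step, maximum)


# Hypothetical values: 1 s initial period, 2 s step, 6 s maximum period.
first_five = list(islice(heartbeat_periods(1.0, 2.0, 6.0), 5))
```

When the slave node returns to normal, the master node would simply discard the generator and send at the initial period again.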
In a second aspect, the present invention provides a controller cluster node management apparatus for implementing the controller cluster node management method in the first aspect, where the controller cluster node management apparatus includes:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the processor for performing the controller cluster node management method of the first aspect.
Compared with the prior art, the invention has the beneficial effects that:
the monitoring indexes of each controller in the controller cluster comprise one or more of CPU utilization, memory utilization, number of device connections, number of tunnel services, process state, cluster state and network state; once any monitoring index is abnormal, the number of controllers in the controller cluster is increased as needed. The controller cluster node management method provided by the invention thus realizes automatic, intelligent operation and maintenance of cluster deployments and improves the stability, reliability and maintainability of the controller cluster system.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments of the present invention will be briefly described below. It is obvious that the drawings described below are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
Fig. 1 is a schematic flowchart of a method for managing nodes of a controller cluster according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of a method for managing nodes of a controller cluster according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a controller cluster node management apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In the description of the present invention, the terms "inner", "outer", "longitudinal", "lateral", "upper", "lower", "top", "bottom", and the like indicate orientations or positional relationships based on those shown in the drawings, and are for convenience only to describe the present invention without requiring the present invention to be necessarily constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention.
In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
Example 1:
An embodiment of the present invention provides a controller cluster node management method which, as shown in Fig. 1, includes:
In step 201, the controller cluster management system samples the monitoring indexes of each controller.
The controllers comprise a master node and slave nodes; a controller cluster has only one master node at any time, and the rest are slave nodes.
If any of the collected monitoring indexes of the master node fails to reach the standard, the master node is judged to be abnormal and a new controller is expanded and added to the controller cluster as a slave node; all the slave nodes in the controller cluster then elect one slave node as the master node, and the connection between the master node and the other slave nodes is re-established.
If any of the collected monitoring indexes of a slave node fails to reach the standard, that slave node is judged to be abnormal, a new controller is expanded and added to the controller cluster as a slave node, and the connection between the master node and the newly added slave node is established in the controller cluster.
In step 202, it is determined whether any of the collected monitoring indicators of the controller does not meet the standard.
In step 203, if yes, it is determined that the corresponding controller is abnormal, and a new controller is expanded to join the controller cluster; if not, returning to step 201, and continuing to sample the monitoring index of the corresponding controller.
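Steps 201 to 203 can be sketched in Python as follows. This is an illustrative sketch rather than the patented implementation: the `Sample` and `Limits` types, the function names and the `expand` callback are hypothetical, and the default limits (80%, 10000 connections, 3000 tunnel services) are taken from the worked example later in the description purely as sample values.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    cpu: float            # CPU utilization, 0.0 to 1.0
    mem: float            # memory utilization, 0.0 to 1.0
    connections: int      # number of device connections
    tunnels: int          # number of tunnel services
    process_ok: bool
    cluster_ok: bool
    network_ok: bool


@dataclass
class Limits:
    utilization: float = 0.80
    connections: int = 10000
    tunnels: int = 3000


def failing_indexes(s, lim):
    """Step 202: list the monitoring indexes that fail to reach the standard."""
    checks = {
        "cpu": s.cpu > lim.utilization,
        "mem": s.mem > lim.utilization,
        "connections": s.connections > lim.connections,
        "tunnels": s.tunnels > lim.tunnels,
        "process": not s.process_ok,
        "cluster": not s.cluster_ok,
        "network": not s.network_ok,
    }
    return [name for name, failed in checks.items() if failed]


def manage_once(samples, lim, expand):
    """Steps 201-203 for one sampling round.

    `samples` maps controller name to its Sample; `expand` is a
    hypothetical callback that adds a new controller to the cluster.
    Returns the names of controllers judged abnormal."""
    abnormal = []
    for name, s in samples.items():
        if failing_indexes(s, lim):
            abnormal.append(name)
            expand(name)
    return abnormal


healthy = Sample(0.4, 0.5, 2000, 500, True, True, True)
overloaded = Sample(0.9, 0.5, 2000, 500, True, True, True)
expanded_for = []
abnormal = manage_once({"c1": healthy, "c2": overloaded}, Limits(), expanded_for.append)
```

Controllers that pass every check are simply sampled again in the next round, which corresponds to returning to step 201.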
The controller cluster management system redistributes the device connections and tunnel services that the abnormal controller can no longer handle to the new controller and to the controllers that are not abnormal in a load-sharing manner, and continues to monitor the monitoring indexes after redistribution.
If, while the abnormality persists, a user of the controller cluster management system adds new device connections and tunnel services, the controller cluster management system redistributes both the newly added device connections and tunnel services and those that the abnormal controller can no longer handle to the new controller and to the controllers that are not abnormal in a load-sharing manner, and continues to monitor the monitoring indexes after redistribution.
The monitoring index comprises one or more of the utilization rate of a CPU, the utilization rate of a memory, the number of equipment connections, the number of tunnel services, the process state, the cluster state and the network state.
When the monitoring indexes comprise only CPU utilization and memory utilization, the controller cluster management system samples the CPU utilization and memory utilization of each controller; if the collected CPU utilization of a controller is greater than the set utilization threshold and the collected memory utilization is greater than the set utilization threshold, the corresponding controller is judged to be abnormal, and a new controller is expanded and added to the controller cluster.
When the monitoring indexes comprise only the number of device connections and the number of tunnel services, the controller cluster management system samples the number of device connections and the number of tunnel services of each controller; if the collected number of device connections of a controller is greater than the set number of connections and the number of tunnel services is greater than the set number of services, the corresponding controller is judged to be abnormal, and a new controller is expanded and added to the controller cluster.
When the monitoring indexes comprise only the process state, the cluster state and the network state, the controller cluster management system samples the process state, cluster state and network state of each controller; if the collected process state, cluster state and network state of a controller are not all normal, the corresponding controller is judged to be abnormal, and a new controller is expanded and added to the controller cluster.
Whether a specified item among the collected monitoring indexes of a controller fails to reach the standard can be judged in two ways:
The first way: if the specified item among the monitoring indexes of the controller is abnormal in a single acquisition, the specified item is judged not to reach the standard.
The second way: if the specified item among the monitoring indexes of the controller is abnormal in N consecutive acquisitions, the specified item is judged not to reach the standard, wherein N is a natural number greater than or equal to 2.
In the embodiment of the invention, if the collected CPU utilization rate or the memory utilization rate of the controller is greater than the set utilization rate threshold, the collected CPU utilization rate or the memory utilization rate of the controller is judged not to reach the standard; the set utilization threshold is associated with the collected device connection number and tunnel traffic number of the corresponding controller.
In the embodiment of the invention, the set utilization threshold is dynamically associated with the collected number of device connections and number of tunnel services of the corresponding controller, so the threshold is set in light of the actual number of device connections and tunnel services and reflects resource usage more accurately and effectively. Compared with the existing fixed threshold, it reflects system resource utilization more reasonably, fits practical applications better, and allows a fault to be detected sooner once the CPU or memory utilization of the system becomes too high.
If the monitoring indexes include the number of device connections and the number of tunnel services in addition to CPU utilization and memory utilization, the set utilization threshold is associated with the collected number of device connections and the collected number of tunnel services of the corresponding controller, where the number of tunnel services has a larger influence on the set utilization threshold than the number of device connections.
The relationship between the set utilization threshold and the number of device connections and the number of tunnel services may be represented as:
$$P = P_{init} + (P_{max} - P_{init}) \cdot \left[ \left( \frac{B_t}{B} \right)^2 + \left( 1 - \frac{B_t}{B} \right) \cdot \frac{D_t}{D} \right] \tag{1}$$
where $P$ is the set utilization threshold, $P_{init}$ is the average of the initial CPU utilization and the initial memory utilization, $P_{max}$ is the average of the maximum CPU utilization and the maximum memory utilization, $B_t$ is the collected number of tunnel services, $B$ is the set number of services, $D_t$ is the collected number of device connections, and $D$ is the set number of connections.
The device connections in a network are generally planned in advance, so the probability of adding devices later is relatively small, whereas the probability of adding tunnel services is large. Therefore, in the controller cluster node management method provided in the embodiment of the present invention, the influence of the number of tunnel services grows as that number increases. In formula (1) the tunnel service ratio is in effect weighted by $B_t/B$ and the device connection ratio by $1 - B_t/B$, so once the number of tunnel services exceeds 50% of the full specification, the number of tunnel services carries the larger weight in the set utilization threshold.
For example:
in the prior art, the threshold is typically set at 80%.
In the embodiment of the invention, the average $P_{init}$ of the initial CPU utilization and the initial memory utilization is set to 30%, the average $P_{max}$ of the maximum CPU utilization and the maximum memory utilization is set to 80%, the set number of services $B$ is 3000, and the set number of connections $D$ is 10000.
In the first case, the controller cluster system has no services and no device connections, i.e. $B_t$ is 0 and $D_t$ is also 0. The set utilization threshold calculated by formula (1) is 30%, so the system is judged abnormal as soon as the average of the collected CPU utilization and memory utilization exceeds 30%. In the prior art, by contrast, the system is judged abnormal only when that average exceeds 80%. Therefore, when there are no services and no device connections, the set utilization threshold dynamically associated with the number of tunnel services and the number of device connections in the method provided in the embodiment of the present invention detects a system abnormality sooner.
In the second case, the number of tunnel services $B_t$ of the controller cluster system is 1500 and the number of device connections $D_t$ is 5000, i.e. the number of tunnel services is only half of the full-specification set number of services and the number of device connections is only half of the full-specification set number of connections. The set utilization threshold calculated by formula (1) is 55%, so the system is judged abnormal as soon as the average of the collected CPU utilization and memory utilization exceeds 55%. In the prior art, by contrast, the system is judged abnormal only when that average exceeds 80%. Therefore, when the number of tunnel services and the number of device connections are each only half of the full specification, the set utilization threshold dynamically associated with these two quantities in the method provided in the embodiment of the present invention detects a system abnormality sooner.
In addition, in actual use, when the number of tunnel services is only half of the full-specification set number of services and the number of device connections is only half of the full-specification set number of connections, the average of the collected CPU utilization and memory utilization generally does not exceed 55%. In this situation, judging whether the system is abnormal solely from the average of the collected CPU utilization and memory utilization would detect a fault only once that average exceeds 55%. Therefore, the monitoring indexes in the controller cluster node management method provided in the embodiment of the present invention further include the process state, the cluster state and the network state; once any one of the collected process state, cluster state and network state is abnormal, the system can be judged abnormal without waiting for the average of the collected CPU utilization and memory utilization to exceed 55%.
In the third case, the number of tunnel services $B_t$ of the controller cluster system is 3000 and the number of device connections $D_t$ is 10000, i.e. both the number of tunnel services and the number of device connections are at the full-load specification. The set utilization threshold calculated by formula (1) is 80%, which is the same as the threshold set in the prior art. Therefore, when the number of tunnel services and the number of device connections are both at the full-load specification and the monitoring indexes include only CPU utilization, memory utilization, number of tunnel services and number of device connections, the set utilization threshold dynamically associated with the number of tunnel services and the number of device connections in the method provided in the embodiment of the present invention is equivalent to the fixed threshold of the prior art.
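The three cases can be checked numerically with a short Python sketch of formula (1). The function and parameter names are illustrative assumptions; the default values are those of the example above ($P_{init}$ = 30%, $P_{max}$ = 80%, $B$ = 3000, $D$ = 10000).

```python
def set_utilization_threshold(b_t, d_t, b=3000, d=10000, p_init=0.30, p_max=0.80):
    """Formula (1): threshold P as a function of the collected number of
    tunnel services b_t and the collected number of device connections d_t."""
    tunnel_ratio = b_t / b
    device_ratio = d_t / d
    # The tunnel ratio is squared; the device ratio is scaled by the
    # remaining share (1 - tunnel_ratio).
    weight = tunnel_ratio ** 2 + (1 - tunnel_ratio) * device_ratio
    return p_init + (p_max - p_init) * weight


# Evaluate the three cases from the text: empty, half-loaded, fully loaded.
cases = [(0, 0), (1500, 5000), (3000, 10000)]
thresholds = [round(set_utilization_threshold(b_t, d_t), 4) for b_t, d_t in cases]
```

Under these inputs the computed thresholds agree with the 30%, 55% and 80% figures stated in the three cases.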
Compared with the fixed threshold of the prior art, and with prior-art approaches whose monitoring indexes include only CPU utilization and memory utilization, the set utilization threshold dynamically associated with the number of tunnel services and the number of device connections in the controller cluster node management method provided by the embodiment of the present invention reflects system resource utilization more reasonably, fits practical applications better, detects faults sooner, and provides a better detection strategy.
In the embodiment of the present invention, the manner of determining whether the collected CPU utilization or memory utilization of the controller is greater than the set utilization threshold includes:
and when the CPU utilization rates of the controllers acquired for N times are all larger than the set utilization rate threshold value, judging that the acquired CPU utilization rates of the controllers are larger than the set utilization rate threshold value.
And when the utilization rates of the memories of the controllers acquired for N times are all larger than the set utilization rate threshold value, judging that the acquired utilization rates of the memories of the controllers are larger than the set utilization rate threshold value.
Wherein N is a natural number greater than or equal to 2.
Taking N = 3 as an example, when the CPU utilization of the controller collected 3 consecutive times is greater than the set utilization threshold every time, the collected CPU utilization of the controller is determined to be greater than the set utilization threshold.
Consecutive means that the 1st, 2nd and 3rd exceedances occur in immediately adjacent collections; it is not a simple cumulative count.
For example: the CPU utilization collected the 1st time is greater than the set utilization threshold, so the abnormal count is incremented by 1, giving 1; the CPU utilization collected the 2nd time, immediately afterwards, is also greater than the set utilization threshold, so the abnormal count is incremented by 1, giving 2; the CPU utilization collected the 3rd time is still greater than the set utilization threshold, so the abnormal count is incremented by 1, giving 3; the collected CPU utilization of the corresponding controller is then determined to be greater than the set utilization threshold.
If the CPU utilization collected in any one of the three immediately adjacent collections is less than or equal to the set utilization threshold, the CPU utilization of the corresponding controller is determined to be not greater than the set utilization threshold.
For example: the CPU utilization collected the 1st time is greater than the set utilization threshold, so the abnormal count is incremented by 1, giving 1; the CPU utilization collected the 2nd time, immediately afterwards, is less than the set utilization threshold, so the abnormal count is reset to 0; the CPU utilization collected the 3rd time is greater than the set utilization threshold, so the abnormal count is incremented by 1, giving 1; the collected CPU utilization of the corresponding controller is then determined to be not greater than the set utilization threshold.
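The consecutive-count rule above can be sketched as a small counter. This is only an illustrative sketch in Python; the class name `ConsecutiveCounter` and its method `observe` are hypothetical and do not come from the patent text.

```python
class ConsecutiveCounter:
    """Tracks how many immediately adjacent samples exceeded a threshold.

    Hypothetical helper for illustration; the names are not from the patent.
    """

    def __init__(self, n: int):
        if n < 2:
            raise ValueError("N must be a natural number >= 2")
        self.n = n
        self.count = 0  # current abnormal count

    def observe(self, value: float, threshold: float) -> bool:
        """Feed one sample; return True once N consecutive samples exceeded."""
        if value > threshold:
            self.count += 1  # exceedance: abnormal count + 1
        else:
            self.count = 0   # any sample at or below the threshold resets it
        return self.count >= self.n


# Usage mirroring the two examples in the text (N = 3, threshold assumed 0.8).
c = ConsecutiveCounter(3)
all_high = [c.observe(v, 0.8) for v in (0.9, 0.9, 0.9)]

c2 = ConsecutiveCounter(3)
interrupted = [c2.observe(v, 0.8) for v in (0.9, 0.7, 0.9)]
```

In the first run the third sample triggers the determination; in the second run the dip on the 2nd sample resets the count, so the count ends at 1 and nothing is triggered.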
In the embodiment of the present invention, any one of the device connection number, the tunnel service number, the process state, the cluster state, or the network state failing to reach the standard includes:
the collected device connection number is greater than the set connection number, or,
the collected tunnel service number is greater than the set service number, or,
the collected process state is abnormal, or,
the collected cluster state is abnormal, or,
the collected network state is abnormal.
Whether the collected device connection number is greater than the set connection number may be determined from the device connection number collected once, or from whether the device connection numbers collected N consecutive times are all greater than the set connection number.
Whether the collected tunnel service number is greater than the set service number may be determined from the tunnel service number collected once, or from whether the tunnel service numbers collected N times are all greater than the set service number.
Whether the collected process state is abnormal may be determined from the process state collected once, or from whether the process states collected N times are all abnormal.
Whether the collected cluster state is abnormal may be determined from the cluster state collected once, or from whether the cluster states collected N times are all abnormal.
Whether the collected network state is abnormal may be determined from the network state collected once, or from whether the network states collected N times are all abnormal.
Here N is a natural number greater than or equal to 2.
In the embodiment of the invention, when, in one collection, all items of the monitoring indexes of the abnormal controller reach the standard, the corresponding controller is determined to have returned to normal, and the controller with the lowest comprehensive utilization is reclaimed (removed from the cluster), where the comprehensive utilization is associated with one or more of the collected CPU utilization, memory utilization, device connection number and tunnel service number.
If the monitoring indexes include only the CPU utilization, the comprehensive utilization is the collected CPU utilization; if the monitoring indexes include the CPU utilization, the memory utilization, the device connection number and the tunnel service number, the comprehensive utilization is a weighted combination of the collected CPU utilization, the collected memory utilization, the device load rate and the tunnel load rate. The device load rate is the collected device connection number divided by the set connection number; the tunnel load rate is the collected tunnel service number divided by the set service number.
Reclaiming the controller with the lowest comprehensive utilization specifically means reclaiming the slave node with the lowest comprehensive utilization; if the controller with the lowest comprehensive utilization is the master node, the master node is skipped and the slave node with the lowest comprehensive utilization among the remaining slave nodes is selected for reclamation.
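As a rough sketch of the selection just described, the following Python computes a weighted comprehensive utilization and picks the slave node to remove, skipping the master node. The `Node` type, the function names and the default weights (20% / 20% / 30% / 30%, borrowed from the example given later in Example 2) are illustrative assumptions, not a definitive implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:  # hypothetical record of one controller's latest sample
    name: str
    is_master: bool
    cpu: float      # collected CPU utilization, 0..1
    mem: float      # collected memory utilization, 0..1
    devices: int    # collected device connection number
    tunnels: int    # collected tunnel service number


def comprehensive_utilization(node: Node, set_devices: int, set_tunnels: int,
                              weights=(0.2, 0.2, 0.3, 0.3)) -> float:
    """Weighted sum of device load rate, tunnel load rate, CPU and memory."""
    w_dev, w_tun, w_cpu, w_mem = weights
    device_load = node.devices / set_devices   # device load rate
    tunnel_load = node.tunnels / set_tunnels   # tunnel load rate
    return (device_load * w_dev + tunnel_load * w_tun
            + node.cpu * w_cpu + node.mem * w_mem)


def pick_slave_to_remove(nodes: List[Node], set_devices: int,
                         set_tunnels: int) -> Optional[Node]:
    """Return the slave with the lowest comprehensive utilization, if any."""
    slaves = [n for n in nodes if not n.is_master]  # the master is skipped
    if not slaves:
        return None
    return min(slaves, key=lambda n: comprehensive_utilization(
        n, set_devices, set_tunnels))


cluster = [
    Node("c1", True, 0.1, 0.1, 10, 10),   # master, lowest load overall
    Node("c2", False, 0.5, 0.5, 50, 50),
    Node("c3", False, 0.3, 0.4, 20, 30),
]
chosen = pick_slave_to_remove(cluster, set_devices=100, set_tunnels=100)
```

Here the master `c1` has the lowest value but is skipped, so the lower of the two slaves is chosen.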
When, in one collection, all items of the monitoring indexes of the abnormal controller reach the standard, the corresponding controller is determined to have returned to normal. For example, when the monitoring indexes include the CPU utilization, the memory utilization, the device connection number, the tunnel service number, the process state, the cluster state and the network state, the criteria for determining that the corresponding controller has returned to normal are:
the collected CPU utilization is less than or equal to the set utilization threshold, and,
the collected memory utilization is less than or equal to the set utilization threshold, and,
the collected device connection number is less than or equal to the set connection number, and,
the collected tunnel service number is less than or equal to the set service number, and,
the collected process state is normal, and,
the collected cluster state is normal, and,
the collected network state is normal.
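The seven criteria above amount to a single conjunction over one sample. A minimal Python sketch follows; the `Sample` type and its field names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Sample:  # hypothetical container for one collection from one controller
    cpu: float
    mem: float
    devices: int
    tunnels: int
    process_ok: bool
    cluster_ok: bool
    network_ok: bool


def has_returned_to_normal(s: Sample, threshold: float,
                           set_devices: int, set_tunnels: int) -> bool:
    """True only if every monitored item meets the standard in this sample."""
    return (s.cpu <= threshold
            and s.mem <= threshold
            and s.devices <= set_devices
            and s.tunnels <= set_tunnels
            and s.process_ok
            and s.cluster_ok
            and s.network_ok)


healthy = Sample(0.4, 0.5, 80, 90, True, True, True)
net_down = Sample(0.4, 0.5, 80, 90, True, True, False)
```

Unlike the abnormality check, which can require N consecutive samples, this check is evaluated on a single collection, as the text states.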
In the embodiment of the invention, one controller is elected from the controller cluster as the master node, and the remaining controllers serve as slave nodes. If the abnormal controller is a slave node, the master node stops sending data messages to that slave node until it returns to normal; after the slave node returns to normal, the master node resumes sending data messages to it at the initial period for data messages.
To reduce the service burden on the controller system, if the abnormal controller is a slave node, then while it stops sending data messages to that slave node, the master node may also gradually increase the period at which it sends heartbeat messages to that slave node, from the initial period up to the set maximum period in increments of the incremental step size, and thereafter send heartbeat messages to that slave node at the maximum period. The initial period at which the master node sends data messages to a slave node may be the same as or different from the initial period for heartbeat messages.
After the slave node returns to normal, the master node resumes sending data messages to it at the initial period for data messages, and sends heartbeat messages to it at the initial period for heartbeat messages.
If the abnormal controller is the master node, a new controller is added to the controller cluster as a slave node, all slave nodes in the controller cluster elect one slave node as the new master node, and the new master node sends data messages to the other slave nodes.
In the embodiment of the invention, in order to confirm abnormal controllers quickly, one controller is elected from the controller cluster as the master node, and the remaining controllers serve as slave nodes. If the abnormal controller is a slave node, the master node gradually increases the period at which it sends heartbeat messages to that slave node, from the initial period up to the set maximum period in increments of the incremental step size, and thereafter sends heartbeat messages to that slave node at the maximum period.
If the abnormal controller is the master node, a new controller is added to the controller cluster as a slave node, all slave nodes in the controller cluster elect one slave node as the new master node, and the new master node sends heartbeat messages to the other slave nodes at the initial period.
The gradual increase may be applied on every consecutive send, or at intervals of k sends, where k is a natural number greater than 0.
Taking the consecutive increase as an example, the gradually increasing period is:
initial period; initial period + incremental step; initial period + 2 * incremental step; initial period + 3 * incremental step; ……
For example, before the slave node becomes abnormal, the master node sends heartbeat messages to it at an initial period of 100 ms; after the slave node becomes abnormal, the period at which the master node sends heartbeat messages to it is gradually increased from the initial period of 100 ms to the set maximum period of 500 ms in incremental steps of P * 80 ms, after which heartbeat messages are sent to that slave node at the set maximum period of 500 ms.
Here P is the set utilization threshold.
After the slave node becomes abnormal, the periods at which the master node sends heartbeat messages to it are:
100 ms; 100 ms + P * 80 ms; 100 ms + 2 * P * 80 ms; 100 ms + 3 * P * 80 ms; ……; 500 ms; 500 ms; ……
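The 100 ms to 500 ms sequence above can be generated with a short function. The sketch below covers only the consecutive-increase case; the function name `heartbeat_periods` is hypothetical, and the value P = 0.5 is an assumed threshold picked purely to make the numbers concrete.

```python
def heartbeat_periods(initial_ms: float, max_ms: float,
                      step_ms: float, count: int) -> list:
    """First `count` heartbeat periods after a slave becomes abnormal.

    The i-th period is initial + i * step, capped at the set maximum.
    """
    return [min(initial_ms + i * step_ms, max_ms) for i in range(count)]


P = 0.5               # assumed set utilization threshold (illustrative only)
step = P * 80         # incremental step from the text: P * 80 ms
periods = heartbeat_periods(100, 500, step, 13)
```

With these assumed values the step is 40 ms, so the period reaches the 500 ms ceiling on the eleventh send and stays there.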
After the slave node becomes abnormal, a new controller is added to the controller cluster as a slave node to carry the services.
Taking k = 1, i.e. increasing at intervals of 1 send, as an example, the gradually increasing period is:
initial period; initial period; initial period + incremental step; initial period + incremental step; initial period + 2 * incremental step; ……
In the embodiment of the present invention, the incremental step size is positively correlated with the set utilization threshold; the positive correlation may be linear or non-linear.
When the set utilization threshold is associated with the collected device connection number and tunnel service number of the corresponding controller, the incremental step size is also associated with the collected device connection number and tunnel service number of the corresponding controller.
Associating the incremental step size with the set utilization threshold makes the period at which the master node sends heartbeat messages to the corresponding slave node vary dynamically with the set utilization threshold. When the set utilization threshold is in turn associated with the collected device connection number and tunnel service number of the corresponding controller, the heartbeat period is also associated with them, which avoids wasting resources while still allowing an abnormal controller to be confirmed quickly.
In the embodiment of the invention, if the abnormal slave node returns to normal, the master node reverts to sending heartbeat messages to that slave node at the initial period.
If the abnormal master node returns to normal, it joins the controller cluster as a new slave node, and the new master node sends heartbeat messages to the new slave node at the initial period.
Example 2:
A controller cluster node management method is described with reference to fig. 2 and this embodiment. In this embodiment, a single controller is taken as an example; the monitoring indexes include the CPU (Central Processing Unit) utilization, the memory utilization, the device connection number, the tunnel service number, the process state, the cluster state and the network state; and the second mode is used to determine whether a specified item of the collected monitoring indexes of the controller fails to reach the standard: if the specified item of the monitoring indexes of the controller is abnormal in N consecutive collections, the specified item of the collected monitoring indexes of the controller is determined not to reach the standard, where N is a natural number greater than or equal to 2. In this embodiment, every item of the monitoring indexes of the single controller is determined over N consecutive collections.
As shown in fig. 2, the controller cluster is deployed with M controllers, where M is a natural number greater than 1; one of the M controllers is elected as the master node and the remaining controllers serve as slave nodes. The specification data of all M controllers are identical and include:
the set connection number D;
the set service number B;
the average of the initial CPU utilization and the initial memory utilization, P_init;
the average of the maximum CPU utilization and the maximum memory utilization, P_max.
The set utilization threshold is P = P_init + (P_max - P_init) * [(B_t/B)^2 + (1 - B_t/B) * D_t/D].
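To make the dynamic threshold concrete, the sketch below evaluates the formula in Python. It assumes the "2" following (Bt/B) in the source is an exponent, so that the bracket equals 1 when the tunnel load is full; the function and parameter names are hypothetical, and the numeric values are made up for illustration.

```python
def utilization_threshold(p_init: float, p_max: float,
                          b_t: int, b_set: int,
                          d_t: int, d_set: int) -> float:
    """P = P_init + (P_max - P_init) * [(B_t/B)^2 + (1 - B_t/B) * D_t/D].

    Assumes the tunnel term is squared; see the lead-in.
    """
    tunnel_load = b_t / b_set   # B_t / B
    device_load = d_t / d_set   # D_t / D
    factor = tunnel_load ** 2 + (1 - tunnel_load) * device_load
    return p_init + (p_max - p_init) * factor


# Illustrative values only: P_init = 0.2, P_max = 0.8, B = D = 100.
idle = utilization_threshold(0.2, 0.8, 0, 100, 0, 100)
half = utilization_threshold(0.2, 0.8, 50, 100, 50, 100)
full_tunnels = utilization_threshold(0.2, 0.8, 100, 100, 0, 100)
```

Under this reading the threshold equals P_init for an idle controller and rises to P_max once the tunnel service number reaches its set value, which is consistent with the statement that the threshold is not a fixed value.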
Each time the monitoring indexes of a single controller are sampled, the data collected simultaneously include: the collected device connection number D_t, the collected tunnel service number B_t, the collected CPU utilization P_t1, the collected memory utilization P_t2, the collected process state S_t1, the collected cluster state S_t2 and the collected network state S_t3.
After the monitoring indexes of the single controller are sampled, the data are processed and analyzed. The set utilization threshold is dynamically associated with the device connection number D_t and the tunnel service number B_t collected each time, i.e. the set utilization threshold is not a fixed value.
The simultaneously collected data are analyzed as follows:
Compare D_t with D: if D_t is less than or equal to D, the abnormal count N_t1 is set to 0; if D_t is greater than D, the abnormal count is incremented by 1 to give N_t1.
Compare B_t with B: if B_t is less than or equal to B, the abnormal count N_t2 is set to 0; if B_t is greater than B, the abnormal count is incremented by 1 to give N_t2.
Compare P_t1 with P: if P_t1 is less than or equal to P, the abnormal count N_t3 is set to 0; if P_t1 is greater than P, the abnormal count is incremented by 1 to give N_t3.
Compare P_t2 with P: if P_t2 is less than or equal to P, the abnormal count N_t4 is set to 0; if P_t2 is greater than P, the abnormal count is incremented by 1 to give N_t4.
Determine whether S_t1 is abnormal: if S_t1 is normal, the abnormal count N_t5 is set to 0; if S_t1 is abnormal, the abnormal count is incremented by 1 to give N_t5.
Determine whether S_t2 is abnormal: if S_t2 is normal, the abnormal count N_t6 is set to 0; if S_t2 is abnormal, the abnormal count is incremented by 1 to give N_t6.
Determine whether S_t3 is abnormal: if S_t3 is normal, the abnormal count N_t7 is set to 0; if S_t3 is abnormal, the abnormal count is incremented by 1 to give N_t7.
Next, taking N = 3: if a specified item of the monitoring indexes of the controller is abnormal in all 3 consecutive collections, the specified item of the collected monitoring indexes of the controller is determined not to reach the standard.
At the A-th collection, the number of times the collected device connection number has exceeded the set connection number is A_t1, the number of times the collected tunnel service number has exceeded the set service number is A_t2, the number of times the collected CPU utilization has exceeded the set utilization threshold is A_t3, the number of times the collected memory utilization has exceeded the set utilization threshold is A_t4, the abnormal count of the collected process state is A_t5, the abnormal count of the collected cluster state is A_t6, and the abnormal count of the collected network state is A_t7.
If any one of A_t1, A_t2, A_t3, A_t4, A_t5, A_t6 and A_t7 equals 3, the corresponding controller is determined to be abnormal, A_t1, A_t2, A_t3, A_t4, A_t5, A_t6 and A_t7 are all reset to 0, and a new controller is added to the controller cluster as a slave node to carry the services.
If the abnormal controller is a slave node, the master node stops sending data messages to that slave node until it returns to normal, and the period at which the master node sends heartbeat messages to it is gradually increased from the initial period to the set maximum period in increments of the incremental step size, after which heartbeat messages are sent at the set maximum period. During this process the controller cluster management system continues to sample that slave node; when, in one collection, all items of its monitoring indexes reach the standard, the slave node is determined to have returned to normal, and the master node resumes sending data messages to it at the initial period for data messages and heartbeat messages at the initial period for heartbeat messages. The controller cluster management system then redistributes the device connection numbers and tunnel service numbers of the nodes by load sharing and, once the redistribution is complete, reclaims the slave node with the lowest comprehensive utilization. The comprehensive utilization is associated with D_t, B_t, P_t1 and P_t2; specifically, D_t/D, B_t/B, P_t1 and P_t2 are assigned different weights and combined by weighted calculation, for example: comprehensive utilization = (D_t/D) * 20% + (B_t/B) * 20% + P_t1 * 30% + P_t2 * 30%.
If all of A_t1, A_t2, A_t3, A_t4, A_t5, A_t6 and A_t7 are less than 3, the monitoring indexes of the corresponding controller continue to be sampled; A_t1, A_t2, A_t3, A_t4, A_t5, A_t6 and A_t7 are updated after the next sampling and analysis, and it is again determined whether any one of the updated values equals 3.
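The per-sample analysis of this embodiment (seven abnormal counts, each reset by a normal reading, with the controller judged abnormal when any count reaches 3) can be sketched as below. The dictionary keys, the function name `analyze_sample` and the sample values are hypothetical; the threshold is passed in as a parameter because the text recomputes it for every collection.

```python
INDEXES = ("devices", "tunnels", "cpu", "mem", "process", "cluster", "network")


def analyze_sample(counters: dict, sample: dict, threshold: float,
                   set_devices: int, set_tunnels: int, n: int = 3) -> bool:
    """Update the seven abnormal counts; return True if the controller
    is judged abnormal (any count reaches n), resetting all counts to 0."""
    abnormal_now = {
        "devices": sample["devices"] > set_devices,
        "tunnels": sample["tunnels"] > set_tunnels,
        "cpu": sample["cpu"] > threshold,
        "mem": sample["mem"] > threshold,
        "process": not sample["process_ok"],
        "cluster": not sample["cluster_ok"],
        "network": not sample["network_ok"],
    }
    for name in INDEXES:
        # Abnormal reading: count + 1; normal reading: count back to 0.
        counters[name] = counters[name] + 1 if abnormal_now[name] else 0
    if any(counters[name] == n for name in INDEXES):
        for name in INDEXES:
            counters[name] = 0  # reinitialize before the next round
        return True
    return False


ok = {"devices": 10, "tunnels": 10, "cpu": 0.3, "mem": 0.3,
      "process_ok": True, "cluster_ok": True, "network_ok": True}
hot = dict(ok, cpu=0.95)  # only the CPU item is out of range

counters = dict.fromkeys(INDEXES, 0)
verdicts = [analyze_sample(counters, s, 0.8, 100, 100) for s in (hot, hot, hot)]
```

A caller would react to a `True` verdict by adding a new slave controller and, for an abnormal slave, by pausing data messages and stretching the heartbeat period as described above.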
Example 3:
an embodiment of the present invention provides a controller cluster node management apparatus, as shown in fig. 3, including one or more processors 21 and a memory 22. In fig. 3, one processor 21 is taken as an example.
The processor 21 and the memory 22 may be connected by a bus or other means, such as the bus connection in fig. 3.
The memory 22, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs and non-volatile computer-executable programs, such as the controller cluster node management method in embodiment 1. The processor 21 executes the controller cluster node management method by executing non-volatile software programs and instructions stored in the memory 22.
The memory 22 may include high speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the memory 22 may optionally include memory located remotely from the processor 21, and these remote memories may be connected to the processor 21 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The program instructions/modules are stored in the memory 22, and when executed by the one or more processors 21, perform the controller cluster node management methods of embodiments 1 and 2 described above, for example, perform the steps shown in fig. 1 and 2 described above.
It should be noted that, because the information interaction and execution processes between the modules and units of the apparatus and system are based on the same concept as the method embodiments of the present invention, reference may be made to the description of the method embodiments for details, which are not repeated here.
Those of ordinary skill in the art will appreciate that all or part of the steps of the various methods of the embodiments may be implemented by associated hardware as instructed by a program, which may be stored on a computer-readable storage medium, which may include: a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and the like.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (9)

1. A method for managing controller cluster nodes is characterized by comprising the following steps: the controller cluster management system samples the monitoring indexes of each controller and judges whether any one of the collected monitoring indexes of the controllers does not reach the standard; if yes, judging that the corresponding controller is abnormal, and expanding a new controller to be added into the controller cluster; if not, continuing to sample the monitoring index of the corresponding controller;
the monitoring indexes comprise one or more of the utilization rate of a CPU (Central processing Unit), the utilization rate of a memory, the number of equipment connections, the number of tunnel services, the process state, the cluster state and the network state;
if the collected CPU utilization rate or the memory utilization rate of the controller is larger than a set utilization rate threshold, judging that the collected CPU utilization rate or the memory utilization rate of the controller does not reach the standard; the set utilization rate threshold is associated with the collected device connection number and the tunnel service number of the corresponding controller; the relationship between the set utilization rate threshold and the device connection number and the tunnel service number is represented as follows: P = P_init + (P_max - P_init) * [(B_t/B)^2 + (1 - B_t/B) * D_t/D], wherein P is the set utilization rate threshold, P_init is the average of the initial CPU usage and the initial memory usage, P_max is the average of the maximum CPU usage and the maximum memory usage, B_t is the collected tunnel service number, B is the set service number, D_t is the collected device connection number, and D is the set connection number.
2. The method according to claim 1, wherein the determining whether the collected CPU utilization or memory utilization of the controller is greater than the set utilization threshold includes:
when the utilization rates of the CPUs of the controllers acquired for N times are all larger than the set utilization rate threshold value, judging that the acquired utilization rates of the CPUs of the controllers are larger than the set utilization rate threshold value;
when the utilization rates of the memories of the controllers collected for N times are all larger than the set utilization rate threshold value, judging that the utilization rate of the collected memories of the controllers is larger than the set utilization rate threshold value;
wherein N is a natural number greater than or equal to 2.
3. The controller cluster node management method of claim 1, wherein any one of a device connection number, a tunnel traffic number, a process state, a cluster state, or a network state does not meet a standard, comprising:
the collected device connection number is larger than the set connection number, or,
the number of the collected tunnel services is larger than the set number of the services, or,
the collected process state is abnormal, or,
the collected cluster state is abnormal, or,
the collected network state is abnormal.
4. The controller cluster node management method according to claim 1, wherein in one acquisition, when all items in the monitoring indexes of the abnormal controllers reach the standard, it is determined that the corresponding controller is recovered to be normal, and a controller with the lowest comprehensive utilization rate is recovered, wherein the comprehensive utilization rate is associated with one or more of the acquired CPU utilization rate, memory utilization rate, device connection number and tunnel service number; if the monitoring index only comprises the utilization rate of the CPU, the comprehensive utilization rate is the collected utilization rate of the CPU; if the monitoring indexes comprise the utilization rate of the CPU, the utilization rate of the memory, the equipment connection number and the tunnel service number, the comprehensive utilization rate is obtained by weighting and calculating the collected utilization rate of the CPU, the collected utilization rate of the memory, the equipment load rate and the tunnel load rate; the equipment load rate is obtained by dividing the collected equipment connection number by the set connection number; the tunnel load rate is obtained by dividing the collected tunnel service number by the set service number.
5. The controller cluster node management method of claim 1, wherein one controller is elected from the controller cluster as a master node and the remaining controllers are taken as slave nodes; and if the abnormal controller is the slave node, the master node stops sending the data message to the corresponding slave node until the corresponding slave node returns to normal.
6. The controller cluster node management method of claim 1, wherein one controller is elected from the controller cluster as a master node and the remaining controllers are taken as slave nodes; if the abnormal controller is the slave node, the master node gradually increases the period of sending the heartbeat message to the corresponding slave node from the initial period to the set maximum period in increments of the incremental step size, and then sends the heartbeat message to the corresponding slave node at the set maximum period.
7. The controller cluster node management method of claim 6, wherein the incremental step size is positively correlated to the set usage threshold.
8. The controller cluster node management method of claim 6, wherein if the abnormal slave node returns to normal, the master node returns to sending the heartbeat packet to the corresponding slave node at an initial period.
9. An apparatus for controller cluster node management, the apparatus comprising:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor for performing the controller cluster node management method of any of claims 1-8.
CN202110833833.6A 2021-07-23 2021-07-23 Controller cluster node management method and device Active CN113535517B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110833833.6A CN113535517B (en) 2021-07-23 2021-07-23 Controller cluster node management method and device

Publications (2)

Publication Number Publication Date
CN113535517A CN113535517A (en) 2021-10-22
CN113535517B true CN113535517B (en) 2022-04-12

Family

ID=78120562

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110833833.6A Active CN113535517B (en) 2021-07-23 2021-07-23 Controller cluster node management method and device

Country Status (1)

Country Link
CN (1) CN113535517B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106559263A (en) * 2016-11-17 2017-04-05 杭州沃趣科技股份有限公司 A kind of improved distributed consensus algorithm
CN109343965A (en) * 2018-10-31 2019-02-15 北京金山云网络技术有限公司 Resource adjusting method, device, cloud platform and server
CN111901422A (en) * 2020-07-28 2020-11-06 浪潮电子信息产业股份有限公司 Method, system and device for managing nodes in cluster

Also Published As

Publication number Publication date
CN113535517A (en) 2021-10-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant