CN110798375A

CN110798375A - Monitoring method, system and terminal equipment for enhancing high availability of container cluster

Info

Publication number: CN110798375A
Application number: CN201910935666.9A
Authority: CN
Inventors: 张东欧
Original assignee: Fiberhome Telecommunication Technologies Co Ltd
Current assignee: Fiberhome Telecommunication Technologies Co Ltd
Priority date: 2019-09-29
Filing date: 2019-09-29
Publication date: 2020-02-14
Anticipated expiration: 2039-09-29
Also published as: CN110798375B

Abstract

The invention discloses a monitoring method, a monitoring system and terminal equipment for enhancing high availability of a container cluster, and relates to the technical field of containers. The monitoring method comprises the following steps: configuring attributes of the deployment nodes of the kubelet, docker and etcd services, and installing, on the deployment nodes, monitoring processes adapted to the corresponding services based on the node attributes; the monitoring process periodically acquires and records state information of the kubelet, docker and etcd services according to a set time period; and when the state information shows that a service is abnormal, the kubelet, docker and etcd services are restarted. The invention periodically monitors the service states of kubelet, docker and etcd through the monitoring process, ensures high availability of the services by restarting them on failure, raises alarms for abnormal services to inform operation and maintenance personnel to perform necessary manual checking and recovery, and enhances the high availability of the cluster in high-concurrency scenarios.

Description

Monitoring method, system and terminal equipment for enhancing high availability of container cluster

Technical Field

The invention belongs to the technical field of containers, and particularly relates to a monitoring method, a monitoring system and a monitoring terminal device for enhancing high availability of a container cluster.

Background

A container cluster is composed of a plurality of server hosts running Kubernetes, with different functional components running on different servers to provide container services externally. Kubernetes serves as the management center for container applications, manages the life cycles of all containers in the cluster, and achieves high availability at the application layer through its own health check and error recovery mechanisms. High availability is measured as the percentage of time during which the container cluster remains in service; it is typically increased either by eliminating single points of failure through redundant design or by shortening failure time through monitoring and recovery mechanisms. With the development of the IT industry and the rise of the microservice concept in internet businesses, container clusters based on Kubernetes (hereinafter referred to as "k8s") have become increasingly popular and are now the mainstream practice in the industry.

The traditional scheme deploys the container cluster services with native k8s and guarantees high availability of the container cluster by deploying a plurality of Master nodes, so that when the service on any one Master node fails, the other nodes continue to provide service while the failed Master node is recovered, and a multi-replica high-availability cluster continues to be provided.

Most platform components of the native cluster are deployed as containers, and their replica counts and recovery mechanisms are guaranteed by their container configuration information. The basic container services on which these platform components depend, such as the kubelet, etcd and docker services, are registered as system services and are configured and managed by the system. Generally, when one of these services fails and crashes, the system restarts it according to its configuration parameters. However, under large-scale system access, a service may enter a state in which it has not crashed but can no longer provide service normally, a so-called "false death" (hung) state. The native mechanism cannot detect or recover from this condition, so the whole container cluster falls into an abnormal, failed state and its high availability is reduced.

Disclosure of Invention

In view of at least one defect or improvement requirement in the prior art, the invention provides a monitoring method, a system and terminal equipment for enhancing high availability of a container cluster, wherein monitoring processes are installed on the deployment nodes of the kubelet, docker and etcd services to periodically monitor the service states of kubelet, docker and etcd, and the high availability of the services is ensured by restarting them on failure; meanwhile, alarms are raised for abnormal services to inform operation and maintenance personnel to carry out necessary manual checking and recovery, so that large-area service interruption is prevented and the high availability of the cluster in high-concurrency scenarios is enhanced.

To achieve the above object, according to one aspect of the present invention, there is provided a monitoring method for enhancing high availability of a container cluster, the method comprising the steps of:

S1: configuring attributes of the deployment nodes of the kubelet, docker and etcd services, and calling an installation and deployment tool to install, on the deployment nodes, monitoring processes adapted to the corresponding services based on the attributes;

S2: the monitoring process periodically acquires and records state information of the kubelet, docker and etcd services according to a set time period;

S3: when the state information shows that a service is abnormal, restarting the kubelet, docker and etcd services.

Preferably, the monitoring method further includes the following steps:

configuring alarm thresholds for the kubelet, docker and etcd services respectively, and generating alarm information when the abnormal duration of any service exceeds the corresponding alarm threshold, so as to inform operation and maintenance personnel to carry out necessary manual checking and recovery and prevent large-area service interruption.

Preferably, the monitoring method further includes, before performing a restart process on the kubelet, docker, etcd service:

testing the availability of the kubelet, docker and etcd services through the monitoring process, and if the test result still shows that a service is abnormal, restarting it through the restart instruction provided by the kubelet, docker and etcd services.

Preferably, the monitoring method further includes the following steps:

periodically checking, according to a set time period, whether the monitoring process is running, and if it is not running, starting the monitoring process again.

Preferably, in the monitoring method, the state information of the kubelet service includes whether the kubelet service is registered on the local machine, whether the kubelet service is running, and whether the service is detected as available through an API;

the state information of the docker service includes whether the docker service is registered on the local machine, whether the docker service is running, and whether the service is available as determined from the return value of a docker service command;

the state information of the etcd service includes whether the etcd service is registered on the local machine, whether the etcd service environment variable file exists, whether the etcd service is running, and whether the service is detected as available through an API.

According to a second aspect of the present invention, there is also provided a monitoring system for enhancing high availability of a container cluster, comprising:

the configuration module is used for specifying the attributes of the deployment nodes of the kubelet, docker and etcd services and calling an installation and deployment tool to install, on the deployment nodes, a monitoring module adapted to the corresponding services based on the attributes;

the monitoring module is used for regularly acquiring and recording the state information of the kubelet, docker and etcd services according to a set time period;

and the alarm module is used for acquiring the state information collected by the monitoring module and, when the state information shows that a service is abnormal, restarting the kubelet, docker and etcd services.

Preferably, in the monitoring system, the alarm module is further used for configuring alarm thresholds for the kubelet, docker and etcd services, and generating alarm information when the abnormal duration of any service exceeds the corresponding alarm threshold, so as to notify operation and maintenance personnel to perform necessary manual checking and recovery, thereby preventing large-area service interruption.

Preferably, in the monitoring system, the alarm module is further configured to control the monitoring module to test the availability of the kubelet, docker, and etcd services before restarting processing, and restart processing is performed through a restart instruction provided by the kubelet, docker, and etcd services if a test result still shows that the services are abnormal.

Preferably, the monitoring system further comprises an inspection module and a visualization module;

the inspection module is used for periodically checking, according to a set time period, whether the monitoring process of the monitoring module is running, and if it is not running, starting the monitoring process again;

the visualization module is used for receiving the alarm information generated by the alarm module and performing visualization display.

According to a third aspect of the present invention, there is also provided a terminal device, comprising at least one processing unit and at least one storage unit,

wherein the storage unit stores a computer program which, when executed by the processing unit, causes the processing unit to carry out the steps of the monitoring method of any one of the above.

In general, compared with the prior art, the above technical solution contemplated by the present invention can achieve the following beneficial effects:

(1) According to the monitoring method, the monitoring system and the terminal equipment for enhancing the high availability of the container cluster, monitoring processes are installed on the deployment nodes of the kubelet, docker and etcd services to periodically monitor the service states of kubelet, docker and etcd, and the high availability of the services is ensured by restarting them on failure; meanwhile, alarms are raised for abnormal services to inform operation and maintenance personnel to carry out necessary manual checking and recovery, so that large-area service interruption is prevented and the high availability of the cluster in high-concurrency scenarios is enhanced.

(2) According to the monitoring method, the monitoring system and the terminal equipment for enhancing the high availability of the container cluster, when the monitoring process detects an abnormal service, the restart is not performed immediately; instead, the detection result of the monitoring process is verified through multiple service availability tests, and the service is restarted only if multiple consecutive tests all indicate that it is abnormal. This effectively prevents a fault in the monitoring process from skewing the collected service state and causing the system to perform unnecessary restarts that waste system resources.

(3) According to the monitoring method, the monitoring system and the terminal equipment for enhancing the high availability of the container cluster, a check script is started periodically to check whether the monitoring process is running, and if it is not running, the monitoring process is started again, so that a failure of the monitoring process does not affect the normal service monitoring function.

Drawings

FIG. 1 is a flow chart of a monitoring method for enhancing high availability of a container cluster according to an embodiment of the present invention;

FIG. 2 is a logic block diagram of a monitoring system for enhancing high availability of a container cluster according to an embodiment of the present invention;

FIG. 3 is a schematic composition diagram of a specific implementation of the monitoring system according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

Fig. 1 is a flowchart of a monitoring method for enhancing high availability of a container cluster according to the present embodiment, and referring to fig. 1, the monitoring method includes the following steps:

S1: defining attributes of the deployment nodes of the kubelet, docker and etcd services, and calling an installation and deployment tool to install, on the deployment nodes, monitoring processes adapted to the corresponding services based on the node attributes;

Because the three components kubelet, docker and etcd must cooperate to provide the pod service, this scheme mainly monitors these three components, which cannot be managed by pods; the remaining platform components can achieve high availability through the pod probe function (probes). The kubelet and docker components are deployed on all nodes (Master and Node), whereas the etcd component may be deployed on Master nodes or Node nodes depending on the deployment configuration. Therefore, when Ansible is used to deploy the k8s cluster, the etcd nodes, that is, the nodes on which the etcd service is deployed, need to be defined.

Different deployment nodes are configured with different types of services. For example, one Master node may be configured with the two components kubelet and docker, while another Master node is configured with the three components kubelet, docker and etcd. Therefore, in this embodiment, the attributes of each deployment node are defined first, and then an installation and deployment tool provided by the system is called to install, deploy and configure the corresponding monitoring processes according to the attributes of each deployment node, so as to detect the service running state of the components on that node.
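As an illustration only, the selection of monitoring processes from node attributes can be sketched as follows. The `NodeAttributes` type, the `is_etcd_node` flag and the `monitors_for` helper are hypothetical names that do not appear in the original disclosure; the sketch merely assumes, as stated above, that every node runs kubelet and docker and that only designated nodes run etcd.

```python
from dataclasses import dataclass
from typing import List

# Services that the description says are deployed on every node.
COMMON_SERVICES = ("kubelet", "docker")


@dataclass(frozen=True)
class NodeAttributes:
    """Hypothetical per-node attributes defined before deployment."""
    hostname: str
    is_etcd_node: bool = False  # True if the etcd service is deployed here


def monitors_for(node: NodeAttributes) -> List[str]:
    """Return the services whose monitoring process should be installed."""
    services = list(COMMON_SERVICES)
    if node.is_etcd_node:
        services.append("etcd")
    return services


if __name__ == "__main__":
    inventory = [
        NodeAttributes("master-1", is_etcd_node=True),
        NodeAttributes("master-2"),
    ]
    for node in inventory:
        print(node.hostname, monitors_for(node))
```

A deployment tool could iterate over such an inventory and install one monitoring process per returned service name.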

S2: After the monitoring process is installed and started, it periodically collects state information of the kubelet, docker and etcd services according to a set time period. The collection period can be set as required; by default, the monitoring process collects every 10 seconds. When any one of the kubelet, docker and etcd components cannot provide service, the monitoring process records the state information of that component as service abnormal.

The state information of the kubelet service includes:

(1) whether the kubelet service is registered on the local machine;

(2) whether the kubelet service is running;

(3) whether the service is available, as detected through the kubelet service health check API.

The state information of the docker service includes:

(1) whether the docker service is registered on the local machine;

(2) whether the docker service is running;

(3) whether the service is available, as determined from the return value of a docker service probe command.

The state information of the etcd service includes:

(1) whether the etcd service is registered on the local machine;

(2) whether the etcd service environment variable file exists;

(3) whether the etcd service is running;

(4) whether the service is available, as detected through the etcd service health check API.
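The checks listed above can be sketched roughly as below. Everything concrete in this sketch is an assumption rather than part of the disclosure: that the services are systemd units named `kubelet`, `docker` and `etcd`, that "registered" means `systemctl cat` finds a unit file, that the kubelet health endpoint is `http://127.0.0.1:10248/healthz`, that etcd answers `/health` on `http://127.0.0.1:2379` without TLS, that `docker info` is the probe command, and that the environment file lives at the hypothetical path `/etc/etcd/etcd.conf`. The command runner and HTTP probe are injectable so the logic can be exercised without a real cluster.

```python
import os
import subprocess
import urllib.request
from typing import Callable, Dict, Sequence

# Assumed endpoints and paths; adjust to the actual deployment.
KUBELET_HEALTHZ = "http://127.0.0.1:10248/healthz"
ETCD_HEALTH = "http://127.0.0.1:2379/health"
ETCD_ENV_FILE = "/etc/etcd/etcd.conf"  # hypothetical location


def run_command(argv: Sequence[str]) -> int:
    """Run a command quietly and return its exit code (127 on failure to run)."""
    try:
        return subprocess.run(argv, stdout=subprocess.DEVNULL,
                              stderr=subprocess.DEVNULL, timeout=10).returncode
    except (OSError, subprocess.TimeoutExpired):
        return 127


def http_ok(url: str) -> bool:
    """Return True if a GET on url answers with status 200 within 5 seconds."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except Exception:  # any failure counts as "unavailable"
        return False


def check_service(
    name: str,
    run: Callable[[Sequence[str]], int] = run_command,
    probe: Callable[[str], bool] = http_ok,
    file_exists: Callable[[str], bool] = os.path.exists,
) -> Dict[str, bool]:
    """Collect the state items listed above for one service."""
    unit = f"{name}.service"
    status = {
        "registered": run(["systemctl", "cat", unit]) == 0,
        "running": run(["systemctl", "is-active", "--quiet", unit]) == 0,
    }
    if name == "kubelet":
        status["available"] = probe(KUBELET_HEALTHZ)
    elif name == "docker":
        status["available"] = run(["docker", "info"]) == 0
    elif name == "etcd":
        status["env_file"] = file_exists(ETCD_ENV_FILE)
        status["available"] = probe(ETCD_HEALTH)
    else:
        raise ValueError(f"unknown service: {name}")
    return status


def is_abnormal(status: Dict[str, bool]) -> bool:
    """A service is treated as abnormal if any single check failed."""
    return not all(status.values())
```

Treating any failed item as "abnormal" is a design choice of this sketch; the description only says that a component that cannot provide service is recorded as abnormal.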

S3: when the state information recorded by the monitoring process shows that the service is abnormal, restarting processing is carried out on kubelet, docker and etcd services, and high availability of the service is ensured by a service failure restarting method.
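A minimal sketch of the periodic collection and restart behaviour of steps S2 and S3 is given below, under stated assumptions. The `check` and `restart` callables are placeholders supplied by the caller, the bounded in-memory history stands in for whatever record the monitoring process actually keeps, and the sketch restarts only the service reported abnormal, which is one possible reading of step S3.

```python
import time
from collections import deque
from typing import Callable, Deque, Dict, Iterable, Optional

SERVICES = ("kubelet", "docker", "etcd")


def monitor(
    check: Callable[[str], bool],      # returns True if the service is healthy
    restart: Callable[[str], None],    # restarts the named service
    services: Iterable[str] = SERVICES,
    period_s: float = 10.0,            # default period given in the description
    cycles: Optional[int] = None,      # None means run until interrupted
    sleep: Callable[[float], None] = time.sleep,
) -> Deque[Dict[str, bool]]:
    """Poll each service every period_s seconds and restart abnormal ones."""
    history: Deque[Dict[str, bool]] = deque(maxlen=1000)  # bounded record
    services = tuple(services)
    done = 0
    while cycles is None or done < cycles:
        record = {name: check(name) for name in services}
        history.append(record)
        for name, healthy in record.items():
            if not healthy:
                restart(name)
        done += 1
        if cycles is None or done < cycles:
            sleep(period_s)
    return history
```

Passing `cycles` and a no-op `sleep` makes the loop finite and fast, which is convenient for testing; a long-running monitoring process would leave both at their defaults.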

As a preferable example of the present embodiment, step S3 further includes:

Alarm thresholds for the kubelet, docker and etcd services are configured in advance, and when the abnormal duration of any one component exceeds the corresponding alarm threshold, alarm information is generated to inform operation and maintenance personnel to perform necessary manual checking and recovery, so that large-area service interruption is prevented. In this embodiment, the duration of the service abnormality is used as the alarm index and the alarm threshold is set to 1 minute: when a monitored service remains continuously abnormal for 1 minute, alarm information is generated to inform operation and maintenance personnel to perform necessary manual checking and recovery.
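The duration-based alarm described here can be illustrated with the sketch below. The class name, the per-service dictionary and the injectable clock are illustrative assumptions; the only values taken from the description are the use of abnormal duration as the alarm index and the 1 minute threshold. The sketch raises the alarm once the abnormal state has lasted at least the threshold, which is one interpretation of "exceeds".

```python
import time
from typing import Callable, Dict


class AbnormalDurationAlarm:
    """Track how long each service has been continuously abnormal."""

    def __init__(self, threshold_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold_s = threshold_s
        self._clock = clock
        self._abnormal_since: Dict[str, float] = {}

    def update(self, service: str, healthy: bool) -> bool:
        """Record the latest state; return True if an alarm should be raised."""
        now = self._clock()
        if healthy:
            # Recovery resets the timer for this service.
            self._abnormal_since.pop(service, None)
            return False
        started = self._abnormal_since.setdefault(service, now)
        return now - started >= self.threshold_s
```

Calling `update` once per collection cycle for each service is enough; a service that recovers and later fails again starts a fresh timer.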

As a preferable example of this embodiment, before performing the restart processing on the kubelet, docker and etcd services, the method further includes: testing the availability of the kubelet, docker and etcd services through the monitoring process, and if the test result still shows that a service is abnormal, restarting it through the restart instruction provided by the kubelet, docker and etcd services. In this embodiment, the monitoring process is controlled to execute 3 consecutive service availability tests, and the restart operation is executed only if the service is abnormal in all 3 consecutive tests.

In this embodiment, when the monitoring process detects that a service is abnormal, the restart is not performed immediately; instead, the detection result of the monitoring process is verified through multiple service availability tests, and the service is restarted only if multiple consecutive tests all indicate that it is abnormal. This effectively prevents a fault in the monitoring process from skewing the collected service state and causing the system to perform unnecessary restarts that waste system resources.
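The confirm-before-restart rule can be sketched as follows. The function name, the callables and the pause between tests are assumptions introduced for illustration (the description gives only the count of 3 consecutive tests and does not state an interval between them).

```python
import time
from typing import Callable


def confirm_and_restart(
    service: str,
    is_available: Callable[[str], bool],   # one availability test
    restart: Callable[[str], None],        # issues the restart instruction
    attempts: int = 3,                     # count used in this embodiment
    delay_s: float = 1.0,                  # assumed pause between tests
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Restart only if every one of `attempts` consecutive tests fails.

    Returns True if a restart was issued, False otherwise.
    """
    for attempt in range(attempts):
        if is_available(service):
            return False  # a single passing test cancels the restart
        if attempt < attempts - 1:
            sleep(delay_s)
    restart(service)
    return True
```

On a systemd host the `restart` callable might, for example, invoke `systemctl restart` on the corresponding unit; that choice is likewise an assumption and not stated in the description.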

As a preferable example of this embodiment, in order to guard against abnormality of the monitoring process itself, the monitoring method further includes: periodically checking, according to a set time period, whether the monitoring process is running, and if it is not running, starting the monitoring process again.

In this embodiment, a check script and a timed task are set up; the check script is started periodically by the timed task to check whether the monitoring process is running, and if it is not running, the monitoring process is started again, so that a failure of the monitoring process does not affect the normal service monitoring function.
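One way such a check script might work is sketched below. It assumes, purely for illustration, that the monitoring process writes its process ID to a PID file and that the script runs on a POSIX system; the file handling, function names and `start` callable are hypothetical and not taken from the description. The timed task that invokes the script (for example a cron entry) is outside the sketch.

```python
import os
from typing import Callable, Optional


def read_pid(pid_file: str) -> Optional[int]:
    """Return the PID stored in pid_file, or None if missing or malformed."""
    try:
        with open(pid_file) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None


def pid_is_alive(pid: int) -> bool:
    """POSIX only: signal 0 checks for existence without sending a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def ensure_monitor_running(pid_file: str, start: Callable[[], None]) -> bool:
    """Start the monitoring process if it is not running.

    Returns True if the process had to be started, False if it was alive.
    """
    pid = read_pid(pid_file)
    if pid is not None and pid_is_alive(pid):
        return False
    start()
    return True
```

A PID file can go stale if the PID is reused by an unrelated process, so a production check would probably also verify the process name.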

Because kubelet, docker and etcd are clustered, the single node failure does not affect the overall service, and therefore the cluster service function is not affected during the detection and recovery of the node service failure. Under the condition of single node failure, the cluster service function is normal; in the case of multiple node failures, the cluster service can quickly recover to normal in a short time.

This embodiment also provides a system capable of implementing the monitoring method for enhancing the high availability of the container cluster. As shown in Fig. 2, the system includes a configuration module, a monitoring module and an alarm module, wherein:

The configuration module is used for specifying the attributes of the deployment nodes of the kubelet, docker and etcd services, and calling an installation and deployment tool to install, on the deployment nodes, a monitoring module adapted to the corresponding services based on the specified node attributes; the monitoring module is used for running a monitoring process. The configuration module first defines the attributes of each deployment node, then calls an installation and deployment tool provided by the system to install, deploy and configure the monitoring module with the corresponding monitoring process according to the attributes of each deployment node, and the service running state of the components on that node is detected through the monitoring process.

The monitoring module is used for periodically acquiring and recording the state information of the kubelet, docker and etcd services according to a set time period and reporting the state information to the alarm module; when any one of the kubelet, docker and etcd components cannot provide service, the monitoring process records the state information of that component as service abnormal.

The alarm module is used for acquiring the state information collected by the monitoring module and, when the state information shows that a service is abnormal, restarting the kubelet, docker and etcd services.

Further preferably, the alarm module is also used for pre-configuring alarm thresholds for the kubelet, docker and etcd services, and generating alarm information when the abnormal duration of any service exceeds the corresponding alarm threshold, so as to notify operation and maintenance personnel to perform necessary manual checking and recovery, thereby preventing large-area service interruption. In this embodiment, the alarm module uses the duration of the service abnormality as the alarm index and the alarm threshold is set to 1 minute: when a monitored service remains continuously abnormal for 1 minute, alarm information is generated to notify operation and maintenance personnel to perform necessary manual checking and recovery.

Further preferably, the alarm module is further configured to control the monitoring module to test the availability of the kubelet, docker, and etcd services before the restart processing, and perform the restart processing through a restart instruction provided by the kubelet, docker, and etcd services if the test result still shows that the services are abnormal.

In the embodiment, the alarm module controls the monitoring module to continuously execute 3 service availability tests, and if the service is abnormal in all the 3 continuous tests, the restarting operation is executed; therefore, the situation that the collected service running state deviates due to the fault of the monitoring module, unnecessary restarting operation is executed, and system resources are wasted is effectively prevented.

As a preferred option of this embodiment, the monitoring system further includes an inspection module and a visualization module, wherein:

The inspection module is used for periodically running the check script according to a set time period to detect whether the monitoring process of the monitoring module is running, and if it is not running, starting the monitoring process again, so that a failure of the monitoring module does not affect the normal service monitoring function.

The visualization module is used for receiving the alarm information generated by the alarm module, visually displaying the alarm information, prompting an alarm on a system interface and informing operation and maintenance personnel to check as soon as possible.

All or part of the modules of the monitoring system can be realized by software, hardware and a combination thereof, and can be embedded in a processor of a computer device or independent of the processor in a hardware form, or can be stored in a memory of the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.

Fig. 3 shows a specific implementation of the monitoring system provided in this embodiment. As shown in Fig. 3, the monitoring system includes Watchdog, Node exporter, Prometheus, Check Watchdog and Grafana;

The Watchdog is mainly used for periodically acquiring and recording state information of the kubelet, docker and etcd services according to a set time period and reporting the state information to Prometheus;

Prometheus is mainly responsible for processing the state information collected by the Watchdog and judging whether to generate an alarm. When the service state information of any component shows that the service is abnormal, Prometheus restarts the kubelet, docker and etcd services. Prometheus also takes the duration of the service abnormality as the alarm index, and generates alarm information when the abnormal duration of the kubelet, docker or etcd service exceeds a preset alarm threshold, so as to inform operation and maintenance personnel to perform necessary manual checking and recovery and prevent large-area service interruption.

In addition, before the Prometheus executes the restarting processing, the Watchdog is controlled to test the availability of the kubelet, docker and etcd services, and if the test result still shows that the services are abnormal, the restarting processing is executed through a restarting instruction provided by the kubelet, docker and etcd services.

The Node exporter serves as the communication node between the Watchdog and Prometheus. It is built with the prometheus_client library and is mainly used for registering dashboard monitoring metrics, starting the Prometheus client service, calling the Watchdog at a set period to check the service availability state of each component in turn, and recording the obtained state into the monitoring metrics; the service is restarted if all three checks fail.
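To make the exporter's role concrete, the sketch below renders service states in the Prometheus text exposition format that Prometheus scrapes. It deliberately does not use the prometheus_client library named above; it formats the text by hand with the standard library only, so that it stays dependency-free. The metric name `watchdog_service_available` and the `service` label are hypothetical and are not given in the description.

```python
from typing import Dict

# Hypothetical metric name; the description does not name its metrics.
METRIC = "watchdog_service_available"


def render_metrics(states: Dict[str, bool]) -> str:
    """Render one gauge sample per service in Prometheus text format.

    A value of 1 means the service passed its availability check, 0 that
    it did not.
    """
    lines = [
        f"# HELP {METRIC} 1 if the service passed its availability check, else 0.",
        f"# TYPE {METRIC} gauge",
    ]
    for service in sorted(states):
        value = int(states[service])
        lines.append(f'{METRIC}{{service="{service}"}} {value}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_metrics({"kubelet": True, "docker": True, "etcd": False}), end="")
```

With a gauge of this shape, a duration-based alert on the Prometheus side could be expressed as a rule that fires when the value stays at 0 for the configured threshold.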

Grafana is mainly used for visually displaying the alarm information generated by Prometheus and prompting alarms on a system interface.

The Check Watchdog is used for periodically running the check script to detect whether the monitoring process of the Watchdog is running, and if it is not running, starting the monitoring process again to guard against failure of the Watchdog.

The present embodiment also provides a terminal device, which includes at least one processor and at least one memory, where the memory stores a computer program, and when the computer program is executed by the processor, the processor is enabled to execute the steps of the monitoring method. The type of processor and memory are not particularly limited, for example: the processor may be a microprocessor, digital information processor, on-chip programmable logic system, or the like; the memory may be volatile memory, non-volatile memory, a combination thereof, or the like.

The present embodiment also provides a computer-readable medium, which stores a computer program executable by a terminal device, and when the computer program runs on the terminal device, the computer program causes the terminal device to execute the steps of the monitoring method. Types of computer-readable media include, but are not limited to, storage media such as SD cards, USB flash drives, fixed hard disks, removable hard disks, and the like.

It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A monitoring method for enhancing high availability of a container cluster, comprising the steps of:

S1: configuring attributes of the deployment nodes of the kubelet, docker and etcd services, and installing, on the deployment nodes, monitoring processes adapted to the corresponding services based on the attributes;

2. The monitoring method of claim 1, further comprising the steps of:

configuring alarm thresholds for the kubelet, docker and etcd services respectively, and generating alarm information when the abnormal duration of any service exceeds the corresponding alarm threshold.

3. The monitoring method of claim 1, further comprising, prior to restarting kubelet, docker, etcd services:

4. The monitoring method of claim 1, further comprising the steps of:

5. The monitoring method of claim 2, wherein the state information of the kubelet service includes whether the kubelet service is registered on the local machine, whether the kubelet service is running, and whether the service is detected as available through an API;

6. A monitoring system for enhancing high availability of a container cluster, comprising:

the configuration module is used for specifying the attributes of the deployment nodes of the kubelet, docker and etcd services and installing, on the deployment nodes, a monitoring module adapted to the corresponding services based on the attributes;

7. The monitoring system of claim 6, wherein the alarm module is further configured to configure alarm thresholds for kubelet, docker, etcd services, and to generate alarm information when a service anomaly duration for any service exceeds a corresponding alarm threshold.

8. The monitoring system of claim 7, wherein the alarm module is further configured to control the monitoring module to test availability of kubelet, docker, and etcd services before restarting processing, and if the test result still indicates that the services are abnormal, restart processing is performed through a restart instruction provided by the kubelet, docker, and etcd services.

9. The monitoring system of claim 7, further comprising an inspection module and a visualization module;

10. A terminal device, characterized in that it comprises at least one processing unit and at least one storage unit,

wherein the storage unit stores a computer program which, when executed by the processing unit, causes the processing unit to carry out the steps of the monitoring method according to any one of claims 1 to 5.