CN110798375B - Monitoring method, system and terminal equipment for enhancing high availability of container cluster - Google Patents

Monitoring method, system and terminal equipment for enhancing high availability of container cluster

Info

Publication number
CN110798375B
Authority
CN
China
Prior art keywords
service
docker
etcd
services
monitoring
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910935666.9A
Other languages
Chinese (zh)
Other versions
CN110798375A (en)
Inventor
张东欧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fiberhome Telecommunication Technologies Co Ltd
Original Assignee
Fiberhome Telecommunication Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fiberhome Telecommunication Technologies Co Ltd filed Critical Fiberhome Telecommunication Technologies Co Ltd
Priority to CN201910935666.9A priority Critical patent/CN110798375B/en
Publication of CN110798375A publication Critical patent/CN110798375A/en
Application granted granted Critical
Publication of CN110798375B publication Critical patent/CN110798375B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/16 Threshold monitoring
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/08 Configuration management of networks or network elements
    • H04L41/0803 Configuration setting
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/08 Configuration management of networks or network elements
    • H04L41/0803 Configuration setting
    • H04L41/0823 Configuration setting characterised by the purposes of a change of settings, e.g. optimising configuration for enhancing reliability
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a monitoring method, a monitoring system and terminal equipment for enhancing the high availability of a container cluster, and relates to the technical field of containers. The monitoring method comprises the following steps: configuring attributes of the deployment nodes of the kubelet, docker and etcd services, and installing, based on the node attributes, monitoring processes adapted to the corresponding services on the deployment nodes; the monitoring processes periodically acquire and record state information of the kubelet, docker and etcd services according to a set time period; and when the state information shows that a service is abnormal, the kubelet, docker and etcd services are restarted. The invention periodically monitors the service states of kubelet, docker and etcd through the monitoring processes, ensures high availability of the services by restarting them on failure, raises alarms for abnormal services to notify operation and maintenance personnel to perform any necessary manual inspection and recovery, and thereby enhances the high availability of the cluster in high-concurrency scenarios.

Description

Monitoring method, system and terminal equipment for enhancing high availability of container cluster
Technical Field
The invention belongs to the technical field of containers, and particularly relates to a monitoring method, a monitoring system and a monitoring terminal device for enhancing high availability of a container cluster.
Background
A container cluster is composed of a plurality of server hosts running Kubernetes; different functional components run on different servers and together provide container services externally. Kubernetes serves as the management center for containerized applications: it manages the life cycles of all containers in the cluster and, combined with its own health-check and error-recovery mechanisms, achieves high availability at the application layer of the cluster. High availability is measured as the percentage of time the container cluster remains operational; it is typically increased either by eliminating single points of failure through redundant design or by shortening failure time through monitoring and recovery mechanisms. With the development of the IT industry and the rise of the microservice concept in Internet businesses, container clusters based on Kubernetes (hereinafter "k8s") have become increasingly popular and are now the mainstream practice of choice in the industry.
The traditional scheme deploys the container cluster services with native k8s and guarantees high availability of the cluster by deploying multiple Master nodes, so that when the service on any one Master node fails, the other nodes continue to provide service, the failed Master node can be recovered, and a multi-replica, highly available cluster continues to be provided.
Most platform components of the native cluster are orchestrated as containers, and their replica counts and recovery mechanisms are guaranteed by their container configuration information. The basic services that the platform components depend on, such as the kubelet, etcd and docker services, are registered as system services and are configured and managed by the operating system's service management. Generally, when one of these services fails and crashes, the system restarts it according to its configuration parameters. However, under large-scale system access, a service may enter a state in which it has not crashed but cannot serve requests normally, the so-called hung ("dead") state. The native mechanism can neither detect nor recover from this state, so the whole container cluster enters an abnormal, failed state and its availability is reduced.
Disclosure of Invention
In view of at least one defect of, or need for improvement in, the prior art, the invention provides a monitoring method, system and terminal equipment for enhancing the high availability of a container cluster. Monitoring processes are installed on the deployment nodes of the kubelet, docker and etcd services to monitor the service states of kubelet, docker and etcd periodically, and high availability of the services is ensured by restarting them on failure; meanwhile, alarms are raised for abnormal services to notify operation and maintenance personnel to perform any necessary manual inspection and recovery, which prevents large-area service interruption and enhances the high availability of the cluster in high-concurrency scenarios.
To achieve the above object, according to one aspect of the present invention, there is provided a monitoring method for enhancing high availability of a container cluster, the method comprising the steps of:
S1: configuring attributes of the deployment nodes of the kubelet, docker and etcd services, and calling an installation and deployment tool to install, based on the attributes, a monitoring process adapted to the corresponding service on each deployment node;
S2: the monitoring process periodically acquires and records state information of the kubelet, docker and etcd services according to a set time period;
S3: when the state information shows that a service is abnormal, restarting the kubelet, docker and etcd services.
Preferably, the monitoring method further includes the following steps:
alarm thresholds are configured for the kubelet, docker and etcd services respectively, and alarm information is generated when the abnormal duration of any service exceeds the corresponding alarm threshold, so as to notify operation and maintenance personnel to perform any necessary manual inspection and recovery and to prevent large-area service interruption.
Preferably, the monitoring method further includes, before performing a restart process on the kubelet, docker, etcd service:
testing the availability of the kubelet, docker and etcd services through the monitoring process, and if the test result still shows that a service is abnormal, restarting it through the restart instruction provided by the kubelet, docker and etcd services.
Preferably, the monitoring method further includes the following steps:
and regularly checking whether the monitoring process runs according to a set time period, and if the monitoring process does not run, re-running the monitoring process.
Preferably, in the monitoring method, the state information of the kubelet service includes whether the kubelet service is registered on the local machine, whether the kubelet service is running, and whether the service is available as detected through its API;
the state information of the docker service includes whether the docker service is registered on the local machine, whether the docker service is running, and whether the docker service is available as determined from the return value of a docker service command;
the state information of the etcd service includes whether the etcd service is registered on the local machine, whether the etcd service environment variable file exists, whether the etcd service is running, and whether the service is available as detected through its API (application program interface).
According to a second aspect of the present invention, there is also provided a monitoring system for enhancing high availability of a container cluster, comprising:
the configuration module is used for specifying the attributes of the deployment nodes of the kubelet, docker and etcd services, and calling an installation and deployment tool to install, based on the attributes, a monitoring module adapted to the corresponding service on each deployment node;
the monitoring module is used for periodically acquiring and recording the state information of the kubelet, docker and etcd services according to a set time period;
and the alarm module is used for obtaining the state information acquired by the monitoring module and, when the state information shows that a service is abnormal, restarting the kubelet, docker and etcd services.
Preferably, in the monitoring system, the alarm module is further used for configuring alarm thresholds for the kubelet, docker and etcd services, and generating alarm information when the abnormal duration of any service exceeds the corresponding alarm threshold, so as to notify operation and maintenance personnel to perform any necessary manual inspection and recovery, thereby preventing large-area service interruption.
Preferably, in the monitoring system, the alarm module is further configured to control the monitoring module to test the availability of the kubelet, docker, and etcd services before restarting processing, and restart processing is performed through a restart instruction provided by the kubelet, docker, and etcd services if a test result still shows that the services are abnormal.
Preferably, the monitoring system further comprises an inspection module and a visualization module;
the checking module is used for regularly checking whether the monitoring process of the monitoring module runs according to a set time period, and if the monitoring process does not run, the monitoring process is run again;
the visualization module is used for receiving the alarm information generated by the alarm module and performing visualization display.
According to a third aspect of the present invention, there is also provided a terminal device, comprising at least one processing unit and at least one storage unit,
wherein the storage unit stores a computer program which, when executed by the processing unit, causes the processing unit to carry out the steps of the monitoring method of any one of the above.
In general, compared with the prior art, the above technical solution contemplated by the present invention can achieve the following beneficial effects:
(1) according to the monitoring method, the monitoring system and the terminal equipment for enhancing the high availability of the container cluster, the monitoring processes are respectively installed on the deployment nodes of the kubelet, docker and etcd services to monitor the service states of the kubelet, docker and etcd at regular time, and the high availability of the services is ensured by a method of restarting service faults; meanwhile, the abnormal service is alarmed to inform operation and maintenance personnel to carry out necessary manual check recovery, so that large-area service interruption is prevented, and high availability of the cluster in a high-concurrency scene is enhanced.
(2) According to the monitoring method, the monitoring system and the terminal equipment for enhancing the high availability of the container cluster, when the monitoring process detects an abnormal service, the service is not restarted immediately; instead, the detection result is verified through multiple service availability tests, and the restart is executed only if several consecutive tests all indicate that the service is abnormal. This effectively prevents a fault in the monitoring process from skewing the collected service state, which would cause the system to execute unnecessary restarts and waste system resources.
(3) According to the monitoring method, the monitoring system and the terminal equipment for enhancing the high availability of the container cluster, a check script is started periodically to check whether the monitoring process is running, and the monitoring process is run again if it is not, so that a failure of the monitoring process does not impair the normal service monitoring function.
Drawings
FIG. 1 is a flow chart of a monitoring method for enhancing high availability of a container cluster according to an embodiment of the present invention;
FIG. 2 is a logic block diagram of a monitoring system for enhancing high availability of a container cluster according to an embodiment of the present invention;
FIG. 3 is a schematic composition diagram of a specific implementation of the monitoring system according to the embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
Fig. 1 is a flowchart of a monitoring method for enhancing high availability of a container cluster according to the present embodiment, and referring to fig. 1, the monitoring method includes the following steps:
S1: defining attributes of the deployment nodes of the kubelet, docker and etcd services, and calling an installation and deployment tool to install, based on the node attributes, a monitoring process adapted to the corresponding service on each deployment node;
Because the three components kubelet, docker and etcd must cooperate to provide the pod service, this scheme mainly monitors these three components, which cannot be managed by pods; the remaining platform components can achieve highly available service through the probe functions (probes) of their pods. The kubelet and docker components are deployed on all nodes (Master nodes and Node nodes), while the etcd component may be placed on Master nodes or Node nodes depending on the deployment configuration. Therefore, when Ansible is used to deploy the k8s cluster, the etcd nodes, i.e., the nodes on which the etcd service is deployed, need to be defined.
Different deployment nodes host different sets of services; for example, one Master node may be configured with the two components kubelet and docker, while another Master node is configured with all three components kubelet, docker and etcd. Therefore, in this embodiment, the attributes of each deployment node are defined first, and an installation and deployment tool provided by the system is then called to install, deploy and configure the corresponding monitoring process according to the attributes of each deployment node, so as to detect the service running state of the components on that node.
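The attribute-driven selection of monitoring processes described above can be sketched as follows. This is only an illustration under stated assumptions, not the patent's implementation: the attribute key `etcd`, the function `monitors_for_node` and the sample inventory are hypothetical names introduced here.

```python
# Hypothetical sketch: derive which services to monitor on a node
# from its deployment attributes. All names are illustrative assumptions.

def monitors_for_node(attributes):
    """Return the services to monitor on a node with the given attributes."""
    # kubelet and docker are deployed on every node (Master and Node).
    services = ["kubelet", "docker"]
    # etcd is monitored only on nodes that were defined as etcd nodes.
    if attributes.get("etcd", False):
        services.append("etcd")
    return services


# Example inventory: two Master nodes, only one of which hosts etcd.
inventory = {
    "master-1": {"role": "master", "etcd": False},
    "master-2": {"role": "master", "etcd": True},
}

# Monitoring plan per node, as an installation tool might consume it.
plan = {name: monitors_for_node(attrs) for name, attrs in inventory.items()}
```

With this inventory, `plan` lists kubelet and docker for master-1 and all three services for master-2.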
S2: the monitoring process regularly acquires and records state information of kubelet, docker and etcd services according to a set time period;
After the monitoring process is installed and started, it collects state information of the kubelet, docker and etcd services periodically according to a set time period. The collection period can be set as required; by default the monitoring process collects every 10 seconds. When any one of the kubelet, docker and etcd components cannot provide service, the monitoring process records the state information of the corresponding component as a service abnormality.
The state information of the kubelet service includes:
(1) detecting whether the kubelet service is registered on the local machine;
(2) detecting whether the kubelet service is running;
(3) detecting whether the service is available through the kubelet service health check API;
the state information of the docker service includes:
(1) detecting whether the docker service is registered on the local machine;
(2) detecting whether the docker service is running;
(3) detecting whether the service is available from the return value of a probing docker service command;
the state information of the etcd service includes:
(1) detecting whether the etcd service is registered on the local machine;
(2) detecting whether the etcd service environment variable file exists;
(3) detecting whether the etcd service is running;
(4) detecting whether the service is available through the etcd service health check API.
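The layered checks listed above can be sketched as an ordered list of named checks evaluated in sequence. This is a minimal sketch under stated assumptions: `evaluate_checks` and the check names are hypothetical, and the lambdas are stand-ins for real queries to the service manager, the file system and the health check API.

```python
def evaluate_checks(checks):
    """Run named checks in order; return (healthy, name_of_first_failed_check)."""
    for name, check in checks:
        if not check():
            return False, name
    return True, None


# Stand-in results (assumptions): the service is registered, configured and
# running, but its health check API does not answer, i.e. the hung state.
etcd_checks = [
    ("registered", lambda: True),
    ("env_file_exists", lambda: True),
    ("running", lambda: True),
    ("health_api", lambda: False),
]

result = evaluate_checks(etcd_checks)
```

Ordering the checks from cheapest to most specific means the recorded failure names the first layer that broke; here the service is running but the last check fails, which is the case a plain process-liveness check would miss.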
S3: when the state information recorded by the monitoring process shows that the service is abnormal, restarting processing is carried out on kubelet, docker and etcd services, and high availability of the service is ensured by a service failure restarting method.
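Steps S2 and S3 together form a simple collect-and-restart cycle, sketched below. The function `run_monitor`, its parameters and the bounded `rounds` loop are assumptions made for illustration; the sketch also restarts only the service whose probe failed, which is one possible reading of the restart step.

```python
import time


def run_monitor(probes, restart, rounds, period_seconds=10, sleep=time.sleep):
    """Collect service states once per period and restart abnormal services.

    probes maps a service name to a zero-argument callable that returns True
    when the service can serve requests. rounds bounds the loop so that the
    sketch terminates; a real monitoring process would loop indefinitely.
    """
    records = []
    for cycle in range(rounds):
        for service, probe in probes.items():
            healthy = probe()
            records.append((cycle, service, "normal" if healthy else "abnormal"))
            if not healthy:
                restart(service)  # assumption: restart only the failed service
        sleep(period_seconds)
    return records


# Demonstration with stand-in probes and a no-op sleep.
restarted = []
records = run_monitor(
    {"kubelet": lambda: True, "etcd": lambda: False},
    restart=restarted.append,
    rounds=2,
    sleep=lambda seconds: None,
)
```

Passing `sleep` as a parameter keeps the 10-second default from the embodiment while letting the cycle be exercised without waiting.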
As a preferable example of the present embodiment, step S3 further includes:
Alarm thresholds are configured in advance for the kubelet, docker and etcd services, and when the abnormal duration of any one component exceeds the corresponding alarm threshold, alarm information is generated to notify operation and maintenance personnel to perform any necessary manual inspection and recovery, so that large-area service interruption is prevented. In this embodiment, the duration of the service abnormality is used as the alarm metric and the alarm threshold is set to 1 minute: when a monitored service remains in the abnormal state continuously for 1 minute, alarm information is generated to notify operation and maintenance personnel.
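The duration-based alarm can be sketched as a small state holder that remembers when a service first became abnormal. The class name `AbnormalDurationAlarm` and its interface are hypothetical; treating "reaching" the threshold as sufficient (`>=`) is also an assumption, since the text does not say whether the comparison is strict.

```python
class AbnormalDurationAlarm:
    """Raise an alarm once a service has stayed abnormal for a threshold time."""

    def __init__(self, threshold_seconds=60.0):
        self.threshold_seconds = threshold_seconds
        self.abnormal_since = None  # timestamp of the first abnormal sample

    def update(self, healthy, now):
        """Feed one sample taken at time `now`; return True if an alarm is due."""
        if healthy:
            # Any normal sample ends the abnormal streak.
            self.abnormal_since = None
            return False
        if self.abnormal_since is None:
            self.abnormal_since = now
        # Assumption: reaching the threshold is enough to alarm.
        return now - self.abnormal_since >= self.threshold_seconds


# Samples every 10 seconds, abnormal from t=0 onwards.
alarm = AbnormalDurationAlarm(threshold_seconds=60.0)
fired = [alarm.update(False, float(t)) for t in range(0, 70, 10)]
```

One such object per service gives each of kubelet, docker and etcd its own threshold, as the text requires.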
As a preferable example of this embodiment, before performing the restart process on the kubelet, docker, etcd service, the method further includes: testing the availability of the kubelet, docker and etcd services through the monitoring process, and if the test result still shows that a service is abnormal, restarting it through the restart instruction provided by the kubelet, docker and etcd services. In this embodiment, the monitoring process is controlled to execute 3 consecutive service availability tests, and the restart operation is executed only if the service is abnormal in all 3 tests;
in this embodiment, when the monitoring process detects that a service is abnormal, the service is not restarted immediately; instead, the detection result is verified through multiple service availability tests, and the restart is executed only if several consecutive tests all indicate that the service is abnormal. This effectively prevents a fault in the monitoring process from skewing the collected service state, which would cause the system to execute unnecessary restarts and waste system resources.
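The confirm-before-restart rule can be sketched as follows. The function `confirm_and_restart` and its callable parameters are hypothetical; in practice `test_service` would wrap an availability test and `restart_service` the restart instruction of the service concerned.

```python
def confirm_and_restart(test_service, restart_service, attempts=3):
    """Re-test a suspect service; restart it only if every test fails.

    Returns True if a restart was issued, False if any test succeeded.
    """
    for _ in range(attempts):
        if test_service():
            # The service answered, so the earlier reading was a false alarm.
            return False
    restart_service()
    return True


# Demonstration with stand-in callables.
restarts = []
answers = iter([False, False, True])  # third test succeeds
recovered = confirm_and_restart(lambda: next(answers), lambda: restarts.append("etcd"))
hung = confirm_and_restart(lambda: False, lambda: restarts.append("etcd"))
```

Stopping at the first successful test keeps a transient glitch from triggering a restart, which is the waste the embodiment aims to avoid.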
As a preferable example of this embodiment, in order to prevent the monitoring process from being abnormal, the monitoring method further includes: and regularly checking whether the monitoring process runs according to a set time period, and if the monitoring process does not run, re-running the monitoring process.
In this embodiment, a check script and a scheduled task are set up; the scheduled task starts the check script periodically to check whether the monitoring process is running, and the monitoring process is run again if it is not, so that a failure of the monitoring process does not impair the normal service monitoring function.
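The check on the monitoring process itself can be sketched with the standard `subprocess` module. This is a sketch under assumptions: the class `MonitorSupervisor` is hypothetical and keeps a handle to the process it started, whereas the embodiment uses a separately scheduled check script.

```python
import subprocess
import sys


class MonitorSupervisor:
    """Start the monitoring process and re-run it whenever it is found dead."""

    def __init__(self, command):
        self.command = command  # argument list used to launch the monitor
        self.process = None

    def check(self):
        """Return True if the monitor had to be (re)started, else False."""
        # Popen.poll() returns None while the child process is still running.
        if self.process is not None and self.process.poll() is None:
            return False
        self.process = subprocess.Popen(self.command)
        return True
```

A scheduled task would call `check()` at a fixed period, for example on a supervisor created with `MonitorSupervisor([sys.executable, "watchdog.py"])`, where `watchdog.py` is a placeholder for the monitoring script.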
Because kubelet, docker and etcd are clustered, the single node failure does not affect the overall service, and therefore the cluster service function is not affected during the detection and recovery of the node service failure. Under the condition of single node failure, the cluster service function is normal; in the case of multiple node failures, the cluster service can quickly recover to normal in a short time.
The embodiment also provides a system capable of implementing the monitoring method for enhancing the high availability of the container cluster, and as shown in fig. 2, the system includes a configuration module, a monitoring module and an alarm module; wherein,
the configuration module is used for specifying the attributes of the deployment nodes of the kubelet, docker and etcd services, and calling an installation and deployment tool to install, based on the specified node attributes, a monitoring module adapted to the corresponding service on each deployment node; the monitoring module is used for running a monitoring process. The configuration module first defines the attributes of each deployment node, then calls an installation and deployment tool provided by the system to install, deploy and configure the monitoring module with the corresponding monitoring process according to the attributes of each deployment node, and the service running state of the components on that node is detected through the monitoring process.
The monitoring module is used for periodically acquiring and recording the state information of the kubelet, docker and etcd services according to a set time period and reporting the state information to the alarm module; when any one of the kubelet, docker and etcd components cannot provide service, the monitoring process records the state information of the corresponding component as a service abnormality.
The alarm module is used for obtaining the state information acquired by the monitoring module and, when the state information shows that a service is abnormal, restarting the kubelet, docker and etcd services.
Further preferably, the alarm module is further used for pre-configuring alarm thresholds for the kubelet, docker and etcd services, and generating alarm information when the abnormal duration of any service exceeds the corresponding alarm threshold, so as to notify operation and maintenance personnel to perform any necessary manual inspection and recovery, thereby preventing large-area service interruption. In this embodiment, the alarm module uses the duration of the service abnormality as the alarm metric and the alarm threshold is set to 1 minute: when a monitored service remains in the abnormal state continuously for 1 minute, alarm information is generated to notify operation and maintenance personnel.
Further preferably, the alarm module is further configured to control the monitoring module to test the availability of the kubelet, docker, and etcd services before the restart processing, and perform the restart processing through a restart instruction provided by the kubelet, docker, and etcd services if the test result still shows that the services are abnormal.
In the embodiment, the alarm module controls the monitoring module to continuously execute 3 service availability tests, and if the service is abnormal in all the 3 continuous tests, the restarting operation is executed; therefore, the situation that the collected service running state deviates due to the fault of the monitoring module, unnecessary restarting operation is executed, and system resources are wasted is effectively prevented.
As a preferable example of this embodiment, the monitoring system further includes a checking module and a visualization module; wherein,
the checking module is used for regularly running the checking script according to a set time period to detect whether the monitoring process of the monitoring module runs or not, and if the monitoring process does not run, the monitoring process is run again to prevent the monitoring module from being out of order and influencing the normal service monitoring function.
The visualization module is used for receiving the alarm information generated by the alarm module, visually displaying the alarm information, prompting an alarm on a system interface and informing operation and maintenance personnel to check as soon as possible.
All or part of the modules of the monitoring system can be realized by software, hardware and a combination thereof, and can be embedded in a processor of a computer device or independent of the processor in a hardware form, or can be stored in a memory of the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
Fig. 3 shows a specific implementation of the monitoring system provided in this embodiment; as shown in Fig. 3, the monitoring system includes Watchdog, Node exporter, Prometheus, Check Watchdog and Grafana;
the Watchdog is mainly used for periodically acquiring and recording state information of the kubelet, docker and etcd services according to a set time period and reporting the state information to Prometheus;
Prometheus is mainly responsible for processing the state information collected by the Watchdog and judging whether to generate an alarm; when the service state information of any component shows that the service is abnormal, Prometheus restarts the kubelet, docker and etcd services; Prometheus uses the duration of the service abnormality as the alarm metric and generates alarm information when the abnormal duration of the kubelet, docker or etcd service exceeds a preset alarm threshold, so as to notify operation and maintenance personnel to perform any necessary manual inspection and recovery and to prevent large-area service interruption.
In addition, before Prometheus executes the restart, the Watchdog is controlled to test the availability of the kubelet, docker and etcd services, and if the test result still shows that a service is abnormal, the restart is executed through the restart instruction provided by the kubelet, docker and etcd services.
The Node exporter serves as the communication node between the Watchdog and Prometheus and is built with the prometheus_client library. It is mainly used for registering dashboard monitoring metrics, starting a Prometheus client service, calling the Watchdog at a set period to check the service availability state of each component in turn, and recording the obtained availability state in the monitoring metrics; the service is restarted if three checks fail.
Grafana is mainly used for visually displaying the alarm information generated by Prometheus and prompting alarms on the system interface.
The Check Watchdog is used for regularly running the Check script to detect whether the monitoring process of the Watchdog runs, and if the monitoring process does not run, the monitoring process is run again to prevent the Watchdog from being out of order.
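How per-service availability might be exposed to Prometheus can be sketched by rendering the states in the Prometheus text exposition format. This sketch deliberately uses only the standard library rather than the prometheus_client library named above; the function `render_metrics`, the metric name `service_available` and the `service` label are assumptions made for illustration.

```python
def render_metrics(states, metric_name="service_available"):
    """Render availability states as Prometheus text-format gauge samples.

    states maps a service name to True (available) or False (unavailable).
    """
    lines = [
        f"# HELP {metric_name} 1 if the service is available, 0 otherwise.",
        f"# TYPE {metric_name} gauge",
    ]
    for service, available in states.items():
        value = 1 if available else 0
        lines.append(f'{metric_name}{{service="{service}"}} {value}')
    return "\n".join(lines) + "\n"


# Example: kubelet and docker answer, etcd does not.
text = render_metrics({"kubelet": True, "docker": True, "etcd": False})
```

An alerting rule on the Prometheus side could then compare how long a sample has stayed at 0 against the configured alarm threshold.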
The present embodiment also provides a terminal device, which includes at least one processor and at least one memory, where the memory stores a computer program, and when the computer program is executed by the processor, the processor is enabled to execute the steps of the monitoring method. The type of processor and memory are not particularly limited, for example: the processor may be a microprocessor, digital information processor, on-chip programmable logic system, or the like; the memory may be volatile memory, non-volatile memory, a combination thereof, or the like.
The present embodiment also provides a computer-readable medium, which stores a computer program executable by a terminal device, and when the computer program runs on the terminal device, the computer program causes the terminal device to execute the steps of the monitoring method. Types of computer readable media include, but are not limited to, storage media such as SD cards, usb disks, fixed hard disks, removable hard disks, and the like.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A monitoring method for enhancing high availability of a container cluster, comprising the steps of:
S1: configuring attributes of deployment nodes of kubelet, docker and etcd services, and installing, based on the attributes, a monitoring process adapted to the corresponding services on the deployment nodes so as to detect the service running states of the components of the corresponding deployment nodes;
S2: the monitoring process periodically acquires and records state information of the kubelet, docker and etcd services according to a set time period;
S3: when the state information shows that a service is abnormal, restarting the kubelet, docker and etcd services.
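Steps S2 and S3 can be illustrated with the following sketch. It is an interpretation under stated assumptions rather than the claimed method itself: the function names, the boolean `get_state` callback, and the fixed number of `rounds` are hypothetical, and the sketch restarts only the service whose state is abnormal.

```python
import time
from typing import Callable, Iterable, List, Tuple

SERVICES = ("kubelet", "docker", "etcd")


def monitor_once(
    get_state: Callable[[str], bool],
    restart: Callable[[str], None],
    records: List[Tuple[str, bool]],
    services: Iterable[str] = SERVICES,
) -> None:
    """S2 and S3 for one period: record each state, restart abnormal services."""
    for name in services:
        healthy = get_state(name)
        records.append((name, healthy))
        if not healthy:
            restart(name)


def monitor(get_state, restart, period_s: float, rounds: int, sleep=time.sleep):
    """Run monitor_once every period_s seconds for a fixed number of rounds."""
    records: List[Tuple[str, bool]] = []
    for i in range(rounds):
        monitor_once(get_state, restart, records)
        if i + 1 < rounds:
            sleep(period_s)  # wait for the next monitoring period
    return records
```

The `sleep` parameter is injectable only so that the loop can be run without real delays.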
2. The monitoring method of claim 1, further comprising the steps of:
configuring alarm thresholds for the kubelet, docker and etcd services respectively, and generating alarm information when the abnormal duration of any service exceeds the corresponding alarm threshold.
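The threshold logic of this claim might be sketched as below. The class name, the use of seconds as the unit, and the alarm message format are hypothetical; the claim states only that each service has its own threshold and that an alarm is generated once the abnormal duration exceeds it.

```python
from typing import Dict, Optional


class AlarmTracker:
    """Track how long each service has been abnormal and raise alarms."""

    def __init__(self, thresholds_s: Dict[str, float]) -> None:
        self.thresholds_s = thresholds_s            # per-service alarm threshold
        self.abnormal_since: Dict[str, float] = {}  # time the anomaly began

    def observe(self, service: str, healthy: bool, now: float) -> Optional[str]:
        """Record one observation; return an alarm message or None."""
        if healthy:
            # Recovery resets the abnormal duration for this service.
            self.abnormal_since.pop(service, None)
            return None
        start = self.abnormal_since.setdefault(service, now)
        duration = now - start
        if duration > self.thresholds_s[service]:
            return f"{service} abnormal for {duration:.0f}s"
        return None
```

Passing `now` explicitly, instead of reading the clock inside the class, keeps the duration arithmetic easy to verify.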
3. The monitoring method of claim 1, further comprising, prior to restarting kubelet, docker, etcd services:
testing the availability of the kubelet, docker and etcd services through the monitoring process, and if the test result still shows that the services are abnormal, restarting the services through a restart instruction provided by the kubelet, docker and etcd services.
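A rough sketch of this re-test-then-restart step follows. It assumes the services are managed as systemd units, so that `systemctl restart` plays the role of the restart instruction; the claim does not say so, and the function names and the `test_available` callback are hypothetical.

```python
import subprocess
from typing import Callable, List


def restart_command(service: str) -> List[str]:
    # Assumption: the service is a systemd unit with the same name.
    return ["systemctl", "restart", service]


def recheck_then_restart(
    service: str,
    test_available: Callable[[str], bool],
    run=subprocess.run,
) -> bool:
    """Test availability once more; restart only if the service is still abnormal.

    Returns True when a restart command was issued.
    """
    if test_available(service):
        return False  # the service recovered, so no restart is needed
    run(restart_command(service), check=False)
    return True
```

The extra availability test avoids restarting a service that has already recovered from a transient fault.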
4. The monitoring method of claim 1, further comprising the steps of:
periodically checking, according to a set time period, whether the monitoring process is running, and if the monitoring process is not running, running the monitoring process again.
5. The monitoring method of claim 2, wherein the state information of the kubelet service includes whether the kubelet service is registered locally, whether the kubelet service is running, and whether the service is detected as available through an API;
the state information of the docker service includes whether the docker service is registered locally, whether the docker service is running, and whether the docker service is available as determined from the return value of a docker service command;
the state information of the etcd service includes whether the etcd service is registered locally, whether an etcd service environment variable file exists, whether the etcd service is running, and whether the service is detected as available through an API (application program interface).
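The state information listed in this claim could be gathered along the lines of the sketch below. All concrete commands are assumptions, since the claim names none: `systemctl cat` and `systemctl is-active` stand in for the "registered" and "running" checks, the kubelet probe uses its default local healthz port, and the etcd environment file path is purely hypothetical. The `run` callback is expected to execute a command and return its exit code.

```python
import os
from dataclasses import dataclass
from typing import Callable, List

Runner = Callable[[List[str]], int]  # runs a command, returns its exit code

# Assumed availability probes, one per service.
AVAILABILITY_PROBES = {
    "kubelet": ["curl", "-fsS", "http://127.0.0.1:10248/healthz"],
    "docker": ["docker", "info"],
    "etcd": ["etcdctl", "endpoint", "health"],
}


@dataclass
class ServiceStatus:
    registered: bool
    running: bool
    available: bool
    env_file_present: bool = True  # only checked for etcd

    @property
    def healthy(self) -> bool:
        return (self.registered and self.running
                and self.available and self.env_file_present)


def probe(service: str, run: Runner, file_exists=os.path.exists,
          etcd_env_file: str = "/etc/etcd/etcd.conf") -> ServiceStatus:
    """Collect the per-service state information using exit codes."""
    status = ServiceStatus(
        registered=run(["systemctl", "cat", service]) == 0,
        running=run(["systemctl", "is-active", "--quiet", service]) == 0,
        available=run(AVAILABILITY_PROBES[service]) == 0,
    )
    if service == "etcd":
        # Hypothetical path; the claim only says such a file must exist.
        status.env_file_present = file_exists(etcd_env_file)
    return status
```

A caller could pass `lambda cmd: subprocess.run(cmd).returncode` as `run` on a real node.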
6. A monitoring system for enhancing high availability of a container cluster, comprising:
a configuration module, used for specifying attributes of deployment nodes of kubelet, docker and etcd services, and installing, based on the attributes, a monitoring module adapted to the corresponding services on the deployment nodes so as to detect the service running states of the components of the corresponding deployment nodes;
a monitoring module, used for periodically acquiring and recording the state information of the kubelet, docker and etcd services according to a set time period;
and an alarm module, used for obtaining the state information acquired by the monitoring module, and restarting the kubelet, docker and etcd services when the state information shows that a service is abnormal.
7. The monitoring system of claim 6, wherein the alarm module is further configured to configure alarm thresholds for kubelet, docker, etcd services, and to generate alarm information when a service anomaly duration for any service exceeds a corresponding alarm threshold.
8. The monitoring system of claim 7, wherein the alarm module is further configured to control the monitoring module to test availability of kubelet, docker, and etcd services before restarting processing, and if the test result still indicates that the services are abnormal, restart processing is performed through a restart instruction provided by the kubelet, docker, and etcd services.
9. The monitoring system of claim 7, further comprising an inspection module and a visualization module;
the checking module is used for regularly checking whether the monitoring process of the monitoring module runs according to a set time period, and if the monitoring process does not run, the monitoring process is run again;
the visualization module is used for receiving the alarm information generated by the alarm module and performing visualization display.
10. A terminal device, characterized in that it comprises at least one processing unit and at least one memory unit, wherein the memory unit stores a computer program which, when executed by the processing unit, causes the processing unit to carry out the steps of the monitoring method according to any one of claims 1 to 5.
CN201910935666.9A 2019-09-29 2019-09-29 Monitoring method, system and terminal equipment for enhancing high availability of container cluster Active CN110798375B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910935666.9A CN110798375B (en) 2019-09-29 2019-09-29 Monitoring method, system and terminal equipment for enhancing high availability of container cluster

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910935666.9A CN110798375B (en) 2019-09-29 2019-09-29 Monitoring method, system and terminal equipment for enhancing high availability of container cluster

Publications (2)

Publication Number Publication Date
CN110798375A CN110798375A (en) 2020-02-14
CN110798375B true CN110798375B (en) 2021-10-01

Family

ID=69438676

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910935666.9A Active CN110798375B (en) 2019-09-29 2019-09-29 Monitoring method, system and terminal equipment for enhancing high availability of container cluster

Country Status (1)

Country Link
CN (1) CN110798375B (en)

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111338871A (en) * 2020-02-27 2020-06-26 苏州浪潮智能科技有限公司 Distributed file system Qzone high availability test method, system, equipment and storage medium
CN111752805A (en) * 2020-07-01 2020-10-09 浪潮云信息技术股份公司 Cloud server resource monitoring and warning system
CN111984366B (en) * 2020-07-24 2023-01-06 苏州浪潮智能科技有限公司 Method and system for containerized deployment of disaster recovery mechanism
CN112162821B (en) * 2020-09-25 2022-04-26 中国电力科学研究院有限公司 Container cluster resource monitoring method, device and system
CN112214323B (en) * 2020-10-12 2022-06-14 苏州浪潮智能科技有限公司 Resource recovery method and device and computer readable storage medium
US11397632B2 (en) 2020-10-30 2022-07-26 Red Hat, Inc. Safely recovering workloads within a finite timeframe from unhealthy cluster nodes
CN112769922B (en) * 2020-12-31 2022-07-12 南京视察者智能科技有限公司 Device and method for self-starting micro service cluster
CN112769652B (en) * 2021-01-14 2022-12-16 苏州浪潮智能科技有限公司 Node service monitoring method, device, equipment and medium
CN112994935B (en) * 2021-02-04 2022-06-17 烽火通信科技股份有限公司 prometheus management and control method, device, equipment and storage medium
CN112965874B (en) * 2021-03-04 2023-02-28 浪潮云信息技术股份公司 Configurable monitoring alarm method and system
CN113064762B (en) * 2021-04-09 2024-02-23 上海新炬网络信息技术股份有限公司 Service self-recovery method based on various detection
CN113422692A (en) * 2021-05-28 2021-09-21 作业帮教育科技(北京)有限公司 Method, device and storage medium for detecting and processing node faults in K8s cluster
CN113485896A (en) * 2021-07-22 2021-10-08 京东方科技集团股份有限公司 Container state monitoring method, device, system and medium
CN113590420B (en) * 2021-07-28 2024-04-12 杭州玳数科技有限公司 Cluster state supervision method and device
CN113568707B (en) * 2021-07-29 2024-06-25 中国船舶重工集团公司第七一九研究所 Computer control method and system for ocean platform based on container technology
US11947660B2 (en) 2021-08-31 2024-04-02 International Business Machines Corporation Securing pods in a container orchestration environment
CN113965459A (en) * 2021-10-08 2022-01-21 浪潮云信息技术股份公司 Consul-based method for monitoring host network to realize high availability of computing nodes
US11966280B2 (en) 2022-03-17 2024-04-23 Walmart Apollo, Llc Methods and apparatus for datacenter monitoring
CN114860270B (en) * 2022-04-29 2024-08-09 济南浪潮数据技术有限公司 Monitoring management method for node to be injected with fault and related components
CN115174644B (en) * 2022-06-28 2023-09-12 武汉烽火技术服务有限公司 Container cluster service start-stop control method, device, equipment and storage medium
CN115396278A (en) * 2022-08-11 2022-11-25 西安雷风电子科技有限公司 System exception handling method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160182330A1 (en) * 2009-12-10 2016-06-23 Royal Bank Of Canada Coordinated processing of data by networked computing resources
CN106130778A (en) * 2016-07-18 2016-11-16 浪潮电子信息产业股份有限公司 A kind of method processing clustering fault and a kind of management node
CN108549580A (en) * 2018-03-30 2018-09-18 平安科技(深圳)有限公司 Methods and terminal device of the automatic deployment Kubernetes from node
CN108737215A (en) * 2018-05-29 2018-11-02 郑州云海信息技术有限公司 A kind of method and apparatus of cloud data center Kubernetes clusters container health examination

Also Published As

Publication number Publication date
CN110798375A (en) 2020-02-14

Similar Documents

Publication Publication Date Title
CN110798375B (en) Monitoring method, system and terminal equipment for enhancing high availability of container cluster
CN109495312B (en) Method and system for realizing high-availability cluster based on arbitration disk and double links
US10491671B2 (en) Method and apparatus for switching between servers in server cluster
CN109144789B (en) Method, device and system for restarting OSD
CN107480014B (en) High-availability equipment switching method and device
US20180067795A1 (en) Systems and methods for automatic replacement and repair of communications network devices
CN107660289B (en) Automatic network control
CN109726046B (en) Machine room switching method and device
CN106789306B (en) Method and system for detecting, collecting and recovering software fault of communication equipment
WO2015169199A1 (en) Anomaly recovery method for virtual machine in distributed environment
CN113726553A (en) Node fault recovery method and device, electronic equipment and readable storage medium
CN110825490A (en) Kubernetes container-based application health check method and system
CN105607973B (en) Method, device and system for processing equipment fault in virtual machine system
CN107453932B (en) Distributed storage system management method and device
CN115994044B (en) Database fault processing method and device based on monitoring service and distributed cluster
US7373542B2 (en) Automatic startup of a cluster system after occurrence of a recoverable error
CN112068935B (en) Kubernetes program deployment monitoring method, kubernetes program deployment monitoring device and kubernetes program deployment monitoring equipment
CN114020509A (en) Method, device and equipment for repairing work load cluster and readable storage medium
CN107491344B (en) Method and device for realizing high availability of virtual machine
CN111966520A (en) Database high-availability switching method, device and system
CN115964142A (en) Application service management method, device and storage medium
CN110321261B (en) Monitoring system and monitoring method
CN115328735A (en) Fault isolation method and system based on containerized application management system
CN115712521A (en) Cluster node fault processing method, system and medium
US9798608B2 (en) Recovery program using diagnostic results

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant