CN110798375B - Monitoring method, system and terminal equipment for enhancing high availability of container cluster - Google Patents
- Publication number
- CN110798375B (application CN201910935666.9A)
- Authority
- CN
- China
- Prior art keywords
- service
- docker
- etcd
- services
- monitoring
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/16—Threshold monitoring
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0631—Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/08—Configuration management of networks or network elements
- H04L41/0803—Configuration setting
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/08—Configuration management of networks or network elements
- H04L41/0803—Configuration setting
- H04L41/0823—Configuration setting characterised by the purposes of a change of settings, e.g. optimising configuration for enhancing reliability
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Debugging And Monitoring (AREA)
Abstract
The invention discloses a monitoring method, a monitoring system and terminal equipment for enhancing the high availability of a container cluster, and relates to the technical field of containers. The monitoring method comprises the following steps: configuring attributes of the deployment nodes of the kubelet, docker and etcd services, and installing, based on the node attributes, monitoring processes adapted to the corresponding services on the deployment nodes; the monitoring processes periodically acquire and record state information of the kubelet, docker and etcd services at a set interval; and when the state information shows that a service is abnormal, the kubelet, docker and etcd services are restarted. The invention periodically monitors the service states of kubelet, docker and etcd through the monitoring processes, ensures high availability of the services by restarting failed services, raises alarms for abnormal services to notify operation and maintenance personnel to perform any necessary manual inspection and recovery, and enhances the high availability of the cluster in high-concurrency scenarios.
Description
Technical Field
The invention belongs to the technical field of containers, and particularly relates to a monitoring method, a monitoring system and a monitoring terminal device for enhancing high availability of a container cluster.
Background
A container cluster is composed of a plurality of server hosts running Kubernetes; different functional components run on different servers and together provide container services externally. Kubernetes acts as the management center for containerized applications, manages the life cycle of every container in the cluster, and achieves high availability at the application layer by combining its own health check and error recovery mechanisms. High availability is measured as the percentage of time during which the container cluster remains in normal service; it is typically improved either by eliminating single points of failure through redundant design or by shortening failure time through monitoring and recovery mechanisms. With the development of the IT industry and the rise of the microservice concept in internet businesses, container clusters based on Kubernetes (hereinafter "k8s") have become increasingly popular and are now the mainstream practice in the industry.
The traditional scheme deploys the container cluster services with native k8s and guarantees high availability of the container cluster by deploying a plurality of Master nodes, so that when the service of any one Master node fails, the other nodes can continue to provide service, the failed Master node can be recovered, and a multi-replica highly available cluster continues to be provided.
Most platform components of a native cluster are deployed as containers, and their replica counts and recovery mechanisms are guaranteed by the container configuration of those components. The basic services that these containers depend on, such as the kubelet, etcd and docker services, are registered as system services and are configured and managed through systemd. Generally, when one of these services fails and crashes, the system restarts it according to its configuration parameters. However, under large-scale system access, a service may enter a state in which its process has not crashed but it can no longer provide service normally, the so-called hung (apparently dead) state. The native mechanism cannot detect or recover from this problem, so the whole container cluster falls into an abnormal failure state and its availability is reduced.
Disclosure of Invention
In view of at least one defect or improvement requirement in the prior art, the invention provides a monitoring method, a system and terminal equipment for enhancing the high availability of a container cluster, wherein monitoring processes are installed on the deployment nodes of the kubelet, docker and etcd services to periodically monitor the service states of kubelet, docker and etcd, and the high availability of the services is ensured by restarting failed services; meanwhile, alarms are raised for abnormal services to notify operation and maintenance personnel to perform any necessary manual inspection and recovery, so that large-scale service interruption is prevented and the high availability of the cluster in high-concurrency scenarios is enhanced.
To achieve the above object, according to one aspect of the present invention, there is provided a monitoring method for enhancing high availability of a container cluster, the method comprising the steps of:
S1: configuring attributes of the deployment nodes of the kubelet, docker and etcd services, and calling an installation and deployment tool to install, based on the attributes, a monitoring process adapted to the corresponding service on each deployment node;
S2: the monitoring process periodically acquires and records state information of the kubelet, docker and etcd services at a set interval;
S3: when the state information shows that a service is abnormal, restarting the kubelet, docker and etcd services.
Preferably, the monitoring method further includes the following steps:
configuring alarm thresholds for the kubelet, docker and etcd services respectively, and generating alarm information when the duration of a service abnormality of any service exceeds the corresponding alarm threshold, so as to notify operation and maintenance personnel to perform the necessary manual inspection and recovery and to prevent large-scale service interruption.
Preferably, the monitoring method further includes, before performing a restart process on the kubelet, docker, etcd service:
testing the availability of the kubelet, docker and etcd services through the monitoring process, and if the test result still shows that a service is abnormal, restarting it through the restart instruction provided by the kubelet, docker and etcd services.
Preferably, the monitoring method further includes the following steps:
periodically checking, at a set interval, whether the monitoring process is running, and if it is not running, starting the monitoring process again.
Preferably, in the monitoring method, the state information of the kubelet service includes whether the kubelet service is registered on the local machine, whether the kubelet service is running, and whether the service is available as detected through an API;
the state information of the docker service includes whether the docker service is registered on the local machine, whether the docker service is running, and whether the service is available as detected from the return value of a docker service command;
the state information of the etcd service includes whether the etcd service is registered on the local machine, whether the etcd service environment variable file exists, whether the etcd service is running, and whether the service is available as detected through an API (application program interface).
According to a second aspect of the present invention, there is also provided a monitoring system for enhancing high availability of a container cluster, comprising:
the configuration module is used for specifying the attributes of the deployment nodes of the kubelet, docker and etcd services and calling an installation and deployment tool to install, based on the attributes, a monitoring module adapted to the corresponding service on each deployment node;
the monitoring module is used for periodically acquiring and recording the state information of the kubelet, docker and etcd services at a set interval;
and the alarm module is used for obtaining the state information acquired by the monitoring module and, when the state information shows that a service is abnormal, restarting the kubelet, docker and etcd services.
Preferably, in the monitoring system, the alarm module is further configured to set alarm thresholds for the kubelet, docker and etcd services, and to generate alarm information when the duration of a service abnormality of any service exceeds the corresponding alarm threshold, so as to notify operation and maintenance personnel to perform the necessary manual inspection and recovery, thereby preventing large-scale service interruption.
Preferably, in the monitoring system, the alarm module is further configured to control the monitoring module to test the availability of the kubelet, docker, and etcd services before restarting processing, and restart processing is performed through a restart instruction provided by the kubelet, docker, and etcd services if a test result still shows that the services are abnormal.
Preferably, the monitoring system further comprises an inspection module and a visualization module;
the checking module is used for regularly checking whether the monitoring process of the monitoring module runs according to a set time period, and if the monitoring process does not run, the monitoring process is run again;
the visualization module is used for receiving the alarm information generated by the alarm module and performing visualization display.
According to a third aspect of the present invention, there is also provided a terminal device, comprising at least one processing unit, and at least one memory unit,
wherein the storage unit stores a computer program which, when executed by the processing unit, causes the processing unit to carry out the steps of the monitoring method of any one of the above.
In general, compared with the prior art, the above technical solution contemplated by the present invention can achieve the following beneficial effects:
(1) according to the monitoring method, the monitoring system and the terminal equipment for enhancing the high availability of the container cluster, the monitoring processes are respectively installed on the deployment nodes of the kubelet, docker and etcd services to monitor the service states of the kubelet, docker and etcd at regular time, and the high availability of the services is ensured by a method of restarting service faults; meanwhile, the abnormal service is alarmed to inform operation and maintenance personnel to carry out necessary manual check recovery, so that large-area service interruption is prevented, and high availability of the cluster in a high-concurrency scene is enhanced.
(2) According to the monitoring method, the monitoring system and the terminal equipment for enhancing the high availability of the container cluster, when the monitoring process detects an abnormal service, the service is not restarted immediately; instead, the detection result is verified through multiple service availability tests, and the restart is executed only if several consecutive tests all indicate that the service is abnormal. This effectively prevents a fault in the monitoring process from distorting the collected service state and causing the system to perform unnecessary restarts that waste system resources.
(3) According to the monitoring method, the monitoring system and the terminal equipment for enhancing the high availability of the container cluster, a check script is started periodically to check whether the monitoring process is running, and if it is not running, the monitoring process is started again, so that a failure of the monitoring process does not impair the normal service monitoring function.
Drawings
FIG. 1 is a flow chart of a monitoring method for enhancing high availability of a container cluster according to an embodiment of the present invention;
FIG. 2 is a logic block diagram of a monitoring system for enhancing high availability of a container cluster according to an embodiment of the present invention;
FIG. 3 is a schematic composition diagram of a specific implementation of the monitoring system according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
Fig. 1 is a flowchart of a monitoring method for enhancing high availability of a container cluster according to the present embodiment, and referring to fig. 1, the monitoring method includes the following steps:
S1: defining attributes of the deployment nodes of the kubelet, docker and etcd services, and calling an installation and deployment tool to install, based on the node attributes, a monitoring process adapted to the corresponding service on each deployment node;
Because the three components kubelet, docker and etcd must cooperate to provide the pod service, this scheme mainly monitors these three components, which cannot be managed by pods; the remaining platform components can achieve high availability through the probe functions (probes) of their pods. The kubelet and docker components are deployed on all nodes (Master and Node), while the etcd component may be placed on Master nodes or on Node nodes depending on the deployment configuration. Therefore, when Ansible is used to deploy the k8s cluster, the etcd nodes, i.e., the nodes on which the etcd service is deployed, need to be defined.
Different deployment nodes are configured with different types of services; for example, one Master node may be configured with the two components kubelet and docker, while another Master node is configured with all three components kubelet, docker and etcd. Therefore, in this embodiment, the attributes of each deployment node are defined first, and then an installation and deployment tool provided by the system is called to install, deploy and configure the corresponding monitoring process according to the attributes of each deployment node, so as to detect the running state of the services of the components on that node.
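The attribute-driven installation in step S1 can be pictured with a minimal Python sketch. The inventory layout, the node names and the function names are hypothetical illustrations and do not come from the patent; the sketch only shows how declared node attributes could determine which monitors are installed on which node.

```python
# Hypothetical service list and inventory format (illustrative only).
MONITORED_SERVICES = ("kubelet", "docker", "etcd")


def monitors_for_node(node_attrs):
    """Return the monitors to install on a node, given its declared services."""
    declared = set(node_attrs.get("services", ()))
    unknown = declared - set(MONITORED_SERVICES)
    if unknown:
        raise ValueError(f"unsupported services: {sorted(unknown)}")
    # Keep a stable order so the resulting install plan is reproducible.
    return [s for s in MONITORED_SERVICES if s in declared]


def install_plan(inventory):
    """Map each node name to the list of monitors to install on it."""
    return {name: monitors_for_node(attrs) for name, attrs in inventory.items()}


# Example mirroring the text: one Master with two components, one with three.
inventory = {
    "master-1": {"services": ["kubelet", "docker"]},
    "master-2": {"services": ["kubelet", "docker", "etcd"]},
}
plan = install_plan(inventory)
```

In a real deployment the plan would be handed to the installation and deployment tool; that step is omitted here because the patent does not describe the tool's interface.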
S2: the monitoring process regularly acquires and records state information of kubelet, docker and etcd services according to a set time period;
After the monitoring process is installed and started, it collects state information of the kubelet, docker and etcd services periodically at a set interval. The interval can be set as required; by default the monitoring process collects every 10 seconds, and when any one of the kubelet, docker and etcd components cannot provide service, the monitoring process records the state of that component as a service abnormality.
The state information of the kubelet service includes:
(1) detecting whether the kubelet service is registered on the local machine;
(2) detecting whether the kubelet service is running;
(3) detecting whether the service is available through the kubelet service health check API;
The state information of the docker service includes:
(1) detecting whether the docker service is registered on the local machine;
(2) detecting whether the docker service is running;
(3) detecting whether the service is available from the return value of a docker service command;
The state information of the etcd service includes:
(1) detecting whether the etcd service is registered on the local machine;
(2) detecting whether the etcd service environment variable file exists;
(3) detecting whether the etcd service is running;
(4) detecting whether the service is available through the etcd service health check API.
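A minimal Python sketch of the periodic collection in step S2 follows. The function name `poll_services`, the record layout and the injected `sleep` and `clock` parameters are assumptions made for illustration and testability; only the default 10 second interval and the rule that an unavailable service is recorded as abnormal come from the text.

```python
import time


def poll_services(checks, period_s=10.0, rounds=None,
                  sleep=time.sleep, clock=time.time):
    """Run every check once per period and record the results.

    checks: dict mapping a service name to a zero-argument callable that
            returns True when the service is healthy (hypothetical probes).
    rounds: number of polling rounds; None means poll until interrupted.
    """
    history = []
    done = 0
    while rounds is None or done < rounds:
        for name, check in checks.items():
            try:
                healthy = bool(check())
            except Exception:
                # A probe that raises is recorded as a service abnormality.
                healthy = False
            history.append({"time": clock(), "service": name,
                            "state": "ok" if healthy else "abnormal"})
        done += 1
        if rounds is None or done < rounds:
            sleep(period_s)
    return history


def failing_probe():
    raise RuntimeError("probe crashed")


# Two rounds with a fake sleep and clock, so the example finishes instantly.
naps = []
history = poll_services(
    {"kubelet": lambda: True, "etcd": failing_probe},
    rounds=2, sleep=naps.append, clock=lambda: 0.0,
)
```

Injecting `sleep` and `clock` is a design choice for the sketch: it lets the loop be exercised without waiting for real time to pass.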
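The per-service check lists above can be expressed as data, as in the following Python sketch. The label names, the `evaluate` function and the decision to stop at the first failing check are assumptions for illustration; the patent lists the checks but does not say how they are ordered or combined. The probes are injected callables; in practice each might wrap a system command such as `systemctl is-active` or an HTTP health request, which is likewise an assumption.

```python
# Ordered check labels per service, following the lists in the description.
CHECKS = {
    "kubelet": ("registered", "running", "health_api"),
    "docker": ("registered", "running", "command_ok"),
    "etcd": ("registered", "env_file", "running", "health_api"),
}


def evaluate(service, probes):
    """Run the ordered checks for one service and stop at the first failure.

    probes: dict mapping a check label to a zero-argument callable that
            returns True when that check passes (hypothetical probes).
    """
    for label in CHECKS[service]:
        if not probes[label]():
            return {"service": service, "state": "abnormal", "failed": label}
    return {"service": service, "state": "ok", "failed": None}


# Example: etcd is registered and running but its health API does not answer,
# which corresponds to the hung state described in the background.
probes = {
    "registered": lambda: True,
    "env_file": lambda: True,
    "running": lambda: True,
    "health_api": lambda: False,
    "command_ok": lambda: True,
}
etcd_result = evaluate("etcd", probes)
docker_result = evaluate("docker", probes)
```

Recording which check failed is not required by the method, but it gives operation and maintenance personnel a starting point when an alarm is raised.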
S3: when the state information recorded by the monitoring process shows that the service is abnormal, restarting processing is carried out on kubelet, docker and etcd services, and high availability of the service is ensured by a service failure restarting method.
As a preferable example of the present embodiment, step S3 further includes:
Alarm thresholds for the kubelet, docker and etcd services are configured in advance, and when the duration of a service abnormality of any component exceeds the corresponding alarm threshold, alarm information is generated to notify operation and maintenance personnel to perform the necessary manual inspection and recovery, so that large-scale service interruption is prevented. In this embodiment, the duration of the service abnormality is used as the alarm metric and the alarm threshold is set to 1 minute; when a monitored service remains continuously abnormal for 1 minute, alarm information is generated to notify operation and maintenance personnel to perform the necessary manual inspection and recovery.
As a preferred example of this embodiment, before the kubelet, docker and etcd services are restarted, the method further includes: testing the availability of the kubelet, docker and etcd services through the monitoring process, and if the test result still shows that a service is abnormal, restarting it through the restart instruction provided by the kubelet, docker and etcd services. In this embodiment, the monitoring process is controlled to execute 3 consecutive service availability tests, and the restart is executed only if the service is abnormal in all 3 tests.
In this embodiment, when the monitoring process detects that a service is abnormal, the service is not restarted immediately; instead, the detection result is verified through multiple service availability tests, and the restart is executed only if several consecutive tests all indicate that the service is abnormal. This effectively prevents a fault in the monitoring process from distorting the collected service state and causing the system to perform unnecessary restarts that waste system resources.
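The duration-based alarm can be sketched in Python as a small tracker that remembers when each service first became abnormal. The class name, the method signature and the choice to treat reaching the threshold as exceeding it are assumptions for illustration; the 1 minute threshold and the 10 second sampling interval are the values given in this embodiment.

```python
class AlarmTracker:
    """Track how long each service has been continuously abnormal."""

    def __init__(self, threshold_s=60.0):
        self.threshold_s = threshold_s
        self._abnormal_since = {}

    def observe(self, service, healthy, now):
        """Record one sample and return True if an alarm should be raised."""
        if healthy:
            # Recovery resets the abnormal period for this service.
            self._abnormal_since.pop(service, None)
            return False
        start = self._abnormal_since.setdefault(service, now)
        # Assumption: an abnormal period of exactly threshold_s already alarms.
        return now - start >= self.threshold_s


# docker is abnormal from t=0 s to t=60 s, sampled every 10 s.
tracker = AlarmTracker(threshold_s=60.0)
alarms = [tracker.observe("docker", False, t) for t in range(0, 70, 10)]
after_recovery = tracker.observe("docker", True, 70)
after_relapse = tracker.observe("docker", False, 80)
```

Because the tracker is keyed by service name, each of kubelet, docker and etcd could be given its own tracker instance with a different threshold, matching the "respectively configured" thresholds in the text.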
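The confirm-before-restart rule can be sketched as follows in Python. The function name and the injected `check` and `restart` callables are hypothetical; `restart` stands in for the restart instruction provided by the service, whose concrete form the patent does not specify. The default of 3 attempts is the value used in this embodiment.

```python
def confirm_and_restart(check, restart, attempts=3):
    """Re-test a service and restart it only if every re-test fails.

    check:   zero-argument callable returning True when the service is available.
    restart: zero-argument callable that restarts the service.
    Returns True if a restart was performed.
    """
    for _ in range(attempts):
        if check():
            # One passing test is enough to treat the earlier report as spurious.
            return False
    restart()
    return True


calls = []

# A service that fails once and then answers: no restart is performed.
flaky = iter([False, True])
restarted_flaky = confirm_and_restart(lambda: next(flaky),
                                      lambda: calls.append("flaky"))

# A service that fails all three tests: the restart is performed.
restarted_dead = confirm_and_restart(lambda: False,
                                     lambda: calls.append("dead"))
```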
As a preferred example of this embodiment, in order to guard against abnormalities of the monitoring process itself, the monitoring method further includes: periodically checking, at a set interval, whether the monitoring process is running, and if it is not running, starting the monitoring process again.
In this embodiment, a check script and a timed task are set up; the timed task starts the check script periodically to check whether the monitoring process is running, and if it is not running, the monitoring process is started again, so that a failure of the monitoring process does not impair the normal service monitoring function.
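The check script reduces to a single decision, sketched below in Python. The function name and the injected `is_running` and `start` callables are hypothetical; how the process is looked up and started (for example through a PID file or a service manager) is not specified in the patent, and the timed task that invokes the script periodically is left out.

```python
def ensure_monitor_running(is_running, start):
    """Start the monitoring process if it is not running; report the action."""
    if is_running():
        return "running"
    start()
    return "restarted"


# Simulated process state: the monitor is down at first, then started.
state = {"up": False}


def fake_start():
    state["up"] = True


first = ensure_monitor_running(lambda: state["up"], fake_start)
second = ensure_monitor_running(lambda: state["up"], fake_start)
```

Keeping the check script this small is deliberate in the sketch: a supervisor that is simpler than the process it supervises is less likely to fail itself.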
Because kubelet, docker and etcd are clustered, the single node failure does not affect the overall service, and therefore the cluster service function is not affected during the detection and recovery of the node service failure. Under the condition of single node failure, the cluster service function is normal; in the case of multiple node failures, the cluster service can quickly recover to normal in a short time.
The embodiment also provides a system capable of implementing the monitoring method for enhancing the high availability of the container cluster, and as shown in fig. 2, the system includes a configuration module, a monitoring module and an alarm module; wherein,
The configuration module is used for specifying the attributes of the deployment nodes of the kubelet, docker and etcd services, and calling an installation and deployment tool to install, based on the specified node attributes, a monitoring module adapted to the corresponding service on each deployment node; the monitoring module is used for running a monitoring process. The configuration module first defines the attributes of each deployment node, then calls an installation and deployment tool provided by the system to install, deploy and configure the monitoring module with the corresponding monitoring process according to the attributes of each deployment node, and the monitoring process detects the running state of the services of the components on that node.
The monitoring module is used for periodically acquiring and recording the state information of the kubelet, docker and etcd services at a set interval and reporting the state information to the alarm module; when any one of the kubelet, docker and etcd components cannot provide service, the monitoring process records the state of that component as a service abnormality.
The alarm module is used for obtaining the state information acquired by the monitoring module and, when the state information shows that a service is abnormal, restarting the kubelet, docker and etcd services.
Further preferably, the alarm module is also configured to pre-configure alarm thresholds for the kubelet, docker and etcd services, and to generate alarm information when the duration of a service abnormality of any service exceeds the corresponding alarm threshold, so as to notify operation and maintenance personnel to perform the necessary manual inspection and recovery, thereby preventing large-scale service interruption. In this embodiment, the alarm module uses the duration of the service abnormality as the alarm metric and the alarm threshold is set to 1 minute; when a monitored service remains continuously abnormal for 1 minute, alarm information is generated to notify operation and maintenance personnel to perform the necessary manual inspection and recovery.
Further preferably, the alarm module is further configured to control the monitoring module to test the availability of the kubelet, docker, and etcd services before the restart processing, and perform the restart processing through a restart instruction provided by the kubelet, docker, and etcd services if the test result still shows that the services are abnormal.
In the embodiment, the alarm module controls the monitoring module to continuously execute 3 service availability tests, and if the service is abnormal in all the 3 continuous tests, the restarting operation is executed; therefore, the situation that the collected service running state deviates due to the fault of the monitoring module, unnecessary restarting operation is executed, and system resources are wasted is effectively prevented.
As a preferred example of this embodiment, the monitoring system further includes a checking module and a visualization module; wherein,
the checking module is used for periodically running the check script at a set interval to detect whether the monitoring process of the monitoring module is running, and if it is not running, starting the monitoring process again, so that a failure of the monitoring module does not impair the normal service monitoring function.
The visualization module is used for receiving the alarm information generated by the alarm module, visually displaying the alarm information, prompting an alarm on a system interface and informing operation and maintenance personnel to check as soon as possible.
All or part of the modules of the monitoring system can be realized by software, hardware and a combination thereof, and can be embedded in a processor of a computer device or independent of the processor in a hardware form, or can be stored in a memory of the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
Fig. 3 shows a specific implementation of the monitoring system provided in this embodiment. As shown in fig. 3, the monitoring system includes Watchdog, Node exporter, Prometheus, Check Watchdog and Grafana;
The Watchdog is mainly used for periodically acquiring and recording state information of the kubelet, docker and etcd services at a set interval and reporting the state information to Prometheus;
Prometheus is mainly responsible for processing the state information collected by the Watchdog and deciding whether to generate an alarm. When the service state information of any component shows that the service is abnormal, Prometheus restarts the kubelet, docker and etcd services. Prometheus uses the duration of the service abnormality as the alarm metric and generates alarm information when the abnormal duration of the kubelet, docker or etcd service exceeds a preset alarm threshold, so as to notify operation and maintenance personnel to perform the necessary manual inspection and recovery and to prevent large-scale service interruption.
In addition, before executing the restart, Prometheus controls the Watchdog to test the availability of the kubelet, docker and etcd services, and if the test result still shows that a service is abnormal, the restart is executed through the restart instruction provided by the kubelet, docker and etcd services.
The Node exporter serves as the communication node between the Watchdog and Prometheus. It is built with the prometheus_client library and is mainly used for registering the dashboard monitoring metrics, starting the Prometheus client service, calling the Watchdog at a set interval to check the availability of each component in turn, and recording the resulting availability state in the monitoring metrics; a service is restarted if three consecutive checks fail.
Grafana is mainly used for visually displaying the alarm information generated by Prometheus and prompting alarms on the system interface.
The Check Watchdog is used for periodically running the check script to detect whether the monitoring process of the Watchdog is running, and if it is not running, starting the monitoring process again, so as to guard against failure of the Watchdog.
The present embodiment also provides a terminal device, which includes at least one processor and at least one memory, where the memory stores a computer program that, when executed by the processor, causes the processor to execute the steps of the monitoring method. The types of processor and memory are not particularly limited; for example, the processor may be a microprocessor, a digital signal processor, a programmable system on chip, or the like, and the memory may be volatile memory, non-volatile memory, a combination thereof, or the like.
The present embodiment also provides a computer-readable medium, which stores a computer program executable by a terminal device, and when the computer program runs on the terminal device, the computer program causes the terminal device to execute the steps of the monitoring method. Types of computer readable media include, but are not limited to, storage media such as SD cards, usb disks, fixed hard disks, removable hard disks, and the like.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (10)
1. A monitoring method for enhancing high availability of a container cluster, comprising the steps of:
S1: configuring attributes of the deployment nodes of the kubelet, docker and etcd services, and installing, based on the attributes, a monitoring process adapted to the corresponding service on each deployment node, so as to detect the service running state of the components of the corresponding deployment node;
S2: the monitoring process regularly acquiring and recording state information of the kubelet, docker and etcd services according to a set time period;
S3: when the state information shows that a service is abnormal, restarting the kubelet, docker and etcd services.
2. The monitoring method of claim 1, further comprising the steps of:
configuring alarm thresholds for the kubelet, docker and etcd services respectively, and generating alarm information when the abnormal duration of any service exceeds the corresponding alarm threshold.
3. The monitoring method of claim 1, further comprising, prior to restarting kubelet, docker, etcd services:
testing the availability of the kubelet, docker and etcd services through the monitoring process, and if the test result still shows that the services are abnormal, restarting the services through a restart instruction provided by the kubelet, docker and etcd services.
4. The monitoring method of claim 1, further comprising the steps of:
regularly checking, according to a set time period, whether the monitoring process is running, and if the monitoring process is not running, re-running the monitoring process.
5. The monitoring method of claim 2, wherein the state information of the kubelet service comprises whether the kubelet service is registered locally, whether the kubelet service is running, and whether the service is detected as available through an API;
the state information of the docker service comprises whether the docker service is registered locally, whether the docker service is running, and whether the service is detected as available through the return value of a docker service command;
the state information of the etcd service comprises whether the etcd service is registered locally, whether an etcd service environment variable file exists, whether the etcd service is running, and whether the service is detected as available through an API (application program interface).
6. A monitoring system for enhancing high availability of a container cluster, comprising:
the configuration module is used for specifying attributes of the deployment nodes of the kubelet, docker and etcd services, and installing, based on the attributes, a monitoring module adapted to the corresponding service on each deployment node, so as to detect the service running state of the components of the corresponding deployment node;
the monitoring module is used for regularly acquiring and recording the state information of the kubelet, docker and etcd services according to a set time period;
and the alarm module is used for obtaining the state information acquired by the monitoring module, and restarting the kubelet, docker and etcd services when the state information shows that a service is abnormal.
7. The monitoring system of claim 6, wherein the alarm module is further configured to configure alarm thresholds for kubelet, docker, etcd services, and to generate alarm information when a service anomaly duration for any service exceeds a corresponding alarm threshold.
8. The monitoring system of claim 7, wherein the alarm module is further configured to control the monitoring module to test availability of kubelet, docker, and etcd services before restarting processing, and if the test result still indicates that the services are abnormal, restart processing is performed through a restart instruction provided by the kubelet, docker, and etcd services.
9. The monitoring system of claim 7, further comprising an inspection module and a visualization module;
the inspection module is used for regularly checking, according to a set time period, whether the monitoring process of the monitoring module is running, and if the monitoring process is not running, re-running the monitoring process;
the visualization module is used for receiving the alarm information generated by the alarm module and performing visualization display.
10. A terminal device, characterized in that it comprises at least one processing unit and at least one memory unit, wherein the memory unit stores a computer program which, when executed by the processing unit, causes the processing unit to carry out the steps of the monitoring method according to any one of claims 1 to 5.
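For illustration only, the per-service state information enumerated in claim 5 could be collected along the following lines. This sketch assumes systemd-managed services (so "registered locally" is modelled as `systemctl cat` succeeding), assumes the kubelet health endpoint is on its default local port 10248, and assumes etcd exposes its `/health` endpoint over plain HTTP on the default client port 2379. The etcd environment file path is hypothetical. None of these specifics come from the patent text.

```python
import http.client
import os
import subprocess
import urllib.request
from typing import Callable, Dict, Sequence

KUBELET_HEALTHZ = "http://127.0.0.1:10248/healthz"  # kubelet default healthz port
ETCD_HEALTH = "http://127.0.0.1:2379/health"        # assumes plain-HTTP client URL
ETCD_ENV_FILE = "/etc/etcd/etcd.conf"               # hypothetical path


def cmd_ok(cmd: Sequence[str]) -> bool:
    """True when the command starts, finishes in time and exits with 0."""
    try:
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def http_ok(url: str) -> bool:
    """True when the URL answers with HTTP status 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        return False


def status_checks(service: str) -> Dict[str, Callable[[], bool]]:
    """Build the named checks for one service, following claim 5."""
    checks: Dict[str, Callable[[], bool]] = {
        "registered": lambda: cmd_ok(["systemctl", "cat", service]),
        "running": lambda: cmd_ok(["systemctl", "is-active", "--quiet", service]),
    }
    if service == "kubelet":
        checks["available"] = lambda: http_ok(KUBELET_HEALTHZ)
    elif service == "docker":
        # Availability judged from the return value of a docker command.
        checks["available"] = lambda: cmd_ok(["docker", "info"])
    elif service == "etcd":
        checks["env_file"] = lambda: os.path.isfile(ETCD_ENV_FILE)
        checks["available"] = lambda: http_ok(ETCD_HEALTH)
    return checks


def collect_status(checks: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run every check and return its result keyed by check name."""
    return {name: check() for name, check in checks.items()}
```

Calling `collect_status(status_checks("etcd"))` on a node would return one boolean per check; a service would be treated as abnormal when any of them is false.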
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910935666.9A CN110798375B (en) | 2019-09-29 | 2019-09-29 | Monitoring method, system and terminal equipment for enhancing high availability of container cluster |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110798375A CN110798375A (en) | 2020-02-14 |
CN110798375B CN110798375B (en) | 2021-10-01 |
Family
ID=69438676
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910935666.9A Active CN110798375B (en) | 2019-09-29 | 2019-09-29 | Monitoring method, system and terminal equipment for enhancing high availability of container cluster |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110798375B (en) |
Families Citing this family (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111338871A (en) * | 2020-02-27 | 2020-06-26 | 苏州浪潮智能科技有限公司 | Distributed file system Qzone high availability test method, system, equipment and storage medium |
CN111752805A (en) * | 2020-07-01 | 2020-10-09 | 浪潮云信息技术股份公司 | Cloud server resource monitoring and warning system |
CN111984366B (en) * | 2020-07-24 | 2023-01-06 | 苏州浪潮智能科技有限公司 | Method and system for containerized deployment of disaster recovery mechanism |
CN112162821B (en) * | 2020-09-25 | 2022-04-26 | 中国电力科学研究院有限公司 | Container cluster resource monitoring method, device and system |
CN112214323B (en) * | 2020-10-12 | 2022-06-14 | 苏州浪潮智能科技有限公司 | Resource recovery method and device and computer readable storage medium |
US11397632B2 (en) | 2020-10-30 | 2022-07-26 | Red Hat, Inc. | Safely recovering workloads within a finite timeframe from unhealthy cluster nodes |
CN112769922B (en) * | 2020-12-31 | 2022-07-12 | 南京视察者智能科技有限公司 | Device and method for self-starting micro service cluster |
CN112769652B (en) * | 2021-01-14 | 2022-12-16 | 苏州浪潮智能科技有限公司 | Node service monitoring method, device, equipment and medium |
CN112994935B (en) * | 2021-02-04 | 2022-06-17 | 烽火通信科技股份有限公司 | prometheus management and control method, device, equipment and storage medium |
CN112965874B (en) * | 2021-03-04 | 2023-02-28 | 浪潮云信息技术股份公司 | Configurable monitoring alarm method and system |
CN113064762B (en) * | 2021-04-09 | 2024-02-23 | 上海新炬网络信息技术股份有限公司 | Service self-recovery method based on various detection |
CN113422692A (en) * | 2021-05-28 | 2021-09-21 | 作业帮教育科技(北京)有限公司 | Method, device and storage medium for detecting and processing node faults in K8s cluster |
CN113485896A (en) * | 2021-07-22 | 2021-10-08 | 京东方科技集团股份有限公司 | Container state monitoring method, device, system and medium |
CN113590420B (en) * | 2021-07-28 | 2024-04-12 | 杭州玳数科技有限公司 | Cluster state supervision method and device |
CN113568707B (en) * | 2021-07-29 | 2024-06-25 | 中国船舶重工集团公司第七一九研究所 | Computer control method and system for ocean platform based on container technology |
US11947660B2 (en) | 2021-08-31 | 2024-04-02 | International Business Machines Corporation | Securing pods in a container orchestration environment |
CN113965459A (en) * | 2021-10-08 | 2022-01-21 | 浪潮云信息技术股份公司 | Consul-based method for monitoring host network to realize high availability of computing nodes |
US11966280B2 (en) | 2022-03-17 | 2024-04-23 | Walmart Apollo, Llc | Methods and apparatus for datacenter monitoring |
CN114860270B (en) * | 2022-04-29 | 2024-08-09 | 济南浪潮数据技术有限公司 | Monitoring management method for node to be injected with fault and related components |
CN115174644B (en) * | 2022-06-28 | 2023-09-12 | 武汉烽火技术服务有限公司 | Container cluster service start-stop control method, device, equipment and storage medium |
CN115396278A (en) * | 2022-08-11 | 2022-11-25 | 西安雷风电子科技有限公司 | System exception handling method and device |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160182330A1 (en) * | 2009-12-10 | 2016-06-23 | Royal Bank Of Canada | Coordinated processing of data by networked computing resources |
CN106130778A (en) * | 2016-07-18 | 2016-11-16 | 浪潮电子信息产业股份有限公司 | A kind of method processing clustering fault and a kind of management node |
CN108549580A (en) * | 2018-03-30 | 2018-09-18 | 平安科技(深圳)有限公司 | Methods and terminal device of the automatic deployment Kubernetes from node |
CN108737215A (en) * | 2018-05-29 | 2018-11-02 | 郑州云海信息技术有限公司 | A kind of method and apparatus of cloud data center Kubernetes clusters container health examination |
Also Published As
Publication number | Publication date |
---|---|
CN110798375A (en) | 2020-02-14 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN110798375B (en) | Monitoring method, system and terminal equipment for enhancing high availability of container cluster | |
CN109495312B (en) | Method and system for realizing high-availability cluster based on arbitration disk and double links | |
US10491671B2 (en) | Method and apparatus for switching between servers in server cluster | |
CN109144789B (en) | Method, device and system for restarting OSD | |
CN107480014B (en) | High-availability equipment switching method and device | |
US20180067795A1 (en) | Systems and methods for automatic replacement and repair of communications network devices | |
CN107660289B (en) | Automatic network control | |
CN109726046B (en) | Machine room switching method and device | |
CN106789306B (en) | Method and system for detecting, collecting and recovering software fault of communication equipment | |
WO2015169199A1 (en) | Anomaly recovery method for virtual machine in distributed environment | |
CN113726553A (en) | Node fault recovery method and device, electronic equipment and readable storage medium | |
CN110825490A (en) | Kubernetes container-based application health check method and system | |
CN105607973B (en) | Method, device and system for processing equipment fault in virtual machine system | |
CN107453932B (en) | Distributed storage system management method and device | |
CN115994044B (en) | Database fault processing method and device based on monitoring service and distributed cluster | |
US7373542B2 (en) | Automatic startup of a cluster system after occurrence of a recoverable error | |
CN112068935B (en) | Kubernetes program deployment monitoring method, kubernetes program deployment monitoring device and kubernetes program deployment monitoring equipment | |
CN114020509A (en) | Method, device and equipment for repairing work load cluster and readable storage medium | |
CN107491344B (en) | Method and device for realizing high availability of virtual machine | |
CN111966520A (en) | Database high-availability switching method, device and system | |
CN115964142A (en) | Application service management method, device and storage medium | |
CN110321261B (en) | Monitoring system and monitoring method | |
CN115328735A (en) | Fault isolation method and system based on containerized application management system | |
CN115712521A (en) | Cluster node fault processing method, system and medium | |
US9798608B2 (en) | Recovery program using diagnostic results |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||