CN112511339B - Container monitoring alarm method, system, equipment and storage medium based on multiple clusters - Google Patents

Info

Publication number
CN112511339B
Authority
CN
China
Prior art keywords
cluster
alarm
monitoring
index
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011251413.9A
Other languages
Chinese (zh)
Other versions
CN112511339A (en)
Inventor
叶奕珺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baofu Network Technology Shanghai Co ltd
Original Assignee
Baofu Network Technology Shanghai Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Baofu Network Technology Shanghai Co ltd
Priority to CN202011251413.9A
Publication of CN112511339A
Application granted
Publication of CN112511339B
Legal status: Active (current)
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application discloses a container monitoring and alarm method, system, device and storage medium based on multiple clusters. The method comprises: configuring, through a monitoring module, capture rules for the indexes of all set resources in prometheus.yml, and deploying the monitoring components of at least one cluster to be monitored, wherein the monitoring components periodically capture instantaneous index data on the running of each resource in the cluster according to the preset capture rules; configuring, through an alarm module, alarm rules for all set resources in prometheus.yml, and configuring, through the alarm management component Alertmanager, the sending of alarm information to a message notification module; and, when the instantaneous index data captured by the monitoring module for any running resource triggers an alarm rule, sending the alarm information to the message notification module through Alertmanager. The method and device can monitor the operation indexes of each node across multiple clusters and raise alarms on abnormal conditions in time.

Description

Container monitoring alarm method, system, equipment and storage medium based on multiple clusters
Technical Field
The present invention relates to a cluster technology, and in particular, to a container monitoring and warning method, system, device, and storage medium based on multiple clusters.
Background
With the popularization of container technology, more and more enterprises develop applications on microservice frameworks, deliver code as images, and deploy and run services in containers, so operation and maintenance monitoring has shifted from traditional virtual machines to containers. Currently, the mainstream container monitoring scheme combines exporters (collection) + Prometheus (pulling and storing) + Grafana (chart display) + Alertmanager (threshold alarms).
This combination of exporters, Prometheus, Grafana and Alertmanager places high technical demands on operation and maintenance personnel and is complicated to configure: they need to understand the technical details of Prometheus, PromQL query statements and the like, as well as the meanings of the various running states and indexes of the various Kubernetes (K8s for short) resources. In addition, storing indexes that have not been pared down wastes storage space, and monitoring and alarming in a multi-cluster environment requires maintaining multiple sets of configuration. The excessive configuration greatly increases the learning and usage cost for operation and maintenance personnel, and is especially unfriendly to developers who want to customize threshold alarms for their applications.
Disclosure of Invention
The present invention is directed to a container monitoring and alarming method, system, device and storage medium based on multiple clusters, so as to solve the problems set forth in the foregoing technical background.
In order to achieve the purpose, the invention adopts the following technical scheme:
the first aspect of the present application provides a container monitoring and alarming method based on multiple clusters, including:
maintaining the Prometheus configuration file prometheus.yml through a monitoring module, configuring capture rules for the indexes of all set resources in prometheus.yml, and deploying the monitoring components of at least one cluster to be monitored, wherein the monitoring components periodically capture instantaneous index data on the running of each resource in the cluster according to the preset capture rules;
maintaining the Prometheus configuration file prometheus.yml through an alarm module, configuring alarm rules for all set resources in prometheus.yml, and configuring, through the alarm management component Alertmanager, the sending of alarm information to a message notification module;
configuring the account and password of a message sending channel through the message notification module, and, by adding topics and the subscription terminals of each topic, managing the sending of different alarm information to the corresponding subscription terminals;
when the instantaneous index data captured by the monitoring module for any running resource triggers an alarm rule, the alarm information is sent to the message notification module through Alertmanager, and the message notification module sends the alarm information to the corresponding subscription terminal.
Preferably, the cluster is a K8s cluster.
Preferably, the resources include one or more of a cluster, a host, a namespace, an application, and a container.
Preferably, the index includes one or more of a CPU, a memory, a storage disk, and a network.
Preferably, the capture rule includes one or more of a capture address, a capture period, and index relabeling.
Preferably, deploying, by the monitoring module, the monitoring components of at least one cluster to be monitored includes: deploying the index capture and storage component Prometheus and the alarm management component Alertmanager on a first cluster; deploying a host index collector node-exporter and a container index collector cAdvisor on each node of each cluster to be monitored; deploying a cluster state index collector kube-state-metrics on each cluster to be monitored; and
deploying, on each cluster to be monitored, a middleware collector corresponding to each specified middleware, wherein each middleware corresponds to an independent middleware collector.
More preferably, the host index collector node-exporter and the container index collector cAdvisor collect the instantaneous index data for what is running on each node (node); the index capture and storage component Prometheus captures this data and matches it against the alarm rules configured in advance in the Prometheus configuration file prometheus.yml.
More preferably, in the Prometheus configuration file prometheus.yml, the capture addresses of the indexes to be captured include:
index access addresses of the host index collector node-exporter deployed on each node of each cluster;
index access addresses of the container index collector cAdvisor deployed on each node of each cluster;
index access addresses of the cluster state index collector kube-state-metrics deployed on each cluster; and
the index access address of each middleware collector deployed on each cluster.
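The four kinds of capture addresses listed above can be enumerated mechanically from a description of each cluster. The following is a minimal sketch in Python under stated assumptions: the `Cluster` type, the `capture_addresses` helper, the job naming scheme, the example IP addresses and the port numbers are all hypothetical illustrations and are not taken from the patent.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Assumed ports, for illustration only; a real deployment may expose others.
NODE_EXPORTER_PORT = 9100
CADVISOR_PORT = 8080
KUBE_STATE_METRICS_PORT = 8080


@dataclass
class Cluster:
    """Hypothetical description of one cluster to be monitored."""
    name: str
    node_ips: List[str]
    kube_state_metrics_host: str
    # middleware collector name -> "host:port" of that collector
    middleware_collectors: Dict[str, str] = field(default_factory=dict)


def capture_addresses(clusters: List[Cluster]) -> Dict[str, List[str]]:
    """Return {job name: [host:port, ...]} covering the four address kinds."""
    jobs: Dict[str, List[str]] = {}
    for c in clusters:
        # 1) host index collector on every node
        jobs[f"{c.name}/node-exporter"] = [f"{ip}:{NODE_EXPORTER_PORT}" for ip in c.node_ips]
        # 2) container index collector on every node
        jobs[f"{c.name}/cadvisor"] = [f"{ip}:{CADVISOR_PORT}" for ip in c.node_ips]
        # 3) one cluster state index collector per cluster
        jobs[f"{c.name}/kube-state-metrics"] = [
            f"{c.kube_state_metrics_host}:{KUBE_STATE_METRICS_PORT}"
        ]
        # 4) one address per middleware collector
        for name, address in c.middleware_collectors.items():
            jobs[f"{c.name}/middleware/{name}"] = [address]
    return jobs
```

Each resulting entry corresponds to one group of addresses that would be written into prometheus.yml; how the groups are serialized into that file is left open here.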
More preferably, when at least one second cluster needs to be added to monitoring, the first cluster records the capture address and the access token for capturing the indexes of the second cluster and adds them to the cluster deployment file yaml; after configuration is completed, the reload configuration interface of Prometheus is called so that the configuration takes effect; wherein the first cluster and the second cluster are different clusters.
Preferably, the capture rule comprises: taking cluster/host/namespace/application/container instance as the resource dimensions, pulling and storing only the indexes that users care about most, such as CPU/memory/network/storage disk, and filtering out the large number of indexes that are of no use to users.
Preferably, the method further comprises:
generating a first alarm strategy according to a strategy instruction input by a user;
updating the Prometheus configuration file prometheus.yml according to the first alarm policy, wherein the updated prometheus.yml comprises the first alarm policy; and calling the reload configuration interface of Prometheus so that the configuration takes effect.
Preferably, after the captured instantaneous index data of any running resource triggers an alarm rule, the method further comprises: the user viewing the alarm information through a UI visualization module.
Preferably, the message sending channel configured by the message notification module comprises one or more of email, short message (SMS), enterprise WeChat, voice telephone notification and QQ notification.
Preferably, the method further comprises: presetting a topic subscribed to by a user, wherein the topic comprises the alarm information the user is interested in; and, when the captured instantaneous index data of any running resource triggers an alarm rule, sending the alarm information associated with the topic through the configured message sending channel.
Preferably, the alarm information includes: cluster dimension alarm items, node dimension alarm items and container group dimension alarm items.
More preferably, the cluster dimension alarm item includes at least one of: the utilization rate of the CPU exceeds 80%, the utilization rate of the memory exceeds 80%, the local storage of all nodes of the cluster exceeds 80%, the resource utilization of a namespace exceeds 80%, and the state of a cluster container group (pod) is abnormal.
More preferably, the node dimension alarm item includes at least one of: the utilization rate of the CPU of the node (node) exceeds 80%, the memory utilization rate of the node (node) exceeds 80%, and the local storage utilization condition of the node (node) exceeds 80%.
More preferably, the container group dimension alarm item includes at least one of: the CPU utilization rate of the container group (pod) exceeds 80%, and the memory utilization rate of the container group (pod) exceeds 80%.
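The alarm items in the three dimensions above share one pattern: a utilization figure is compared against the 80% threshold, and the cluster dimension additionally checks the container group state. A minimal Python sketch of that comparison follows. The function names, the message wording, and the choice of which pod phases count as abnormal are assumptions made for illustration; the text only says that the state "is abnormal".

```python
from typing import Dict, List

THRESHOLD = 0.80  # the 80% figure used throughout the alarm items

# Assumption: which pod phases are treated as abnormal is not specified in the text.
ABNORMAL_POD_PHASES = {"Failed", "Unknown"}


def usage_alarms(dimension: str, name: str, usage: Dict[str, float]) -> List[str]:
    """Return one alarm string per index whose utilization strictly exceeds 80%.

    usage maps an index name (e.g. "cpu", "memory") to a fraction in [0, 1].
    """
    return [
        f"{dimension} {name}: {index} utilization {value:.0%} exceeds {THRESHOLD:.0%}"
        for index, value in sorted(usage.items())
        if value > THRESHOLD
    ]


def pod_state_alarms(pod_phases: Dict[str, str]) -> List[str]:
    """Return one alarm string per container group (pod) in an abnormal phase."""
    return [
        f"pod {pod}: abnormal state {phase}"
        for pod, phase in sorted(pod_phases.items())
        if phase in ABNORMAL_POD_PHASES
    ]
```

For example, `usage_alarms("node", "node-1", {"cpu": 0.85, "memory": 0.50})` yields a single alarm for the CPU, and a value of exactly 0.80 yields none because the items are phrased as "exceeds 80%".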
A second aspect of the present application provides a container monitoring and alarm system based on multiple clusters, including: a monitoring module, an alarm module and a message notification module, wherein:
the monitoring module includes:
the index capture rule maintenance unit is used for configuring capture rules for the indexes of all set resources in the Prometheus configuration file prometheus.yml;
the monitoring component deployment unit is used for deploying the monitoring components of at least one cluster to be monitored through a cluster deployment file yaml, and the monitoring components are used for periodically capturing instantaneous index data of running of each resource in the cluster according to a preset capturing rule;
the alarm module comprises:
the alarm rule maintenance unit is used for configuring alarm rules for all set resources in the Prometheus configuration file prometheus.yml;
the receiving unit is used for receiving the alarm information sent by the monitoring module and pushing the alarm information to the alarm management component Alertmanager when the monitoring module determines that the instantaneous index data captured on the cluster to be monitored triggers an alarm rule;
the sending unit is used for sending the alarm information in the alarm management component Alertmanager to the message notification module;
and the message notification module is used for sending the alarm information to the corresponding subscription terminal according to the preset account and password of the message sending channel, the topic, and the subscription terminals of the topic.
Preferably, the alarm module further comprises: the alarm rule updating unit, which is used for recording a policy instruction input by a user, generating a first alarm policy, and updating the Prometheus configuration file prometheus.yml according to the first alarm policy, wherein the updated prometheus.yml comprises the first alarm policy.
Preferably, the multi-cluster-based container monitoring and warning system further includes: and the UI visualization module is used for inquiring and/or displaying the alarm information sent by the alarm module and/or the instantaneous index data monitored by the monitoring module.
More preferably, the UI visualization module may present the information as dashboard charts.
Preferably, the message sending channel configured by the message notification module comprises one or more of email, short message (SMS), enterprise WeChat, voice telephone notification and QQ notification.
Preferably, the cluster is a K8s cluster.
Preferably, the monitoring component comprises:
the index capture and storage component Prometheus, which is deployed in the first cluster;
the alarm management component Alertmanager, which is deployed in the first cluster;
the host index collector node-exporter and the container index collector cAdvisor, which are deployed on each node (node) of each cluster to be monitored;
the cluster state index collector kube-state-metrics, which is deployed in each cluster to be monitored; and
the middleware collectors, which are deployed in each cluster to be monitored, each middleware collector corresponding to an independent middleware.
More preferably, in the Prometheus configuration file prometheus.yml, the capture addresses of the indexes to be captured include:
index access addresses of the host index collector node-exporter deployed on each node of each cluster;
index access addresses of the container index collector cAdvisor deployed on each node of each cluster;
index access addresses of the cluster state index collector kube-state-metrics deployed on each cluster; and
the index access address of each middleware collector deployed on each cluster.
Preferably, the capture rule comprises: taking cluster/host/namespace/application/container instance as the resource dimensions, pulling and storing only the indexes that users care about most, such as CPU/memory/network/storage disk, and filtering out the large number of indexes that are of no use to users.
Preferably, the alarm information includes: cluster dimension alarm items, node dimension alarm items and container group dimension alarm items.
More preferably, the cluster dimension alarm item includes at least one of: the utilization rate of a CPU exceeds 80%, the utilization rate of a memory exceeds 80%, the local storage of all nodes of the cluster exceeds 80%, the resource utilization of a namespace exceeds 80%, and the state of a cluster container group (pod) is abnormal.
More preferably, the node dimension alarm item includes at least one of: the utilization rate of a CPU of the node (node) exceeds 80%, the memory utilization rate of the node (node) exceeds 80%, and the local storage utilization condition of the node (node) exceeds 80%.
More preferably, the container group dimension alarm item includes at least one of: the CPU utilization rate of the container group (pod) exceeds 80%, and the memory utilization rate of the container group (pod) exceeds 80%.
The third aspect of the present application provides a container monitoring and warning device based on multiple clusters, including:
a memory having a computer program stored therein;
a processor for executing all computer programs in said memory for implementing the steps of said multi-cluster based container monitoring alarm method of the first aspect disclosed herein.
A fourth aspect of the present application provides a computer-readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the multi-cluster based container monitoring alarm method of the first aspect disclosed herein.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
the application discloses a container monitoring and alarming method, system, equipment and storage medium based on multiple clusters, wherein the monitoring module and the alarming module can monitor the operation indexes of each node and container of the multiple clusters and give an alarm in time for abnormal conditions, so that the reasonable adjustment and distribution of system resources are facilitated, and the overall performance of the clusters is improved;
the container monitoring and alarming system based on the Kubernetes cluster can be automatically deployed without complex configuration;
the application simplifies and optimizes the massive set of Kubernetes resource monitoring indexes;
the method and the device can customize the alarm rule and the push of the alarm information, so that operation and maintenance and developers can smoothly realize monitoring and alarm of the concerned application service on the premise of completely not knowing Prometheus and Kubernetes technologies.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application. In the drawings:
FIG. 1 is a block diagram of a multi-cluster based container monitoring and warning system according to a preferred embodiment of the present invention;
FIG. 2 is a flow chart of cluster deployment in a preferred embodiment of the present invention;
FIG. 3 is a diagram of the cluster deployment results of the preferred embodiment of the present invention;
FIG. 4 is a flow chart of a multi-cluster based container monitoring alarm method according to a preferred embodiment of the present invention;
FIG. 5 is a flowchart of a user creating alert rules in accordance with a preferred embodiment of the present invention;
FIG. 6 is a functional block diagram of a multi-cluster based container monitoring and alert system in accordance with a preferred embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a multi-cluster based container monitoring and alarm device in accordance with a preferred embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and effects of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order, it being understood that the data so used may be interchanged under appropriate circumstances. Moreover, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
A Kubernetes cluster (hereinafter referred to as a cluster) is composed of a plurality of host nodes. All the applications are managed by the cluster in a container form and distributed and deployed on the nodes through the cluster container orchestration function. The container monitoring and warning system can be deployed on a main cluster and supports monitoring of a plurality of clusters.
Fig. 1 is a block diagram of a container monitoring and alarm system based on multiple clusters according to a preferred embodiment of the present invention. As shown in fig. 1, a multi-cluster-based container monitoring and alarm system includes: monitoring module 1, alarm module 2, message notification module 3 and UI visualization module 4, wherein:
the monitoring module 1 includes:
the index capture rule maintenance unit is used for configuring capture rules for the indexes of all set resources in the Prometheus configuration file prometheus.yml;
the monitoring component deployment unit is used for deploying the monitoring components of at least one cluster to be monitored through a cluster deployment file yaml, and the monitoring components are used for periodically capturing instantaneous index data of running of each resource in the cluster according to a preset capturing rule;
the alarm module 2 includes:
the alarm rule maintenance unit is used for configuring alarm rules for all set resources in the Prometheus configuration file prometheus.yml;
the receiving unit is used for receiving the alarm information sent by the monitoring module 1 and pushing the alarm information to the alarm management component Alertmanager when the monitoring module 1 determines that the instantaneous index data captured on the cluster to be monitored triggers an alarm rule;
the sending unit is used for sending the alarm information in the alarm management component Alertmanager to the message notification module 3;
the message notification module 3 is used for sending the alarm information to the corresponding subscription terminal according to the preset account and password of the message sending channel, the topic, and the subscription terminals of the topic;
and the UI visualization module 4 is used for inquiring and/or displaying the alarm information sent by the alarm module 2 and/or the instantaneous index data monitored by the monitoring module 1.
The monitoring component described above comprises:
1) The index capture and storage component Prometheus, which is deployed in the main cluster;
2) The alarm management component Alertmanager, which is deployed in the main cluster;
3) Core collectors:
a host index collector node-exporter, which is deployed on each node (node) of each cluster to be monitored;
a container index collector cAdvisor, which is deployed on each node (node) of each cluster to be monitored;
a cluster state index collector kube-state-metrics, which is deployed in each cluster to be monitored;
4) Other collectors:
middleware collectors corresponding to various middleware, such as collectors for MySQL, MongoDB, Redis and the like, can be customized; only the cluster deployment file yaml needs to be provided under the path specified by the monitoring module. Each middleware instance is given its own independent middleware collector: for example, if the cluster has three MySQL instances, three middleware collectors need to be deployed, each responsible for one MySQL instance.
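The one-collector-per-instance rule for middleware can be illustrated with a short Python sketch. The `collector_deployments` helper, the naming scheme and the example addresses are hypothetical and serve only to show the one-to-one mapping; they are not part of the described system.

```python
from typing import Dict, List, Tuple


def collector_deployments(cluster: str, instances: List[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Plan one independent collector for each middleware instance.

    instances is a list of (middleware kind, instance address) pairs.
    """
    plans = []
    for index, (kind, address) in enumerate(instances):
        plans.append({
            "cluster": cluster,
            # Hypothetical naming scheme: kind plus a running number.
            "collector": f"{kind}-collector-{index}",
            "target": address,  # the single instance this collector is responsible for
        })
    return plans
```

With three MySQL instances as input, the function returns three plans, each naming a distinct collector bound to exactly one instance, which mirrors the MySQL example in the text.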
When a plurality of clusters need to be added into monitoring, the main cluster needs to add information such as access addresses and access tokens of other clusters so as to normally access each cluster and deploy monitoring components.
Fig. 2 is a flow chart of cluster deployment in the present application, and the deployment result is shown in fig. 3.
As shown in fig. 2, the deployment process of the cluster is:
step S01: judging whether a basic component (namely a monitoring component) is deployed, if so, executing a step S11, otherwise, executing a step S02;
step S02: generating a main cluster deployment file yaml;
step S11: judging whether a new cluster is deployed at the same time, if so, executing the step S12, otherwise, executing the step S21;
step S12: inputting the access address (capture address) and the access token of the new cluster, and executing step S13;
step S13: judging whether the network is connected, if so, executing a step S14, otherwise, executing a step S12;
step S14: judging whether a collector of a new cluster is deployed or not, if so, executing the step S15, otherwise, executing the step S21;
step S15: generating a new cluster deployment file yaml;
step S21: judging whether a new deployment file is generated, if so, executing the step S31, otherwise, ending the deployment process;
step S31: starting to run the deployment file, and ending the deployment process.
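The branching in the deployment process above can be traced with a small Python sketch. Several points are assumptions, because the text does not state them: that step S02 and step S15 continue to steps S11 and S21 respectively, that the network check of step S13 is modeled as a list of pre-determined outcomes, and that the flow proceeds to step S21 if every check fails. The function name and file names are hypothetical.

```python
from typing import List, Sequence, Tuple


def deployment_flow(
    base_deployed: bool,
    add_cluster: bool,
    network_checks: Sequence[bool] = (),
    deploy_collectors: bool = True,
) -> Tuple[List[str], List[str]]:
    """Return (visited steps, generated deployment files) for one run of the flow."""
    trace = ["S01"]
    files: List[str] = []
    if not base_deployed:
        trace.append("S02")                   # generate the main cluster deployment file
        files.append("main-cluster.yaml")     # hypothetical file name
    trace.append("S11")                       # assumed to follow S02 as well
    if add_cluster:
        for reachable in network_checks:
            trace += ["S12", "S13"]           # enter address and token, then check the network
            if reachable:
                trace.append("S14")
                if deploy_collectors:
                    trace.append("S15")       # generate the new cluster deployment file
                    files.append("new-cluster.yaml")  # hypothetical file name
                break
            # not reachable: return to S12 with the next attempt
    trace.append("S21")
    if files:
        trace.append("S31")                   # run the deployment files
    return trace, files
```

A run with the basic components already deployed and no new cluster visits only S01, S11 and S21 and ends without running anything.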
In the above, the access addresses for capturing the indexes of all resources, as recorded in prometheus.yml, include:
1) Index access addresses of the host index collector node-exporter deployed on each node of each cluster;
2) Index access addresses of the container index collector cAdvisor deployed on each node of each cluster;
3) Index access addresses of the cluster state index collector kube-state-metrics deployed on each cluster;
4) The index access address of each middleware collector deployed on each cluster.
The capture rules described above are: the indexes from the various collectors are filtered and recalculated, and, taking cluster/host/namespace/application/container instance as the resource dimensions, only the indexes that users care about most, such as CPU/memory/network/disk, are pulled and stored, so that a large number of indexes that are of no use to users are eliminated, the storage pressure is reduced, and the user's query performance is greatly improved.
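One way such filtering is commonly expressed in Prometheus is a `metric_relabel_configs` entry with the `keep` action on the `__name__` label, which retains only series whose name matches a regular expression. The Python sketch below builds such an entry as a dictionary and applies the same pattern locally to a list of index names. The regular expression, the helper names and the sample index names are assumptions chosen for illustration; the patent does not specify which indexes are kept.

```python
import re
from typing import Dict, List

# Assumed allowlist: CPU, memory, network and filesystem indexes from the
# host and container collectors. The actual selection is not given in the text.
KEEP_REGEX = r"(node|container)_(cpu|memory|network|filesystem)_.*"


def keep_rule(regex: str = KEEP_REGEX) -> Dict[str, object]:
    """Build a relabel entry that keeps only series whose name matches regex."""
    return {"source_labels": ["__name__"], "regex": regex, "action": "keep"}


def filter_indexes(names: List[str], regex: str = KEEP_REGEX) -> List[str]:
    """Apply the same pattern locally; the whole name must match."""
    pattern = re.compile(regex)
    return [name for name in names if pattern.fullmatch(name)]
```

Applied to a mixed list, `filter_indexes` drops names such as `go_gc_duration_seconds` while retaining `node_cpu_seconds_total` and `container_memory_working_set_bytes`.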
As described above, when a new cluster is added, after the main cluster records the access address and access token of the new cluster, the monitoring module adds the index access addresses and the access token for the new cluster's collectors to the configuration file and, after configuration is completed, calls the reload configuration interface of Prometheus so that the configuration takes effect.
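A minimal Python sketch of this step follows, under several assumptions: the configuration is held as a dictionary shaped like a Prometheus configuration file, the access token is supplied to the scrape job as bearer credentials, and reloading is done through the Prometheus HTTP reload endpoint `/-/reload`, which is only available when the server is started with lifecycle management enabled. The helper names, job naming scheme and example URL are hypothetical, and the sketch only builds the reload request without sending it.

```python
import urllib.request
from typing import Dict, List


def register_cluster(config: Dict, cluster_name: str, targets: List[str], access_token: str) -> Dict:
    """Add (or replace) the scrape job for a newly added cluster's collectors."""
    job = {
        "job_name": f"{cluster_name}-collectors",  # hypothetical naming scheme
        "authorization": {"type": "Bearer", "credentials": access_token},
        "static_configs": [{"targets": list(targets), "labels": {"cluster": cluster_name}}],
    }
    # Drop any earlier job with the same name so repeated registration is idempotent.
    jobs = [j for j in config.get("scrape_configs", []) if j["job_name"] != job["job_name"]]
    jobs.append(job)
    config["scrape_configs"] = jobs
    return config


def reload_request(prometheus_url: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that asks Prometheus to reload."""
    return urllib.request.Request(prometheus_url.rstrip("/") + "/-/reload", data=b"", method="POST")
```

After the updated configuration has been written out, passing the returned request to `urllib.request.urlopen` would trigger the reload.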
Fig. 4 is a flowchart of a container monitoring alarm method based on multiple clusters according to a preferred embodiment of the present invention. As shown in fig. 4, a container monitoring and alarming method based on multiple clusters includes:
Step 01: configuring, in the Prometheus configuration file prometheus.yml, the access addresses (capture addresses) for capturing the indexes of all resources and the alarm rules of all resources.
Wherein the access addresses include: the index access address of the host index collector node-exporter deployed on each node of each cluster; the index access address of the container index collector cAdvisor deployed on each node of each cluster; the index access address of the cluster state index collector kube-state-metrics deployed on each cluster; and the index access address of each middleware collector deployed on each cluster.
Step 02: deploying the monitoring component of at least one cluster to be monitored through the cluster deployment file yaml, wherein the monitoring component periodically captures instantaneous index data of each resource operation in the cluster according to a preset capture rule.
Deploying, by the monitoring module, the monitoring components of at least one cluster to be monitored comprises: deploying the index capture and storage component Prometheus and the alarm management component Alertmanager on a first cluster; deploying a host index collector node-exporter and a container index collector cAdvisor on each node (node) of each cluster to be monitored; deploying a cluster state index collector kube-state-metrics on each cluster to be monitored; and deploying, on each cluster to be monitored, a middleware collector corresponding to each specified middleware, each middleware corresponding to an independent middleware collector.
Step 03: when the instantaneous index data of any resource operation captured by the monitoring module triggers an alarm rule, the alarm information is sent to the message notification module through the Alertmanager.
The host index collector node-exporter and the container index collector cAdvisor collect the instantaneous index data for what is running on each node; the index capture and storage component Prometheus captures this data and matches it against the alarm rules configured in the Prometheus configuration file prometheus.yml; if an alarm rule is triggered, the alarm management component Alertmanager assembles the alarm information and sends it to the message notification module.
Step 04: the message notification module sends the alarm information to the corresponding subscription terminal.
The message notification module is configured with the account and password of a message sending channel and, by adding topics and the subscription terminals of each topic, manages the sending of different alarm information to the corresponding subscription terminals. The message sending channel configured by the message notification module can be email, short message (SMS), enterprise WeChat, voice telephone notification, QQ notification and the like. The message notification module presets the topics subscribed to by the user, wherein a topic comprises the alarm information the user is interested in. When the captured instantaneous index data of any running resource triggers an alarm rule, the message notification module sends the alarm information associated with the topic to the subscription terminal through the configured message sending channel.
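The topic and subscription-terminal relationship described above amounts to a small publish/subscribe table. The Python sketch below illustrates it; the `MessageNotifier` class, its method names and the example terminals are hypothetical, channel credentials are omitted, and instead of actually sending messages the sketch returns the list of deliveries that would be made.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class MessageNotifier:
    """Hypothetical in-memory model of topics and their subscription terminals."""

    def __init__(self) -> None:
        # topic -> set of (subscription terminal, message sending channel)
        self._subscribers: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)

    def subscribe(self, topic: str, terminal: str, channel: str) -> None:
        """Register a terminal for a topic on a given channel (e.g. "email")."""
        self._subscribers[topic].add((terminal, channel))

    def deliveries(self, topic: str, alarm: str) -> List[Tuple[str, str, str]]:
        """Return (terminal, channel, alarm) for every subscriber of the topic."""
        return [
            (terminal, channel, alarm)
            for terminal, channel in sorted(self._subscribers.get(topic, ()))
        ]
```

Alarm information published under a topic with no subscribers produces no deliveries, so only terminals that subscribed to the relevant topic are notified.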
In a specific application scenario, writing the configuration file by hand has a high barrier to entry. Taking a yaml file as an example, a user must thoroughly understand the attributes (such as name and deployment unit) of each container on the cluster to be monitored, as well as the meaning of the various data indexes, in order to write a correct yaml file; the operation is complex and reduces monitoring efficiency. Therefore, in the present application, a user can create an alarm rule through the UI visualization module, which presents a configuration page for the alarm rule. A policy instruction issued through the configuration page generates a first alarm policy, and the yml configuration file of Prometheus is updated according to the first alarm policy so that the updated file includes the first alarm policy. The alarm rule is then activated by means of the Prometheus configuration reload mechanism.
For example, a user may add an alarm rule through the UI visualization module that monitors all container instances (resource) under all clusters and alerts the subscription terminals subscribed to a specified topic when the memory usage rate (index) is greater than (condition) 80% (threshold). The alarm module records the alarm rule created by the user, modifies the Prometheus configuration file, and activates the alarm rule by means of the Prometheus configuration reload mechanism.
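A rule created from the UI fields (index, condition, threshold, topic) could be turned into a Prometheus-style alerting rule roughly as sketched below. The function names, the `topic` label and the metric expression `container_memory_usage_percent` are assumptions for illustration. The dictionary layout (`groups`, `rules`, `alert`, `expr`, `for`, `labels`, `annotations`) follows the Prometheus rule-file format; stock Prometheus loads such rules from files listed under `rule_files` in prometheus.yml.

```python
import copy

ALLOWED_CONDITIONS = {">", ">=", "<", "<=", "==", "!="}


def build_alert_rule(name, metric_expr, condition, threshold, topic, hold="1m"):
    """Build one alerting rule dict from the fields a user enters in the UI."""
    if condition not in ALLOWED_CONDITIONS:
        raise ValueError(f"unsupported condition: {condition}")
    return {
        "alert": name,
        "expr": f"{metric_expr} {condition} {threshold}",
        "for": hold,                   # how long the condition must hold
        "labels": {"topic": topic},    # hypothetical label used for routing
        "annotations": {"summary": f"{name}: value {condition} {threshold}"},
    }


def add_rule(rule_file, group_name, rule):
    """Return a copy of the rule-file dict with `rule` inserted or replaced."""
    updated = copy.deepcopy(rule_file)
    groups = updated.setdefault("groups", [])
    group = next((g for g in groups if g["name"] == group_name), None)
    if group is None:
        group = {"name": group_name, "rules": []}
        groups.append(group)
    # Replace any existing rule with the same alert name.
    group["rules"] = [r for r in group["rules"] if r.get("alert") != rule["alert"]]
    group["rules"].append(rule)
    return updated
```

After the resulting structure is serialized to YAML and written to disk, the reload can be triggered by an HTTP POST to the `/-/reload` endpoint, which Prometheus exposes when started with `--web.enable-lifecycle`.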
In addition, after the alarm rule is triggered by the instantaneous index data of any resource operation, the user can also check alarm information through the UI visualization module.
Specifically, a flow chart of creating the alarm rule is shown in fig. 5.
The alarm information described above includes: cluster dimension alarm items, node dimension alarm items and container group dimension alarm items.
The cluster dimension alarm items may include: CPU utilization exceeding 80%, memory utilization exceeding 80%, local storage usage of all nodes of the cluster exceeding 80%, resource usage of a namespace exceeding 80%, and an abnormal state of a container group (pod) in the cluster.
The node dimension alarm items may include: CPU utilization of the node exceeding 80%, memory utilization of the node exceeding 80%, and local storage usage of the node exceeding 80%.
The container group dimension alarm items may include: CPU utilization of the container group (pod) exceeding 80%, and memory utilization of the container group (pod) exceeding 80%.
Referring to fig. 6, the operation principle of the container monitoring and warning system of the present application is as follows:
1) The monitoring module maintains the index access address and the index capture rule of each cluster's collectors in Prometheus.
2) The alarm module maintains the alarm rule expressions in prometheus.yml; alarm rules are added and modified through the UI visualization module.
3) Prometheus loads the configuration and periodically captures the instantaneous indexes of each collector according to the index access addresses and the index capture rules. The collectors do not store data; they only expose instantaneous indexes for Prometheus to capture.
4) Prometheus periodically evaluates, according to the alarm rules, whether each alarm rule expression has reached the required index threshold.
5) When an alarm rule expression satisfies its condition, for example when the memory usage of a container instance is greater than 80%, Prometheus pushes the alert to Alertmanager.
6) Aggregation and alarm push: after the alerts are aggregated in Alertmanager, the alarm information is sent to the message notification module according to the Alertmanager configuration file.
7) The message notification module is pre-configured with the accounts and passwords of the message sending channels (short message, mailbox, enterprise WeChat and the like), and routes different alarms to different subscription terminals by adding topics and the terminals (mobile phone numbers, mailbox addresses and the like) subscribed to each topic. Once an alarm rule is triggered, the user receives a notification through the preset sending channel, the preset topic and the preset subscription terminal.
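Steps 1) and 3) above rely on an index access address and a capture rule per collector. A minimal sketch of what such an entry could look like as a Prometheus scrape job is given below. The `scrape_job` and `is_kept` functions, the job naming and the metric-name prefixes are assumptions; the configuration keys (`job_name`, `scrape_interval`, `authorization`, `static_configs`, `metric_relabel_configs` with a `keep` action) are standard Prometheus scrape configuration fields.

```python
import re

# Assumed prefixes for the CPU/memory/network/disk indexes to retain.
KEEP_PREFIXES = ("container_cpu_", "container_memory_", "container_network_", "container_fs_")


def scrape_job(cluster, targets, token, interval="30s"):
    """Build a scrape job dict for one cluster's collectors."""
    keep_regex = "(" + "|".join(prefix + ".*" for prefix in KEEP_PREFIXES) + ")"
    return {
        "job_name": f"{cluster}-cadvisor",  # hypothetical naming scheme
        "scrape_interval": interval,
        "authorization": {"type": "Bearer", "credentials": token},
        "static_configs": [{"targets": list(targets), "labels": {"cluster": cluster}}],
        # Capture rule: keep only the indexes matching the regex, drop the rest.
        "metric_relabel_configs": [
            {"source_labels": ["__name__"], "regex": keep_regex, "action": "keep"}
        ],
    }


def is_kept(job, metric_name):
    """Check whether a metric name would pass the job's keep rule."""
    regex = job["metric_relabel_configs"][0]["regex"]
    return re.fullmatch(regex, metric_name) is not None
```

Labelling every target with its cluster name is one way to keep the cluster dimension available when several clusters are scraped by a single Prometheus.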
The present application further provides a multi-cluster-based container monitoring and alarming device, which may specifically be a client on which a Kubernetes platform is deployed. As shown in fig. 7, the container monitoring and alarming device includes a memory 31 and a processor 32, where the memory 31 stores a computer program, and the processor 32 is configured to execute all the computer programs in the memory 31 so as to implement the steps of the multi-cluster container monitoring and alarming method described above.
The present application also provides a computer readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method for multi-cluster container monitoring alarm as described above.
In summary, the present application discloses a multi-cluster-based container monitoring and alarming method, system, device and storage medium. Through the monitoring module and the alarm module, the operation indexes of each node of multiple clusters can be monitored and abnormal conditions can be alarmed in time, which facilitates reasonable adjustment and allocation of system resources and improves the overall performance of the clusters. A container monitoring alarm system based on a Kubernetes cluster can be deployed automatically without complex configuration. The large number of Kubernetes-based resource monitoring indexes is simplified and optimized. Alarm rules and the pushing of alarm information can be customized, so that operation and maintenance personnel and developers can monitor and receive alarms for the application services they care about without any knowledge of Prometheus or Kubernetes.
The embodiments of the present invention have been described in detail, but the embodiments are merely examples, and the present invention is not limited to the embodiments described above. Any equivalent modifications and substitutions for the present invention are within the scope of the present invention for those skilled in the art. Accordingly, equivalent alterations and modifications are intended to be included within the scope of the present invention, without departing from the spirit and scope of the invention.

Claims (8)

1. A multi-cluster-based container monitoring alarm method, characterized in that the method is applied to a multi-cluster environment and comprises the following steps: deploying a container monitoring alarm system on a main cluster, the system supporting monitoring of a plurality of clusters and comprising a monitoring module, an alarm module and a message notification module;
maintaining, through the monitoring module, the Prometheus configuration file prometheus.yml, configuring in prometheus.yml the capture rules for the indexes of all set resources, and deploying the monitoring components of at least one cluster to be monitored, wherein the monitoring components periodically capture instantaneous index data of the running of each resource in the cluster according to the preset capture rules;
maintaining, through the alarm module, the Prometheus configuration file prometheus.yml, configuring in prometheus.yml the alarm rules of all set resources, and configuring alarm information through the alarm management component Alertmanager so that the alarm information is sent to the message notification module;
configuring, through the message notification module, the account and password of a message sending channel, and routing different alarm information to the corresponding subscription terminals by adding topics and the subscription terminals of each topic;
when the instantaneous index data of any resource operation captured by the monitoring module triggers an alarm rule, the alarm information is sent to the message notification module through the Alertmanager, and the message notification module sends the alarm information to the corresponding subscription terminal;
wherein deploying the monitoring components of at least one cluster to be monitored through the monitoring module comprises: deploying the index capture storage component Prometheus and the alarm management component Alertmanager on a first cluster; deploying a host index collector node-exporter and a container index collector cAdvisor on each node of each cluster to be monitored; deploying a cluster state index collector kube-state-metrics on each cluster to be monitored; and deploying, on each cluster to be monitored, a middleware collector corresponding to each specified middleware, each middleware corresponding to an independent middleware collector; the capture rules comprise: filtering and recalculating the indexes of the various collectors, taking cluster/host/namespace/application/container instance as the resource dimensions, pulling and storing only the CPU/memory/network/storage disk indexes that users care about most, and filtering out the indexes that are useless to users; when at least one second cluster needs to be added to monitoring, the first cluster records the capture address and the access token for capturing the indexes of the second cluster, the capture address and the access token are added to the cluster deployment file yaml, and after configuration is completed, the reload configuration interface of Prometheus is called to make the configuration take effect; the first cluster and the second cluster are different clusters, and the first cluster is the main cluster.
2. The multi-cluster-based container monitoring alarm method according to claim 1, wherein the instantaneous index data running on each node is collected by the host index collector node-exporter and the container index collector cAdvisor and gathered into the index capture storage component Prometheus, where it is matched against the alarm rules preconfigured in the yml configuration file prometheus.yml of Prometheus; and if an alarm rule is triggered, the alarm information is sent to the message notification module by the alarm management component Alertmanager.
3. The multi-cluster-based container monitoring alarm method according to claim 1, wherein in the yml configuration file prometheus.yml, the capture addresses of the indexes comprise:
the index access address of the host index collector node-exporter deployed on each node of each cluster;
the index access address of the container index collector cAdvisor deployed on each node of each cluster;
the index access address of the cluster state index collector kube-state-metrics deployed on each cluster; and
the index access address of each middleware collector deployed on each cluster.
4. The multi-cluster-based container monitoring alarm method according to claim 1, further comprising:
generating a first alarm strategy according to a strategy instruction input by a user;
updating the prometheus.yml configuration file of Prometheus according to the first alarm policy, wherein the updated prometheus.yml comprises the first alarm policy; and calling the reload configuration interface of Prometheus to make the configuration take effect.
5. The multi-cluster-based container monitoring alarm method according to claim 1, further comprising: presetting a topic subscribed to by a user, wherein the topic comprises the alarm information the user is interested in; and when the captured instantaneous index data of any resource operation triggers an alarm rule, sending the alarm information associated with the topic through a configured message sending channel.
6. A multi-cluster based container monitoring and warning system, deployed on a master cluster, supporting monitoring of multiple clusters, the system comprising: monitoring module, alarm module and message notice module, wherein:
the monitoring module comprises:
an index capture rule maintenance unit, used for configuring the capture rules of the indexes of all set resources in the yml configuration file of Prometheus;
the monitoring component deployment unit is used for deploying the monitoring components of at least one cluster to be monitored through a cluster deployment file yaml, and the monitoring components are used for periodically capturing instantaneous index data of running of each resource in the cluster according to a preset capturing rule;
the alarm module comprises:
an alarm rule maintenance unit, used for configuring the alarm rules of all set resources in the yml configuration file of Prometheus;
a receiving unit, used for receiving the alarm information sent by the monitoring module and pushing the alarm information to the alarm management component Alertmanager when the monitoring module determines that the instantaneous index data captured on the cluster to be monitored triggers an alarm rule;
a sending unit, used for sending the alarm information in the alarm management component Alertmanager to the message notification module;
the message notification module is used for sending the alarm information to the corresponding subscription terminal according to the preset account password of the message sending channel, the preset topic and the preset subscription terminal of the topic;
wherein the monitoring components comprise:
the index capture storage component Prometheus, to be deployed in a first cluster, the first cluster being the main cluster;
the alarm management component Alertmanager, to be deployed in the first cluster;
the host index collector node-exporter and the container index collector cAdvisor, to be deployed on each node of each cluster to be monitored;
the cluster state index collector kube-state-metrics, to be deployed in each cluster to be monitored; and
the middleware collectors, to be deployed in each cluster to be monitored, each middleware collector corresponding to an independent middleware;
wherein the capture rules comprise: filtering and recalculating the indexes of the various collectors, taking cluster/host/namespace/application/container instance as the resource dimensions, pulling and storing only the CPU/memory/network/storage disk indexes that users care about most, and filtering out the indexes that are useless to users.
7. A multi-cluster based container monitoring and warning device, comprising:
a memory having a computer program stored therein;
a processor for executing all computer programs in said memory to implement the steps of the multi-cluster based container monitoring alarm method according to any of claims 1 to 5.
8. A computer-readable storage medium, characterized in that the storage medium has stored thereon a computer program which, when being executed by a processor, carries out the steps of the multi-cluster based container monitoring alarm method according to any of the claims 1 to 5.
CN202011251413.9A 2020-11-09 2020-11-09 Container monitoring alarm method, system, equipment and storage medium based on multiple clusters Active CN112511339B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011251413.9A CN112511339B (en) 2020-11-09 2020-11-09 Container monitoring alarm method, system, equipment and storage medium based on multiple clusters

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011251413.9A CN112511339B (en) 2020-11-09 2020-11-09 Container monitoring alarm method, system, equipment and storage medium based on multiple clusters

Publications (2)

Publication Number Publication Date
CN112511339A CN112511339A (en) 2021-03-16
CN112511339B true CN112511339B (en) 2023-04-07

Family

ID=74957795

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011251413.9A Active CN112511339B (en) 2020-11-09 2020-11-09 Container monitoring alarm method, system, equipment and storage medium based on multiple clusters

Country Status (1)

Country Link
CN (1) CN112511339B (en)

Families Citing this family (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112671602B (en) * 2020-12-14 2023-07-04 北京金山云网络技术有限公司 Data processing method, device, system, equipment and storage medium of edge node
CN112925649B (en) * 2021-03-31 2021-09-14 中国人民解放军国防科技大学 Unified monitoring method for virtual network functions
CN113242150B (en) * 2021-06-03 2022-11-22 上海天旦网络科技发展有限公司 Calico network plug-in-based data packet capturing method and system in K8s
CN113377617B (en) * 2021-06-11 2023-06-16 重庆农村商业银行股份有限公司 Monitoring system
CN113419818B (en) * 2021-06-23 2022-06-10 北京达佳互联信息技术有限公司 Basic component deployment method, device, server and storage medium
CN113542068B (en) * 2021-07-15 2022-12-23 中国银行股份有限公司 Redis multi-instance monitoring system and method
CN113778614A (en) * 2021-08-03 2021-12-10 科大国创云网科技有限公司 Cluster abnormity monitoring and warning system and method facing enterprise service bus
CN113377626B (en) * 2021-08-11 2021-11-23 上海领健信息技术有限公司 Visual unified alarm method, device, equipment and medium based on service tree
CN113704065A (en) * 2021-08-31 2021-11-26 平安普惠企业管理有限公司 Monitoring method, device, equipment and computer storage medium
CN113791954B (en) * 2021-09-17 2023-09-22 上海道客网络科技有限公司 Container bare metal server and method and system for coping physical environment risk of container bare metal server
CN114189423A (en) * 2021-12-08 2022-03-15 兴业银行股份有限公司 Intelligent inquiry alarm system, method and medium with comprehensive compatibility and expansion
CN114253807B (en) * 2021-12-20 2023-04-07 深圳前海微众银行股份有限公司 Alarm information notification method and device
CN115150292A (en) * 2022-05-17 2022-10-04 深圳萨摩耶数字科技有限公司 Monitoring method and device for k8s cluster, electronic equipment and storage medium
CN114884838B (en) * 2022-05-20 2023-05-12 远景智能国际私人投资有限公司 Monitoring method and server of Kubernetes component
CN114926288A (en) * 2022-06-06 2022-08-19 中信建投证券股份有限公司 Intelligent strategy monitoring cloud platform and intelligent strategy monitoring method and device
CN115022196A (en) * 2022-06-14 2022-09-06 启明信息技术股份有限公司 Method and system for predicting software operation problems and giving alarm
CN117369981A (en) * 2022-06-30 2024-01-09 中兴通讯股份有限公司 Container adjusting method, device and storage medium based on monitor
CN114860510B (en) * 2022-07-08 2022-12-02 飞狐信息技术(天津)有限公司 Data monitoring method and system of micro-service system
CN114944980B (en) * 2022-07-26 2022-10-21 上海有孚智数云创数字科技有限公司 System method, apparatus, and medium for monitoring alarms
CN115473783A (en) * 2022-08-04 2022-12-13 浪潮软件集团有限公司 Prometheus-based index alarm management system and method
CN115080366B (en) * 2022-08-22 2022-11-15 深圳依时货拉拉科技有限公司 Alarm method, alarm device, computer equipment and storage medium
CN115801539A (en) * 2022-11-16 2023-03-14 浪潮云信息技术股份公司 Tenant-side container monitoring, collecting and alarming method and system under container cloud scene
CN115801541B (en) * 2022-11-18 2024-03-22 湖南长银五八消费金融股份有限公司 Method and device for alarming slow access in full-link tracking platform and computer equipment
CN115827393B (en) * 2023-02-21 2023-10-20 德特赛维技术有限公司 Server cluster monitoring and alarming system
CN116346904B (en) * 2023-05-19 2023-09-22 北京奇虎科技有限公司 Information pushing method, device, equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110941531A (en) * 2019-11-15 2020-03-31 北京浪潮数据技术有限公司 Monitoring alarm method, device and equipment for monitoring alarm management platform
CN111459763A (en) * 2020-04-03 2020-07-28 中国建设银行股份有限公司 Cross-kubernets cluster monitoring system and method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11102281B2 (en) * 2019-02-15 2021-08-24 International Business Machines Corporation Tool for managing and allocating resources in a clustered computing environment
CN110417901B (en) * 2019-07-31 2022-04-29 北京金山云网络技术有限公司 Data processing method and device and gateway server
CN110780918B (en) * 2019-10-28 2022-08-23 江苏满运软件科技有限公司 Middleware container processing method and device, electronic equipment and storage medium
CN111045901B (en) * 2019-12-11 2024-03-22 东软集团股份有限公司 Container monitoring method and device, storage medium and electronic equipment
CN111459749A (en) * 2020-03-18 2020-07-28 平安科技(深圳)有限公司 Prometous-based private cloud monitoring method and device, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
田贞朗 ; .Kubernetes基于Prometheus弹性伸缩POD的方法.计算机产品与流通.2020,(03),全文. *

Also Published As

Publication number Publication date
CN112511339A (en) 2021-03-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant