CN116719608A - Fault monitoring system and method for K8S cluster - Google Patents

Publication number
CN116719608A
Authority
CN
China
Prior art keywords
bcc
monitoring
cluster
tool
rule
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310756919.2A
Other languages
Chinese (zh)
Inventor
王洪磊
马超
聂彦超
邱春武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sina Technology China Co Ltd
Original Assignee
Sina Technology China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sina Technology China Co Ltd
Priority to CN202310756919.2A
Publication of CN116719608A
Current legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44: Arrangements for executing specific programs
    • G06F9/455: Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533: Hypervisors; Virtual machine monitors
    • G06F9/45558: Hypervisor-specific management and integration aspects
    • G06F2009/45591: Monitoring or debugging support
    • G06F2009/45595: Network integration; Enabling network access in virtual machine instances
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/02: Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]

Abstract

The embodiments of the application provide a fault monitoring system and a fault monitoring method for a K8S cluster. The system comprises: a K8S user side, configured to select a target BCC template from preset candidate BCC templates in the K8S cluster according to target monitoring requirements, and to generate a BCC tool CRD resource and a BCC rule CRD resource according to the target BCC template; a monitoring device, deployed on each working node in the K8S cluster and configured to pull the BCC tool CRD resource from the K8S cluster, parse the BCC tool CRD resource to obtain a BCC tool, and monitor the working node by running the BCC tool to obtain monitoring result data; a monitoring result collection device, deployed on each working node in the K8S cluster and configured to pull the BCC rule CRD resource from the K8S cluster, parse the BCC rule CRD resource to obtain a monitoring rule, generate Metrics of the working node according to the monitoring rule and the monitoring result data of the working node, and mark the Metrics as normal or abnormal according to the monitoring rule; and an analysis reporter, configured to receive the Metrics of each working node and analyze all received Metrics to obtain a fault analysis report.

Description

Fault monitoring system and method for K8S cluster
Technical Field
The application relates to the field of Kubernetes anomaly detection, in particular to a fault monitoring system and method for a K8S cluster.
Background
As departmental projects are migrated to containers, more and more projects need to be moved onto containerized platforms, and the number of Kubernetes clusters serving as those platforms keeps growing. When more and more projects are deployed to more and more Kubernetes clusters, the call relations among services, the running state of containers, the running state of the operating system, the running state of basic software, and so on may each cause application anomalies.
In carrying out the present application, the applicant has found that at least the following problems exist in the prior art:
the prior art can only provide anomaly detection at a specific level; it cannot perform multi-level anomaly analysis and detection covering the operating system, application calls, the container environment, the network environment, and so on, and the data it produces is lengthy, complex, and not easy to use.
Disclosure of Invention
The embodiments of the application provide a fault monitoring system and a fault monitoring method for a K8S cluster, which solve the problems that the prior art can only provide anomaly detection at a specific level, cannot perform multi-level anomaly analysis and detection covering the operating system, application calls, the container environment, the network environment, and so on, and produces data that is lengthy, complex, and not easy to use.
To achieve the above object, in a first aspect, an embodiment of the present application provides a fault monitoring system for a K8S cluster, including:
the K8S user side is used for selecting a target BCC template from preset BCC templates to be selected in the K8S cluster according to target monitoring requirements, generating a BCC tool CRD resource and a BCC rule CRD resource according to the target BCC template, and submitting the BCC tool CRD resource and the BCC rule CRD resource to the K8S cluster;
the monitoring device is deployed on each working node in the K8S cluster, is used for pulling the CRD resource of the BCC tool from the K8S cluster, analyzing the CRD resource of the BCC tool to obtain the BCC tool, monitoring the working node by running the BCC tool to obtain monitoring result data, and transmitting the monitoring result data to the monitoring result collecting device on the working node;
the monitoring result collection device is deployed on each working node in the K8S cluster, and is used for pulling the BCC rule CRD resource from the K8S cluster, analyzing the BCC rule CRD resource to obtain a monitoring rule, generating Metrics of the working node according to the monitoring rule and the monitoring result data of the working node, establishing a normal or abnormal mark for the Metrics according to the monitoring rule, and sending the marked Metrics to an analysis reporter;
and the analysis reporter is used for receiving the Metrics of each working node, analyzing all received Metrics and obtaining a fault analysis report.
In a second aspect, an embodiment of the present application provides a fault monitoring method for a K8S cluster, including:
selecting a target BCC template from preset BCC templates to be selected in the K8S cluster according to target monitoring requirements, generating a BCC tool CRD resource and a BCC rule CRD resource according to the target BCC template, and submitting the BCC tool CRD resource and the BCC rule CRD resource to the K8S cluster;
pulling the BCC tool CRD resource from the K8S cluster, analyzing the BCC tool CRD resource to obtain a BCC tool, and monitoring the working node by running the BCC tool to obtain monitoring result data on the working node;
pulling the BCC rule CRD resource from the K8S cluster, analyzing based on the BCC rule CRD resource to obtain a monitoring rule, generating a Metrics of the working node according to the monitoring rule and monitoring result data of the working node, and establishing a normal or abnormal mark for the Metrics according to the monitoring rule;
and receiving each Metrics of each working node, and analyzing all received Metrics to obtain a fault analysis report.
The technical scheme has the following beneficial effects: through the CRD function of the K8S system, BCC resource types comprising BCC tool CRD resources and BCC rule CRD resources are defined, and the monitoring device and the monitoring result collecting device are deployed in DaemonSet mode in the K8S system, so that each working node automatically runs the monitoring device and the monitoring result collecting device and can obtain the BCC tool CRD resources and the BCC rule CRD resources in the same way as K8S standard resources. The required target BCC templates are selected, alone or in combination, from the candidate BCC templates to monitor the operating system, application software, container environment and/or network environment at multiple levels, so that multi-level anomaly analysis and detection of the K8S system is realized. By obtaining the monitoring rules from the BCC rule CRD resources and applying them to the monitoring result data, the monitoring result data is analyzed and displayed automatically, ordinary development, testing, and operations staff no longer have to analyze lengthy data directly, and the difficulty of use is reduced. Ordinary technicians do not need to be familiar with the BCC tools or know how to run them on each working node in the K8S system; by selecting a suitable target BCC template according to their own requirements, they can use the fault detection function of the corresponding BCC tools in the K8S system, which lowers the technical threshold for ordinary technicians.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a block diagram of a fault monitoring system for a K8S cluster according to one embodiment of the present application;
FIG. 2 is a flowchart of a fault monitoring method for a K8S cluster according to one embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Explanation of terms:
Kubernetes (K8S): a container orchestration and management tool;
CRD (Custom Resource Definition): an interface and method provided by Kubernetes for defining custom resources;
Kubernetes native resources: resources such as Deployment, Service, and Ingress, which are used to deploy services;
Container: a container image is a lightweight, standalone, executable software package that contains everything needed to run it, such as code, runtime, system tools, system libraries, and settings; a container is the running state of a program from the container image;
Pod: the smallest service unit in Kubernetes;
BCC: BPF Compiler Collection, an eBPF-based tool set for detecting anomalies or performance of the operating system, software, network, file system, CPU, memory, etc.;
BPF: Berkeley Packet Filter, a raw interface to the data link layer on Unix-like systems that provides sending and receiving of raw link-layer packets;
eBPF: Extended Berkeley Packet Filter, a general-purpose execution engine extended from BPF, used to develop performance analysis tools, software-defined networks, etc.
The inventors have found that, in prior-art solutions, some network components support analyzing the network relationships between containers, and some standalone tools provide multi-level but uncorrelated data analysis and detection. These prior-art schemes have the following drawbacks: they can only analyze the network relationships between containers and detect whether containers can communicate, but cannot analyze the details of a specific communication or locate the problem; they only provide anomaly detection at a specific level, and their data is lengthy, complex, and not easy to use; they are very unfriendly to non-professional users and have a high learning cost; and they do not support analysis and detection of anomalies across the multiple levels of a container environment (operating system, application calls, container environment, network environment, etc.). In view of these drawbacks, the following embodiments address these problems by: supporting fully automatic multi-level anomaly analysis and detection of the container environment; supporting template configuration, so that a user obtains the detection function by configuring parameters and selecting templates, which is simple, easy to use, and requires no professional knowledge; and supporting large-screen monitoring, so that the running state of the system is displayed on a single screen.
In a first aspect, as shown in fig. 1, an embodiment of the present application provides a fault monitoring system for a K8S cluster, including:
the K8S client 10 is configured to select a target BCC template from among the preset BCC templates to be selected in the K8S cluster 16 according to a target monitoring requirement, generate a BCC tool CRD resource and a BCC rule CRD resource according to the target BCC template, and submit the BCC tool CRD resource and the BCC rule CRD resource to the K8S cluster;
the monitoring device 11 is deployed on each working node in the K8S cluster 16, and is configured to pull the BCC tool CRD resource from the K8S cluster 16, obtain a BCC tool based on the BCC tool CRD resource analysis, monitor the working node by running the BCC tool to obtain monitoring result data, and transmit the monitoring result data to the monitoring result collecting device 12 on the working node;
the monitoring result collection device 12 is deployed on each working node in the K8S cluster 16, and is configured to pull the BCC rule CRD resource from the K8S cluster 16, parse the BCC rule CRD resource to obtain a monitoring rule, generate Metrics of the working node according to the monitoring rule and the monitoring result data of the working node, establish a normal or abnormal mark for the Metrics according to the monitoring rule, and send the marked Metrics to the analysis reporter 13;
an analysis reporter 13, configured to receive the respective Metrics of each working node, and analyze all the received Metrics to obtain a fault analysis report.
In some embodiments, the monitoring device 11 and the monitoring result collecting device 12 are deployed on the working nodes in the K8S cluster as a DaemonSet of the K8S system; the K8S client 10 can be deployed on any node in the K8S cluster; the analysis reporter 13 is unique in the cluster and can be deployed on any node in the K8S cluster. The K8S client 10 may select one candidate BCC template, or a combination of several, as the target BCC template in order to monitor one or more monitoring objects. Each candidate BCC template contains a BCC tool that fulfils a specific monitoring requirement, together with the monitoring rule corresponding to that requirement and that BCC tool, and applies to monitoring one or more monitoring objects; the monitoring objects include, but are not limited to, an operating system, application software, a network, a file system, a CPU, and memory. The BCC tool CRD resource and the BCC rule CRD resource are obtained from the target BCC template and submitted to the K8S cluster, where they can be stored in the Etcd key-value storage system; working nodes can then pull the required BCC tool CRD resources and BCC rule CRD resources from the Etcd key-value storage system just as they would K8S standard resources. Each BCC tool performs a specific monitoring function, such as monitoring anomalies or performance of an operating system, application software, a network, a file system, a CPU, or memory, and yields monitoring result data; the monitoring rule is used to perform constraint checks on the monitoring result data returned by the BCC tool in order to find out whether an abnormality exists. The monitoring result data obtained on each working node and the corresponding monitoring rules are packaged into a Metrics format so that they can be collected by a Metrics collector in K8S, and normal or abnormal marks are established on the Metrics according to the monitoring rules. Finally, the Metrics data of all nodes is delivered to the analysis reporter 13 to generate a fault analysis report: the analysis reporter 13 receives the Metrics data of all nodes and assigns a priority to each Metrics according to its abnormal or normal mark, where a higher priority indicates a greater likelihood of abnormality, and Metrics of the operating system layer and of the K8S cluster software are analyzed with higher priority. The analysis reporter sends the analysis result, namely the fault analysis report, to the K8S cluster through the kubelet client on the node where the analysis reporter runs, and the report can be further sent by the K8S cluster to a preset receiving end or display end.
The embodiment of the application has the following technical effects: through the CRD function of the K8S system, BCC resource types comprising BCC tool CRD resources and BCC rule CRD resources are defined, and the monitoring device and the monitoring result collecting device are deployed in DaemonSet mode in the K8S system, so that each working node automatically runs the monitoring device and the monitoring result collecting device and can obtain the BCC tool CRD resources and the BCC rule CRD resources in the same way as K8S standard resources. The required target BCC templates are selected, alone or in combination, from the candidate BCC templates to monitor the operating system, application software, container environment and/or network environment at multiple levels, so that multi-level anomaly analysis and detection of the K8S system is realized. By obtaining the monitoring rules from the BCC rule CRD resources and applying them to the monitoring result data, the monitoring result data is analyzed and displayed automatically, ordinary development, testing, and operations staff no longer have to analyze lengthy data directly, and the difficulty of use is reduced. Ordinary technicians do not need to be familiar with the BCC tools or know how to run them on each working node in the K8S system; by selecting a suitable target BCC template according to their own requirements, they can use the fault detection function of the corresponding BCC tools in the K8S system, which lowers the technical threshold for ordinary technicians.
Further, the system further comprises: a K8S control terminal 14;
the K8S control terminal 14 is configured to create at least one corresponding candidate BCC template according to received information about the BCC tool and the monitoring rule corresponding to a specific monitoring requirement, and to submit the created candidate BCC template or templates to the K8S cluster.
In some embodiments, at least one candidate BCC template is created in advance according to specific monitoring needs by a technician familiar with BCC tools and K8S systems, each candidate BCC template recording information of BCC tools and monitoring rules for monitoring one or several monitoring objects; the monitoring objects include, but are not limited to, an operating system, application software, a network, a file system, a cpu, and a memory; for a technician who is not familiar with the BCC tool, one or more candidate BCC templates may be selected from among the pre-created candidate BCC templates directly according to the target monitoring requirements to configure the working nodes in the K8S, thereby enabling the use of the BCC tool in the K8S for performance and fault monitoring.
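The template mechanism described above can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the template contents, the API group `bcc.example.com`, the kinds `BCCTool` and `BCCRule`, and every field name are assumptions made for illustration. The sketch shows an ordinary user selecting one candidate template, overriding only the parameters the template author chose to expose, and obtaining the two custom resources that would then be submitted to the cluster.

```python
import copy

# Hypothetical candidate BCC template prepared by a BCC-savvy administrator.
# All keys and values are illustrative; the patent does not define a schema.
CANDIDATE_TEMPLATES = {
    "tcp-retransmit": {
        "tool": {"program": "tcpretrans", "interval_seconds": 10},
        "rule": {"metric": "tcp_retransmits", "threshold": 100},
        # Only these parameters may be changed by an ordinary user.
        "configurable": ["interval_seconds", "threshold"],
    },
}

API_VERSION = "bcc.example.com/v1"  # assumed API group, not from the patent


def build_resources(template_name, namespace, params=None):
    """Return (tool_resource, rule_resource) for one selected template."""
    template = copy.deepcopy(CANDIDATE_TEMPLATES[template_name])
    for key, value in (params or {}).items():
        if key not in template["configurable"]:
            raise ValueError(f"parameter {key!r} is not exposed by the template")
        # A parameter belongs either to the tool part or to the rule part.
        section = "tool" if key in template["tool"] else "rule"
        template[section][key] = value
    meta = {"name": template_name, "namespace": namespace}
    tool = {"apiVersion": API_VERSION, "kind": "BCCTool",
            "metadata": dict(meta), "spec": template["tool"]}
    rule = {"apiVersion": API_VERSION, "kind": "BCCRule",
            "metadata": dict(meta), "spec": template["rule"]}
    return tool, rule
```

A call such as `build_resources("tcp-retransmit", "default", {"threshold": 50})` yields two dictionaries that could be serialized and submitted to the cluster. Rejecting parameters that the template does not expose mirrors the idea that the template hides BCC details from ordinary users.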
Further, the monitoring device 11 includes:
the BCC client 111 is configured to pull the BCC tool CRD resource from the K8S cluster, parse the BCC tool CRD resource to obtain a BCC tool corresponding to the target monitoring requirement, and submit the obtained BCC tool to the BCC tool executing device 112 on the working node; and, receiving the monitoring result data from the BCC tool executing device 112 on the working node, and transmitting the received monitoring result data to the monitoring result collecting device 12 on the working node;
the BCC tool executing device 112 is configured to receive a BCC tool from the BCC client 111 on the working node, monitor the working node by running the BCC tool in a daemon manner, collect monitoring result data, and transmit the monitoring result data to the BCC client 111 on the working node.
In some embodiments, the functions of the monitoring device 11 may be deployed as a BCC client 111 and a BCC tool executing device 112, respectively, the BCC client 111 being responsible for acquiring BCC tools from the K8S cluster and for acquiring the resulting monitoring result data of the BCC tool execution; the BCC tool executing means 112 are arranged to run the BCC tool in a daemon manner to ensure that the BCC tool is able to constantly and stably monitor the operation of the working nodes.
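As a rough illustration of running a tool "in a daemon manner", the following Python sketch supervises a child process, forwards each line of its output to a callback, and restarts the process if it exits with an error. The function name `run_tool`, the restart policy, and the line-oriented output format are assumptions for illustration; the patent does not specify how the BCC tool executing device supervises the tool.

```python
import subprocess
import sys
import time


def run_tool(command, on_line, max_restarts=3, backoff_seconds=0.1):
    """Run a tool, pass each stdout line to on_line, restart it on failure.

    Returns the number of restarts that were performed.
    """
    restarts = 0
    while True:
        # The context manager closes the pipe and waits for the process.
        with subprocess.Popen(command, stdout=subprocess.PIPE, text=True) as proc:
            for line in proc.stdout:
                on_line(line.rstrip("\n"))
        if proc.returncode == 0 or restarts >= max_restarts:
            return restarts
        restarts += 1
        time.sleep(backoff_seconds)


if __name__ == "__main__":
    # Stand-in for a real BCC tool: a short Python one-liner that prints a sample.
    collected = []
    run_tool([sys.executable, "-c", "print('sample 1')"], collected.append)
    print(collected)
```

In this sketch the callback plays the role of the BCC client 111 receiving monitoring result data; in a real deployment the command would be an actual BCC program rather than the placeholder one-liner.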
Further, the monitoring result collecting device 12 includes:
the monitoring rule obtaining device 121 is configured to pull the BCC rule CRD resource from the K8S cluster by using a Kubelet client deployed on the working node, parse the BCC rule CRD resource to obtain a monitoring rule, and transmit the parsed monitoring rule to the Metrics data generating device 122 on the working node;
the Metrics data generating device 122 is configured to receive the monitoring rule from the monitoring rule acquiring device 121 on the working node and receive the monitoring result data from the monitoring device on the working node, and package the received monitoring rule and monitoring result data into Metrics data;
the Metrics data collector 123 is configured to collect Metrics data on a working node, determine, according to a monitoring rule encapsulated in the Metrics data, whether the monitoring result data encapsulated in the Metrics triggers a threshold defined by the monitoring rule, and whether a change value of the monitoring result data obtained in a current monitoring period relative to the monitoring result data before a fixed period exceeds a preset threshold, if it is determined that the monitoring result data encapsulated in the Metrics triggers the threshold defined by the monitoring rule or the change value exceeds the preset threshold, mark the Metrics data as abnormal, otherwise mark the Metrics data as normal, and transmit all the Metrics data to the analysis reporter 13.
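The marking logic of the Metrics data collector can be sketched in Python as follows. The class and field names are hypothetical, and the sketch assumes that "triggering" a threshold means exceeding it and that the change value is the absolute difference from the value one fixed period earlier; the source does not state either detail.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    metric: str
    threshold: float   # absolute limit defined by the monitoring rule
    max_change: float  # allowed change versus the value one fixed period earlier


@dataclass
class Metric:
    node: str
    rule: Rule         # the rule is packaged together with the result data
    value: float
    status: str = "unknown"


def mark(metric: Metric, previous_value: Optional[float]) -> Metric:
    """Mark a Metric abnormal if either check fails, otherwise normal."""
    over_threshold = metric.value > metric.rule.threshold
    changed_too_much = (
        previous_value is not None
        and abs(metric.value - previous_value) > metric.rule.max_change
    )
    metric.status = "abnormal" if (over_threshold or changed_too_much) else "normal"
    return metric
```

When no earlier sample exists (`previous_value` is `None`), only the absolute threshold is checked, which is one possible way to handle the first monitoring period.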
In some embodiments, the monitoring rule acquiring device 121 pulls the BCC rule CRD resource from the K8S cluster through the Kubelet client program on the working node, acquires the monitoring rule, provides the monitoring rule to the Metrics data generating device 122, the Metrics data generating device 122 collects the monitoring rule and the monitoring result data, encapsulates the monitoring rule and the monitoring result data into Metrics data according to the protocol format of the Metrics data, and sends the obtained Metrics data to the analysis reporter 13, and the analysis reporter 13 parses the monitoring rule and the corresponding monitoring result data in each Metrics data from the Metrics data, and determines whether the monitoring result data has an abnormality according to the monitoring rule, so as to generate the fault analysis report. The analysis reporter 13 can obtain the Metrics data of all the working nodes in the K8S cluster, so the analysis reporter 13 can combine all the Metrics data for analysis to obtain the fault and performance data of each working node and the fault and performance data of the whole cluster, thereby realizing multi-level fault analysis.
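A minimal sketch of how an analysis reporter might combine the Metrics of all working nodes into a per-node and cluster-level view is given below. The report layout and the dictionary keys (`node`, `name`, `status`) are assumptions for illustration only; the patent does not define the report format.

```python
from collections import defaultdict


def build_report(metrics):
    """Summarize marked Metrics per node and for the whole cluster.

    metrics: iterable of dicts with the keys "node", "name", and "status".
    """
    metrics = list(metrics)
    abnormal = defaultdict(list)
    for m in metrics:
        if m["status"] == "abnormal":
            abnormal[m["node"]].append(m["name"])

    nodes = {}
    for node in sorted({m["node"] for m in metrics}):
        names = sorted(abnormal.get(node, []))
        nodes[node] = {
            "status": "abnormal" if names else "normal",
            "abnormal_metrics": names,
        }

    abnormal_nodes = [n for n, info in nodes.items() if info["status"] == "abnormal"]
    return {
        "nodes": nodes,
        "cluster": {
            "status": "abnormal" if abnormal_nodes else "normal",
            "abnormal_nodes": abnormal_nodes,
        },
    }
```

Because the reporter sees the Metrics of every node, the same pass yields both the per-node findings and the cluster-wide status.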
Further, the system further comprises:
the display dashboard 15 is configured to receive the fault analysis report from the analysis reporter 13 and display the fault analysis report in a visual manner.
Through visual display of the fault analysis report, technicians can more intuitively and rapidly know the working state of the cluster.
In a second aspect, as shown in fig. 2, an embodiment of the present application provides a fault monitoring method for a K8S cluster, including:
step S100, selecting a target BCC template from preset candidate BCC templates in a K8S cluster according to target monitoring requirements, generating a BCC tool CRD resource and a BCC rule CRD resource according to the target BCC template, and submitting the BCC tool CRD resource and the BCC rule CRD resource to the K8S cluster;
step S101, pulling the CRD resource of the BCC tool from the K8S cluster, analyzing the CRD resource of the BCC tool to obtain the BCC tool, and monitoring the working node by running the BCC tool to obtain monitoring result data on the working node;
step S102, pulling the BCC rule CRD resource from the K8S cluster, analyzing based on the BCC rule CRD resource to obtain a monitoring rule, generating a Metrics of the working node according to the monitoring rule and the monitoring result data of the working node, and establishing a normal or abnormal mark for the Metrics according to the monitoring rule;
step S103, receiving each Metrics of each working node, and analyzing all received Metrics to obtain a fault analysis report.
Further, the method further comprises:
and creating at least one corresponding BCC template to be selected according to the received information of the BCC tool and the monitoring rule corresponding to the specific monitoring requirement, and submitting the BCC template to the K8S cluster.
Further, the pulling the BCC tool CRD resource from the K8S cluster, obtaining a BCC tool based on analysis of the BCC tool CRD resource, and monitoring a working node by running the BCC tool to obtain monitoring result data on the working node, including:
pulling the BCC tool CRD resource from the K8S cluster, and analyzing to obtain a BCC tool corresponding to the target monitoring requirement on the working node based on the BCC tool CRD resource;
and on the working node, monitoring the working node by running the BCC tool in a daemon mode, and collecting monitoring result data.
Further, the pulling the BCC rule CRD resource from the K8S cluster includes:
pulling the BCC rule CRD resource from the K8S cluster through a Kubelet client program deployed on the working node;
the establishing normal or abnormal mark for the Metrics according to the monitoring rule comprises the following steps:
and collecting the Metrics data on the working node, judging whether the monitoring result data packaged in the Metrics triggers the threshold defined by the monitoring rule according to the monitoring rule packaged in the Metrics data, judging whether the change value of the monitoring result data obtained in the current monitoring period relative to the monitoring result data before the fixed period exceeds a preset threshold, and marking the Metrics data as abnormal if judging that the monitoring result data packaged in the Metrics triggers the threshold defined by the monitoring rule or the change value exceeds the preset threshold, otherwise marking the Metrics data as normal.
Further, the method further comprises:
and displaying the fault analysis report in a visual mode.
The method embodiments of the present application correspond one-to-one to the foregoing system embodiments and can be understood with reference to them; details are not repeated here.
The embodiment of the application has the following technical effects: through the CRD function of the K8S system, BCC resource types comprising BCC tool CRD resources and BCC rule CRD resources are defined, and the monitoring device and the monitoring result collecting device are deployed in DaemonSet mode in the K8S system, so that each working node automatically runs the monitoring device and the monitoring result collecting device and can obtain the BCC tool CRD resources and the BCC rule CRD resources in the same way as K8S standard resources. The required target BCC templates are selected, alone or in combination, from the candidate BCC templates to monitor the operating system, application software, container environment and/or network environment at multiple levels, so that multi-level anomaly analysis and detection of the K8S system is realized. By obtaining the monitoring rules from the BCC rule CRD resources and applying them to the monitoring result data, the monitoring result data is analyzed and displayed automatically, ordinary development, testing, and operations staff no longer have to analyze lengthy data directly, and the difficulty of use is reduced. Ordinary technicians do not need to be familiar with the BCC tools or know how to run them on each working node in the K8S system; by selecting a suitable target BCC template according to their own requirements, they can use the fault detection function of the corresponding BCC tools in the K8S system, which lowers the technical threshold for ordinary technicians.
The foregoing technical solutions of the embodiments of the present application are described in detail below with reference to specific application examples; for implementation details not described here, reference may be made to the foregoing related description.
The system consists of several components, each responsible for part of the functionality. The system encapsulates the BCC tools, provides candidate BCC templates after encapsulation, deploys them to the K8S cluster, and starts on every node of the cluster, which makes it more convenient to distribute the BCC tools to all computing nodes.
The system comprises several parts and relationships described below:
1. K8S control terminal
As shown in fig. 1, an administrator of the K8S cluster creates templates of the BCC tools, i.e. the candidate BCC templates, for ordinary users. When creating a candidate BCC template, the administrator encapsulates many details of the BCC tool and exposes only the parameters that past experience shows are used frequently, so that ordinary users can configure them. The created candidate BCC templates are submitted to the K8S cluster, specifically to the etcd key-value store.
2. K8S user
A K8S user deploys an application in the K8S cluster. To observe the running state of the application on K8S, or to investigate when the application behaves abnormally, the user selects a candidate BCC template and configures its BCC parameters to obtain the target BCC template, which completes the observability configuration of the service. A BCC tool CRD resource and a BCC rule CRD resource are generated from the target BCC template and submitted to K8S. The BCC tool CRD resource is a K8S resource extension implemented by the system; it includes attributes such as the BCC tool, its name, its owner, how it is run, how it returns data, and on which nodes it runs. The BCC rule CRD resource is also a K8S resource extension implemented by the system; it defines constraints on the results returned by the BCC tool, such as which CPU data to obtain, which memory data to obtain, which application-layer protocols to parse from network packets, which operating system data to collect, which BCC programs to disable, the Metrics thresholds, and other attributes.
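As an illustration only, the two resources might carry fields like the following. The text does not publish a schema, so the API group `bcc.example.com`, the kinds `BccTool` and `BccRule`, and every field name below are hypothetical; `tcpconnlat` is used merely as an example of a tool from the BCC tool set.

```python
# Hypothetical shapes for the two custom resources. All group, kind, and field
# names are illustrative assumptions, not the schema of the described system.
bcc_tool_resource = {
    "apiVersion": "bcc.example.com/v1",      # assumed API group and version
    "kind": "BccTool",                        # assumed kind name
    "metadata": {"name": "tcp-connect-latency"},
    "spec": {
        "tool": "tcpconnlat",                 # which BCC tool to run
        "owner": "team-a",                    # who owns this configuration
        "args": [],                           # how the tool is run
        "output": "stdout-lines",             # how the tool returns data
        "nodes": ["worker-1", "worker-2"],    # on which nodes it runs
    },
}

bcc_rule_resource = {
    "apiVersion": "bcc.example.com/v1",
    "kind": "BccRule",
    "metadata": {"name": "tcp-connect-latency-rule"},
    "spec": {
        "collect": ["connect_latency_ms"],            # which result fields to keep
        "protocols": ["http", "dns"],                 # application-layer protocols to parse
        "disabledTools": [],                          # BCC programs that must not run
        "thresholds": {"connect_latency_ms": 100.0},  # Metrics thresholds
    },
}


def missing_fields(resource: dict, required: list) -> list:
    """Return the required spec fields that the resource does not define."""
    return [name for name in required if name not in resource.get("spec", {})]


tool_missing = missing_fields(bcc_tool_resource, ["tool", "owner", "output", "nodes"])
rule_missing = missing_fields(bcc_rule_resource, ["collect", "thresholds"])
```

A check such as `missing_fields` is one way the user side could reject an incomplete target BCC template before the resources are submitted to the cluster.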
3. BCC client
The BCC client program is deployed to all nodes in the cluster using a K8S DaemonSet. On each node, the BCC client pulls the BCC tool CRD resources belonging to that working node from the K8S cluster and parses them into BCC tools. The BCC client program improves on the original way of using BCC, in which a Python program from the BCC tool set has to be started on its own from the command line, so that the output of the program cannot be collected.
With the improved start-up mode, the BCC client hands the BCC tools over to a BCC daemon (equivalent to the BCC tool executing device), which takes over all the BCC tools and their output, so that the monitoring result data output by the BCC tools can be obtained automatically.
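The selection of resources "belonging to the present working node" can be sketched as follows. This is a minimal sketch under assumptions: the resource layout (a `spec.nodes` list, `spec.tool`, `spec.args`) is hypothetical, and treating an empty node list as "all nodes" is a design choice of this example, not something the text specifies.

```python
from typing import Dict, List


def tools_for_node(resources: List[Dict], node_name: str) -> List[Dict]:
    """Pick the BCC tool resources that apply to one node and turn each
    into a simple description (name plus command line) for the daemon."""
    selected = []
    for resource in resources:
        spec = resource["spec"]
        nodes = spec.get("nodes", [])
        # Assumption: an empty node list means "run on every node".
        if nodes and node_name not in nodes:
            continue
        selected.append({
            "name": resource["metadata"]["name"],
            "command": [spec["tool"], *spec.get("args", [])],
        })
    return selected


# Hypothetical resources, as they might look after being pulled from the cluster.
pulled = [
    {"metadata": {"name": "latency"}, "spec": {"tool": "tcpconnlat", "nodes": ["worker-1"]}},
    {"metadata": {"name": "exec-trace"}, "spec": {"tool": "execsnoop", "nodes": ["worker-2"]}},
    {"metadata": {"name": "everywhere"}, "spec": {"tool": "biolatency"}},
]
worker1_tools = tools_for_node(pulled, "worker-1")
```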
4. BCC daemon (i.e. the BCC tool execution device in fig. 1)
The BCC client pushes the BCC tools it has received to the BCC daemon, which is responsible for running and managing them. The BCC daemon acquires monitoring result data from the running BCC tools and pushes it to the BCC client.
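One way a daemon can take over both a tool and its output is to start the tool as a child process and read its standard output line by line. The sketch below assumes this approach; the function name `run_tool` is hypothetical, and a short Python one-liner stands in for a real BCC tool so that the example runs without eBPF privileges.

```python
import subprocess
import sys
from typing import Iterator, List


def run_tool(command: List[str]) -> Iterator[str]:
    """Start a tool as a child process and yield each line it prints."""
    with subprocess.Popen(command, stdout=subprocess.PIPE, text=True) as proc:
        assert proc.stdout is not None
        for line in proc.stdout:
            yield line.rstrip("\n")


# Stand-in for a BCC tool: prints two result lines and exits.
demo_command = [
    sys.executable,
    "-c",
    "print('pid=1 lat_ms=3.2'); print('pid=2 lat_ms=150.0')",
]
collected = list(run_tool(demo_command))
```

Because the lines are yielded as they arrive, the same loop also works for long-running tools, where the daemon would forward each line to the BCC client instead of building a list.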
5. Kubelet monitoring extension (i.e. the Metrics data generation device 122 in fig. 1)
This component is an extension of kubelet. It receives the monitoring result data of the BCC tools pushed by the BCC client, and filters and analyzes that data according to the monitoring rules. It then formats the data, labels it, identifies its source, and packages the monitoring rules and the monitoring result data into Metrics data.
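The filtering, labeling, and packaging steps can be sketched as below. The `Metric` record, its field names, and the rule layout (`collect`, `thresholds`) are assumptions made for this example only; the text does not define the Metrics data format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Metric:
    """One packaged sample: the measured value plus the rule data it needs."""
    name: str
    value: float
    node: str                          # which working node produced it
    source: str                        # which BCC tool produced it
    threshold: Optional[float] = None  # copied from the monitoring rule
    labels: Dict[str, str] = field(default_factory=dict)


def package_metrics(results: Dict[str, float], rule: Dict, node: str, source: str) -> List[Metric]:
    """Keep only the fields the rule asks for and attach rule data to each."""
    metrics = []
    for name, value in results.items():
        if name not in rule["collect"]:
            continue  # filtered out by the monitoring rule
        metrics.append(Metric(
            name=name,
            value=float(value),
            node=node,
            source=source,
            threshold=rule["thresholds"].get(name),
            labels={"layer": rule.get("layer", "unknown")},
        ))
    return metrics


rule = {"collect": ["connect_latency_ms"], "thresholds": {"connect_latency_ms": 100.0}, "layer": "network"}
packaged = package_metrics({"connect_latency_ms": 150, "debug_counter": 7}, rule, "worker-1", "tcpconnlat")
```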
6. Kubelet (i.e. the monitoring rule acquisition device in fig. 1)
This component is the K8S client program deployed on each node. It is responsible for functions such as tracking the running state of the current node and executing K8S instructions, and it supports custom extensions.
Kubelet pulls the BCC rule CRD resource from K8S and parses out the corresponding monitoring rules. The parsed monitoring rules are sent to the kubelet monitoring extension component.
7. Metrics data collector
The Metrics data collector receives the Metrics data, checks its format, and discards malformed Metrics data. It attaches the threshold from the monitoring rule to the Metrics data and judges whether the Metrics data triggers the threshold. It also compares the current value with the value from a fixed period earlier and checks whether the change exceeds a preset threshold. Data that triggers either threshold is given a red mark to indicate an abnormality; otherwise it is given a green mark to indicate that it is normal.
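The two checks can be written as a single small function. This is a sketch under assumptions: "triggering" a threshold is taken to mean that the value is strictly greater than it, and the change is measured as an absolute difference; the text does not state either detail, and the function and parameter names are hypothetical.

```python
from typing import Optional


def mark_metric(value: float, threshold: float,
                previous: Optional[float], max_change: float) -> str:
    """Return "abnormal" (red) or "normal" (green) for one sample.

    value      -- result from the current monitoring period
    threshold  -- limit defined by the monitoring rule
    previous   -- result from a fixed period earlier, or None if unavailable
    max_change -- preset limit on the change between the two periods
    """
    # Check 1 (assumption): the threshold is triggered when value exceeds it.
    exceeds_threshold = value > threshold
    # Check 2 (assumption): the change is the absolute difference.
    exceeds_change = previous is not None and abs(value - previous) > max_change
    return "abnormal" if exceeds_threshold or exceeds_change else "normal"


steady = mark_metric(value=40.0, threshold=100.0, previous=38.0, max_change=20.0)
over_limit = mark_metric(value=150.0, threshold=100.0, previous=140.0, max_change=20.0)
sudden_jump = mark_metric(value=90.0, threshold=100.0, previous=30.0, max_change=20.0)
```

The third call shows why the second check matters: a value can stay under its absolute threshold and still be marked abnormal because it changed sharply compared with the earlier period.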
8. Analysis reporter
The analysis reporter receives the Metrics data of all nodes and, based on the abnormal Metrics data, analyzes the operating-system-level Metrics of the K8S cluster software, the Metrics of the application programs, and the Metrics of the network. It assigns priorities to these Metrics: the higher the priority, the more likely the corresponding Metrics are abnormal.
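The text does not say how the priority is computed. Purely as an illustration, the sketch below ranks abnormal Metrics by how far each value exceeds its threshold; this scoring rule, the dictionary layout, and the function name are all assumptions of the example, and thresholds are assumed to be positive.

```python
from typing import Dict, List


def rank_abnormal(metrics: List[Dict]) -> List[Dict]:
    """Return the abnormal metrics ordered from highest to lowest priority.

    Assumed scoring: priority = value / threshold, so a metric that exceeds
    its limit by a larger factor is listed first.
    """
    abnormal = [m for m in metrics if m["mark"] == "abnormal"]
    return sorted(abnormal, key=lambda m: m["value"] / m["threshold"], reverse=True)


# Hypothetical samples from three layers on different nodes.
samples = [
    {"name": "run_queue_len", "layer": "os", "value": 12.0, "threshold": 8.0, "mark": "abnormal"},
    {"name": "request_latency_ms", "layer": "application", "value": 900.0, "threshold": 300.0, "mark": "abnormal"},
    {"name": "tcp_retransmits", "layer": "network", "value": 2.0, "threshold": 50.0, "mark": "normal"},
]
report = [m["name"] for m in rank_abnormal(samples)]
```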
The analysis reporter then sends out the analysis result, namely the abnormality cause report. The Metrics API of K8S automatically collects the Metrics and the data analysis results from kubelet.
The display dashboard shows the association relations among the Metrics of the several layers, connected by green lines. When data is abnormal, the corresponding green line turns red to indicate that an abnormality exists between those Metrics and that this is the root cause of the abnormality.
The embodiments of the application have the following technical effects. BCC resource types are defined through the CRD function of the K8S system, and BCC clients are deployed as a DaemonSet in the K8S system, so that every designated node automatically runs a BCC client. Combined with BCC programs, the monitored targets include the operating system, application software, the container environment and/or the network environment, which realizes multi-level anomaly analysis and detection for the K8S system. The result data of the BCC programs is analyzed through the BCC rules, so that detection results are analyzed and displayed automatically, ordinary development, testing, and operations personnel no longer have to analyze lengthy raw data directly, and the difficulty of use is reduced. After an application is deployed, a user only needs to select a BCC template at the K8S user side for each node to gain the BCC monitoring function; the details of the BCC programs do not need to be known, which lowers the barrier to use. Because the K8S system deployment personnel create corresponding BCC templates in advance for the various monitoring requirements, other technicians do not need to be familiar with BCC tools and can directly select a suitable BCC template according to their own requirements, and can thus use the fault detection function of the K8S system proficiently.
It should be understood that the specific order or hierarchy of steps in the processes disclosed are examples of exemplary approaches. Based on design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged without departing from the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not meant to be limited to the specific order or hierarchy presented.
In the foregoing detailed description, various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments of the subject matter require more features than are expressly recited in each claim. Rather, as the following claims reflect, the inventive subject matter lies in fewer than all features of a single disclosed embodiment. Thus the following claims are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate preferred embodiment of this application.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the aforementioned embodiments, but one of ordinary skill in the art may recognize that many further combinations and permutations of various embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications and variations that fall within the scope of the appended claims. Furthermore, as used in the specification or claims, the term "including" is intended to be inclusive in a manner similar to the term "comprising". Furthermore, any use of the term "or" in the specification or the claims is intended to mean a "non-exclusive or".
Those of skill in the art will further appreciate that the various illustrative logical blocks, units, and steps described in connection with the embodiments of the application may be implemented by electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, elements, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design requirements of the overall system. Those skilled in the art may implement the described functionality in varying ways for each particular application, but such implementation is not to be understood as beyond the scope of the embodiments of the present application.
The various illustrative logical blocks or units described in the embodiments of the application may be implemented or performed with a general purpose processor, a digital signal processor, an Application Specific Integrated Circuit (ASIC), a field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described. A general purpose processor may be a microprocessor, but in the alternative, the general purpose processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a digital signal processor and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a digital signal processor core, or any other similar configuration.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may be stored in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. In an example, a storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC, which may reside in a user terminal. In the alternative, the processor and the storage medium may reside as distinct components in a user terminal.
In one or more exemplary designs, the above-described functions of embodiments of the present application may be implemented in hardware, software, firmware, or any combination of the three. If implemented in software, the functions may be stored on a computer-readable medium or transmitted as one or more instructions or code on the computer-readable medium. Computer-readable media include both computer storage media and communication media that facilitate transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general purpose or special purpose computer. For example, such computer-readable media may include, but are not limited to, RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that may be used to carry or store program code in the form of instructions or data structures and that may be read by a general or special purpose computer, or a general or special purpose processor. Further, any connection is properly termed a computer-readable medium; for example, if the software is transmitted from a website, server, or other remote source via a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then these are also included in the definition of computer-readable medium. Disks and discs include compact discs, laser discs, optical discs, DVDs, floppy disks, and Blu-ray discs, where disks usually reproduce data magnetically, while discs usually reproduce data optically with lasers. Combinations of the above may also be included within computer-readable media.
The foregoing description of the embodiments has been provided for the purpose of illustrating the general principles of the application. It is not intended to limit the scope of protection of the application to the particular embodiments; any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the application are intended to be included within the scope of protection of the application.

Claims (10)

1. A fault monitoring system for a K8S cluster, comprising:
the K8S user side is used for selecting a target BCC template from preset BCC templates to be selected in the K8S cluster according to target monitoring requirements, generating a BCC tool CRD resource and a BCC rule CRD resource according to the target BCC template, and submitting the BCC tool CRD resource and the BCC rule CRD resource to the K8S cluster;
the monitoring device is deployed on each working node in the K8S cluster, is used for pulling the CRD resource of the BCC tool from the K8S cluster, analyzing the CRD resource of the BCC tool to obtain the BCC tool, monitoring the working node by running the BCC tool to obtain monitoring result data, and transmitting the monitoring result data to the monitoring result collecting device on the working node;
the monitoring result collection device is deployed on each working node in the K8S cluster, and is used for pulling the BCC rule CRD resource from the K8S cluster, analyzing the BCC rule CRD resource to obtain a monitoring rule, generating Metrics of the working node according to the monitoring rule and the monitoring result data of the working node, establishing a normal or abnormal mark for the Metrics according to the monitoring rule, and sending the marked Metrics to an analysis reporter;
and the analysis reporter is used for receiving the Metrics of each working node, analyzing all received Metrics and obtaining a fault analysis report.
2. The fault monitoring system of a K8S cluster of claim 1, wherein the system further comprises: a K8S control end;
the K8S control end is used for creating at least one corresponding BCC template to be selected according to the received information of the BCC tool and the monitoring rule corresponding to the specific monitoring requirement and submitting the BCC template to the K8S cluster.
3. The fault monitoring system of a K8S cluster according to claim 1, wherein the monitoring means comprises:
the BCC client is used for pulling the CRD resources of the BCC tools from the K8S cluster, analyzing the CRD resources of the BCC tools to obtain the BCC tools corresponding to the target monitoring requirements, and submitting the obtained BCC tools to a BCC tool executing device on the working node; and receiving monitoring result data from the BCC tool execution device on the working node, and transmitting the received monitoring result data to the monitoring result collection device on the working node;
and the BCC tool executing device is used for receiving the BCC tool from the BCC client on the working node, monitoring the working node by running the BCC tool in a daemon way, collecting monitoring result data and transmitting the monitoring result data to the BCC client on the working node.
4. The fault monitoring system of a K8S cluster according to claim 1, wherein the monitoring result collecting means includes:
the monitoring rule acquisition device is used for pulling the BCC rule CRD resources from the K8S cluster through a Kubelet client program deployed on the working node, analyzing the BCC rule CRD resources to obtain monitoring rules, and transmitting the monitoring rules obtained by analysis to the Metrics data generation device on the working node;
the Metrics data generation device is used for receiving the monitoring rules of the working node and the monitoring result data of the monitoring device on the working node, and packaging the received monitoring rules and the received monitoring result data into the Metrics data;
the Metrics data collector is used for collecting the Metrics data on the working node, judging, according to the monitoring rule packaged in the Metrics data, whether the monitoring result data packaged in the Metrics data triggers the threshold defined by the monitoring rule, judging whether the change value of the monitoring result data obtained in the current monitoring period relative to the monitoring result data before a fixed period exceeds a preset threshold, marking the Metrics data as abnormal if the monitoring result data packaged in the Metrics data triggers the threshold defined by the monitoring rule or the change value exceeds the preset threshold, otherwise marking the Metrics data as normal, and transmitting all Metrics data to the analysis reporter.
5. The fault monitoring system of a K8S cluster of claim 1, wherein the system further comprises:
the display dashboard is used for receiving the fault analysis report from the analysis reporter and displaying the fault analysis report in a visual mode.
6. A fault monitoring method for a K8S cluster, comprising:
selecting a target BCC template from preset BCC templates to be selected in the K8S cluster according to target monitoring requirements, generating a BCC tool CRD resource and a BCC rule CRD resource according to the target BCC template, and submitting the BCC tool CRD resource and the BCC rule CRD resource to the K8S cluster;
pulling the CRD resource of the BCC tool from the K8S cluster, analyzing the CRD resource of the BCC tool to obtain the BCC tool, and monitoring the working node by running the BCC tool to obtain monitoring result data on the working node;
pulling the BCC rule CRD resource from the K8S cluster, analyzing based on the BCC rule CRD resource to obtain a monitoring rule, generating a Metrics of the working node according to the monitoring rule and monitoring result data of the working node, and establishing a normal or abnormal mark for the Metrics according to the monitoring rule;
and receiving each Metrics of each working node, and analyzing all received Metrics to obtain a fault analysis report.
7. The fault monitoring method of a K8S cluster of claim 6, further comprising:
and creating at least one corresponding BCC template to be selected according to the received information of the BCC tool and the monitoring rule corresponding to the specific monitoring requirement, and submitting the BCC template to the K8S cluster.
8. The fault monitoring method of the K8S cluster according to claim 6, wherein the pulling the BCC tool CRD resource from the K8S cluster, obtaining a BCC tool based on the BCC tool CRD resource resolution, and monitoring the working node by running the BCC tool, to obtain monitoring result data on the working node, includes:
pulling the BCC tool CRD resource from the K8S cluster, and analyzing to obtain a BCC tool corresponding to the target monitoring requirement on the working node based on the BCC tool CRD resource;
and on the working node, monitoring the working node by running the BCC tool in a daemon mode, and collecting monitoring result data.
9. The fault monitoring method of a K8S cluster according to claim 6, wherein pulling the BCC rule CRD resource from the K8S cluster comprises:
pulling the BCC rule CRD resource from the K8S cluster through a Kubelet client program deployed on the working node;
the establishing normal or abnormal mark for the Metrics according to the monitoring rule comprises the following steps:
and collecting the Metrics data on the working node, judging whether the monitoring result data packaged in the Metrics triggers the threshold defined by the monitoring rule according to the monitoring rule packaged in the Metrics data, judging whether the change value of the monitoring result data obtained in the current monitoring period relative to the monitoring result data before the fixed period exceeds a preset threshold, and marking the Metrics data as abnormal if judging that the monitoring result data packaged in the Metrics triggers the threshold defined by the monitoring rule or the change value exceeds the preset threshold, otherwise marking the Metrics data as normal.
10. The fault monitoring method of a K8S cluster of claim 6, further comprising:
and displaying the fault analysis report in a visual mode.
CN202310756919.2A 2023-06-26 2023-06-26 Fault monitoring system and method for K8S cluster Pending CN116719608A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310756919.2A CN116719608A (en) 2023-06-26 2023-06-26 Fault monitoring system and method for K8S cluster

Publications (1)

Publication Number Publication Date
CN116719608A (en) 2023-09-08

Family

ID=87873205

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310756919.2A Pending CN116719608A (en) 2023-06-26 2023-06-26 Fault monitoring system and method for K8S cluster

Country Status (1)

Country Link
CN (1) CN116719608A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination