CN114356499A - Kubernetes cluster alarm root cause analysis method and device - Google Patents

Kubernetes cluster alarm root cause analysis method and device

Info

Publication number
CN114356499A
Authority
CN
China
Prior art keywords
alarm
time period
target time
cluster
root cause
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111620209.4A
Other languages
Chinese (zh)
Inventor
杨启航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Inspur Scientific Research Institute Co Ltd
Original Assignee
Shandong Inspur Scientific Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Inspur Scientific Research Institute Co Ltd
Priority to CN202111620209.4A
Publication of CN114356499A

Abstract

The invention provides a Kubernetes cluster alarm root cause analysis method and device. The method comprises: acquiring alarm information of the Kubernetes cluster in a first target time period based on the alarm messages and/or logs of the Kubernetes cluster in the first target time period; and, based on a preset alarm root cause analysis rule, performing alarm root cause analysis on the alarm information of the Kubernetes cluster in the first target time period to acquire a first root cause fault corresponding to that alarm information. The method and device can quickly analyze the association relations among alarms and quickly locate the root fault point that produced them, which improves the efficiency of Kubernetes cluster alarm root cause analysis, reduces the time spent by cluster operation and maintenance personnel, and lowers the manual monitoring cost of the cluster environment.

Description

Kubernetes cluster alarm root cause analysis method and device
Technical Field
The invention relates to the technical field of computers, in particular to a Kubernetes cluster alarm root cause analysis method and device.
Background
A Kubernetes cluster is used to manage containerized applications across multiple hosts in a cloud platform. Kubernetes is an open-source platform that provides automatic deployment, automatic scaling up and down, and maintenance of container clusters. A Kubernetes cluster may include multiple Kubernetes nodes, each of which may run one or more pods.
A Kubernetes cluster can deploy applications rapidly, scale applications rapidly, integrate new application functions seamlessly, and save resources by optimizing the use of hardware.
The main functions of a Kubernetes cluster include: cooperation among multiple pods; mounting storage systems; health checks on applications; replication of application instances; automatic scaling of pods; service registration and discovery; load balancing; rolling updates; resource monitoring; log access; application debugging; and authentication and authorization.
At present, the operation and maintenance of the Kubernetes cluster are mainly based on manual analysis, and after an alarm occurs, operations such as troubleshooting, log checking and the like need to be performed manually, so that root cause alarm is determined. Therefore, the prior art has the defects of low efficiency and the like.
Disclosure of Invention
The invention provides a Kubernetes cluster alarm root cause analysis method and a Kubernetes cluster alarm root cause analysis device, which are used for solving the defect of low efficiency in the prior art and realizing efficient and automatic alarm root cause analysis on a Kubernetes cluster.
The invention provides a Kubernetes cluster alarm root cause analysis method, which comprises the following steps:
acquiring alarm information of a Kubernetes cluster in a first target time period based on alarm messages and/or logs of the Kubernetes cluster in the first target time period;
and based on a preset alarm root cause analysis rule, carrying out alarm root cause analysis on the alarm information of the Kubernetes cluster in the first target time period, and acquiring a first root cause fault corresponding to the alarm information in the first target time period.
According to the Kubernetes cluster alarm root cause analysis method provided by the invention, acquiring the alarm information of the Kubernetes cluster in the first target time period based on the alarm message of the Kubernetes cluster in the first target time period specifically comprises the following steps:
monitoring the Kubernetes cluster based on Prometheus, and acquiring an alarm message of the Kubernetes cluster in the first target time period;
and analyzing the alarm message of the Kubernetes cluster in the first target time period to acquire first alarm information of the Kubernetes cluster in the first target time period.
According to the Kubernetes cluster alarm root cause analysis method provided by the invention, alarm information of a Kubernetes cluster in a first target time period is acquired based on a log of the Kubernetes cluster in the first target time period, and the method specifically comprises the following steps:
acquiring a log of the Kubernetes cluster in the first target time period based on an EFK log system and/or kube-eventer;
and analyzing the log of the Kubernetes cluster in the first target time period to acquire second alarm information of the Kubernetes cluster in the first target time period.
According to the Kubernetes cluster alarm root cause analysis method provided by the invention, after the alarm information of the Kubernetes cluster in the first target time period is acquired based on the alarm message and/or the log of the Kubernetes cluster in the first target time period, the method further comprises the following steps:
acquiring historical alarm information of a second target time period of the Kubernetes cluster;
and performing alarm root cause analysis on the alarm information of the Kubernetes cluster in the first target time period and the historical alarm information of the second target time period based on the alarm root cause analysis rule, and acquiring a second root cause fault corresponding to the alarm information of the first target time period.
According to the Kubernetes cluster alarm root cause analysis method provided by the invention, monitoring the Kubernetes cluster based on Prometheus and acquiring the alarm message of the Kubernetes cluster in the first target time period specifically comprises the following steps:
monitoring the Kubernetes cluster based on each preset alarm rule to acquire an alarm message of the Kubernetes cluster in the first target time period;
the alarm rule is constructed by nesting a plurality of PromQL expressions.
According to the Kubernetes cluster alarm root cause analysis method provided by the invention, after the alarm root cause analysis is performed on the alarm information of the Kubernetes cluster in the first target time period based on the preset alarm root cause analysis rule and the first root cause fault corresponding to the alarm information of the first target time period is acquired, the method further comprises:
sending the first root cause fault corresponding to the alarm information of the Kubernetes cluster in the first target time period, together with that alarm information, to a client, so that the client marks the Kubernetes nodes and pods that generated alarms on a topological graph of the Kubernetes cluster.
The invention also provides a Kubernetes cluster alarm root cause analysis device, comprising:
an information acquisition module, used for acquiring the alarm information of the Kubernetes cluster in the first target time period based on the alarm message and/or the log of the Kubernetes cluster in the first target time period;
and the alarm analysis module is used for carrying out alarm root cause analysis on the alarm information of the Kubernetes cluster in the first target time period based on a preset alarm root cause analysis rule, and acquiring a first root cause fault corresponding to the alarm information of the first target time period.
The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of any of the Kubernetes cluster alarm root cause analysis methods described above.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of any of the Kubernetes cluster alarm root cause analysis methods described above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, carries out the steps of any of the Kubernetes cluster alarm root cause analysis methods described above.
The Kubernetes cluster alarm root cause analysis method and device provided by the invention acquire alarm information of the Kubernetes cluster in a first target time period based on the alarm messages and/or logs of the Kubernetes cluster in that period, perform alarm root cause analysis on the alarm information based on a preset alarm root cause analysis rule, and acquire a first root cause fault corresponding to the alarm information of the first target time period. In this way, the association relations among alarms can be analyzed quickly, the various kinds of alarm information in the cluster environment can be collected completely, and the root fault point that produced the alarms can be located quickly. This improves the efficiency of Kubernetes cluster alarm root cause analysis, reduces the time spent by cluster operation and maintenance personnel, and lowers the manual monitoring cost of the cluster environment, so that the cluster returns to its normal state more quickly and the impact of cluster anomalies on services is reduced. The operation and maintenance side no longer needs to watch the cluster running state closely; troubleshooting and cluster monitoring become more efficient, faults are resolved sooner, service interruption time is shortened, and operation and maintenance can develop in a more intelligent and efficient direction.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic flow chart of a Kubernetes cluster alarm root cause analysis method provided by the present invention;
FIG. 2 is an architecture diagram of a Kubernetes cluster alarm root cause analysis method provided by the present invention;
FIG. 3 is a timing diagram of a Kubernetes cluster alarm root cause analysis method provided by the present invention;
FIG. 4 is a flow chart of a Kubernetes cluster alarm root cause analysis method provided by the present invention;
FIG. 5 is a schematic structural diagram of a Kubernetes cluster alarm root cause analysis device provided by the present invention;
FIG. 6 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the embodiments of the invention, the terms "first", "second", and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance, nor order.
In the description of the embodiments of the present invention, it should be noted that, unless explicitly stated or limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be, for example, fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. Specific meanings of the above terms in the embodiments of the present invention can be understood in specific cases by those of ordinary skill in the art.
The Kubernetes cluster alarm root cause analysis method and device provided by the invention are described below with reference to FIGS. 1 to 6.
FIG. 1 is a schematic flow chart of the Kubernetes cluster alarm root cause analysis method provided in the present application. The Kubernetes cluster alarm root cause analysis method according to the embodiment of the present application is described below with reference to FIG. 1. As shown in FIG. 1, the method includes step 101 and step 102.
Specifically, the method provided by the embodiment of the present invention is executed by a Kubernetes cluster alarm root cause analysis device. The device can be deployed in a distributed manner in the Kubernetes cluster.
Step 101, acquiring alarm information of the Kubernetes cluster in the first target time period based on the alarm message and/or the log of the Kubernetes cluster in the first target time period.
Specifically, before step 101, a plurality of virtual machines have been created in advance, and a Kubernetes cluster is deployed. Preferably, an odd number of virtual machines may be created in advance.
The alarm information of the Kubernetes cluster can be collected through multiple sources and multiple modes.
The first target time period is a preset time period with a certain duration. The embodiment of the present invention is not particularly limited with respect to the specific duration of the first target time period.
Preferably, the first target time period is the current time period. For example, when the period length is one day (24 hours), the first target time period may be the current day; when the period length is one hour and the current time is 10:50, the first target time period may be 10:00-11:00.
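As an illustration of how the current period might be computed, the following Python sketch aligns fixed-length windows to midnight and returns the window that contains the current time. The function name `current_period` and the sample date are assumptions made for this example and do not come from the embodiment; the sketch also assumes that the period length divides 24 hours evenly.

```python
from datetime import datetime, timedelta

def current_period(now, length):
    """Return the (start, end) window of the given length that contains `now`.

    Windows are aligned to midnight of the same day (an assumption of this sketch).
    """
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    index = (now - midnight) // length  # number of whole windows since midnight
    start = midnight + index * length
    return start, start + length

# Current time 10:50 with a one-hour period (the date itself is arbitrary).
start, end = current_period(datetime(2021, 12, 21, 10, 50), timedelta(hours=1))
print(start.time(), end.time())
```

With a one-day period the same function returns the whole current day, which matches the first example in the paragraph above.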
From the time of deployment, the Kubernetes cluster may generate a large number of redundant logs and alarm messages. All log information on the Kubernetes cluster can be collected in real time and stored in the Elasticsearch (ES) library.
Optionally, the Kubernetes cluster may be monitored by any cluster monitoring method (e.g., Prometheus, Zabbix, or Open-Falcon), and an alarm message of the Kubernetes cluster in the first target time period is acquired; the node and pod that generated the alarm, together with alarm information such as the level, type, occurrence time and specific description of the alarm, are then extracted from the alarm message of the Kubernetes cluster in the first target time period.
Optionally, the log of the Kubernetes cluster in the first target time period may be acquired based on any log generation method; the node and pod that generated the alarm, together with alarm information such as the level, type, occurrence time and specific description of the alarm, are then extracted from the log of the Kubernetes cluster in the first target time period.
It should be noted that customized alarm information is supported, that is, the format of the alarm information can be customized.
Step 102, based on a preset alarm root cause analysis rule, performing alarm root cause analysis on the alarm information of the Kubernetes cluster in the first target time period, and acquiring a first root cause fault corresponding to the alarm information of the first target time period.
Specifically, alarm root cause analysis may be performed on the alarm information of the Kubernetes cluster in the first target time period based on the association relationship field in the alarm root cause analysis rule, so as to determine the causal relationships among the alarms of the Kubernetes cluster in the first target time period. The result of the analysis is the first root cause fault corresponding to the alarm information of the Kubernetes cluster in the first target time period, where the first root cause fault is one or more alarms in the first target time period.
Optionally, an alarm root cause analysis component is deployed with kubectl apply as a DaemonSet resource. The alarm root cause analysis component is used for performing alarm root cause analysis on the alarm information of the Kubernetes cluster in the first target time period based on the preset alarm root cause analysis rule and acquiring the first root cause fault corresponding to the alarm information of the first target time period.
Alternatively, the alarm root cause analysis result (first root cause fault) may be returned to the specified Kafka in JSON form.
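The embodiment does not spell out the format of the alarm root cause analysis rule. As a minimal sketch, assume the association relationship field maps each alarm type to the alarm types that can cause it, and that two alarms are related only when they occur on the same node. The rule table `CAUSED_BY`, the alarm type names and the `Alarm` fields below are all hypothetical.

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class Alarm:
    type: str
    node: str
    pod: str = ""

# Hypothetical rule table: alarm type -> alarm types that can cause it.
CAUSED_BY = {
    "PodCrashLooping": {"NodeNotReady", "NodeDiskPressure"},
    "NodeNotReady": {"NodeDiskPressure"},
}

def root_causes(alarms):
    """Return the alarms that no other alarm on the same node can explain."""
    roots = []
    for a in alarms:
        causes = CAUSED_BY.get(a.type, set())
        explained = any(b.type in causes and b.node == a.node for b in alarms)
        if not explained:
            roots.append(a)
    return roots

alarms = [
    Alarm("PodCrashLooping", "node-1", "web-0"),
    Alarm("NodeNotReady", "node-1"),
    Alarm("NodeDiskPressure", "node-1"),
    Alarm("PodCrashLooping", "node-2", "db-0"),
]
# Serialize the result as JSON, as it would be before being handed to Kafka.
result = json.dumps([asdict(a) for a in root_causes(alarms)])
print(result)
```

In this example the disk pressure alarm on node-1 explains the other two alarms on that node, while the crash loop on node-2 has no explaining alarm and is reported as its own root cause. A cyclic rule table would leave every alarm in the cycle explained, so a real rule set would need to be acyclic or handled separately.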
It should be noted that the Kubernetes cluster alarm root cause analysis device supports data protection in abnormal situations such as power failure. When the environment loses power or crashes unexpectedly, the state before the abnormality can be recovered by restarting, and the alarm information is kept as of the latest time before the abnormality occurred. This avoids repeated sending of historical alarm information after the abnormality, which would otherwise produce redundant alarm data and interfere with troubleshooting by operation and maintenance personnel.
The embodiment of the invention acquires alarm information of the Kubernetes cluster in the first target time period based on the alarm messages and/or logs of the Kubernetes cluster in that period, performs alarm root cause analysis on the alarm information based on the preset alarm root cause analysis rule, and acquires the first root cause fault corresponding to the alarm information of the first target time period. In this way, the association relations among alarms can be analyzed quickly, the various kinds of alarm information in the cluster environment can be collected completely, and the root fault point that produced the alarms can be located quickly. This improves the efficiency of Kubernetes cluster alarm root cause analysis, reduces the time spent by cluster operation and maintenance personnel, and lowers the manual monitoring cost of the cluster environment, so that the cluster returns to its normal state more quickly and the impact of cluster anomalies on services is reduced. The operation and maintenance side no longer needs to watch the cluster running state closely; troubleshooting and cluster monitoring become more efficient, faults are resolved sooner, service interruption time is shortened, and operation and maintenance can develop in a more intelligent and efficient direction.
Based on the content of any of the above embodiments, acquiring the alarm information of the Kubernetes cluster in the first target time period based on the alarm message of the Kubernetes cluster in the first target time period specifically includes: monitoring the Kubernetes cluster based on Prometheus, and acquiring the alarm message of the Kubernetes cluster in the first target time period.
Specifically, the Kubernetes cluster may be deployed with components such as Prometheus and Kafka.
The Kafka component can be used as message middleware for transmitting alarm information and alarm root cause analysis results inside and outside the Kubernetes cluster alarm root cause analysis device.
Prometheus is an open-source system monitoring and alerting framework. Prometheus provides a multi-dimensional data model and a flexible query mode, supports local storage on server nodes, defines an open metric data standard, discovers monitoring targets through static file configuration and dynamic discovery mechanisms, completes data collection automatically, and is easy to maintain. It also supports partitioned sampling and federated deployment, and therefore supports monitoring of large-scale clusters.
The Kubernetes cluster can be monitored through the Prometheus component to obtain the alarm message of the Kubernetes cluster in the first target time period.
And analyzing the alarm message of the Kubernetes cluster in the first target time period to acquire first alarm information of the Kubernetes cluster in the first target time period.
Specifically, http.ListenAndServe in the http package may be called to listen on a local port and receive the alert message body sent by Alertmanager.
To analyze the alarm message, http.HandleFunc can be called to register a handler that parses the alarm message body and extracts key abnormal information such as the node, namespace and pod, the cause of the alarm, and the specific abnormal details. The result is a structured message body (namely, the first alarm information), which is sent to a specified Kafka topic as a JSON string. This reduces the impact on the cluster of reading and writing large amounts of data, and of reading large amounts of useless information.
Optionally, the structured message body may name the alarm information in the node_namespace_pod format.
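The parsing step can be sketched as follows. The sketch is written in Python rather than with the http.HandleFunc call named above, and it covers only the parsing of a message body, not the HTTP listener or the Kafka producer. It relies on the Alertmanager webhook payload carrying an `alerts` list whose entries have `labels`, `annotations` and `startsAt`; the presence of `node`, `namespace` and `pod` labels, and the output field names, are assumptions that depend on how the alarm rules are written.

```python
import json

def parse_alertmanager_body(body):
    """Turn an Alertmanager webhook body into structured alarm messages."""
    payload = json.loads(body)
    messages = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        node = labels.get("node", "")
        namespace = labels.get("namespace", "")
        pod = labels.get("pod", "")
        messages.append({
            "name": f"{node}_{namespace}_{pod}",  # node_namespace_pod naming
            "node": node,
            "namespace": namespace,
            "pod": pod,
            "type": labels.get("alertname", ""),
            "level": labels.get("severity", ""),
            "time": alert.get("startsAt", ""),
            "description": annotations.get("description", ""),
        })
    return messages

sample = json.dumps({
    "alerts": [{
        "status": "firing",
        "labels": {"alertname": "PodCrashLooping", "severity": "warning",
                   "node": "node-1", "namespace": "default", "pod": "web-0"},
        "annotations": {"description": "container restarted repeatedly"},
        "startsAt": "2021-12-21T10:50:00Z",
    }]
}).encode()

# Each structured message would then be sent to Kafka as a JSON string.
for message in parse_alertmanager_body(sample):
    print(json.dumps(message))
```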
It should be noted that, before the alarm message of the Kubernetes cluster in the first target time period is analyzed, it may be filtered, deduplicated, compressed, truncated, and marked.
The comprehensively collected cluster logs and alarm messages contain a large amount of redundant and repeated data. This information can be further filtered and screened, after which key abnormal information is extracted and recombined in a uniform format to obtain a structured message body.
Filtering, screening and recombining the logs and alarm messages formats the abnormal information and removes redundant information, which improves troubleshooting efficiency.
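A minimal sketch of the filtering, deduplication and truncation steps is given below. The set of levels that are kept, the deduplication key and the 200-character limit are assumptions chosen for illustration; compression and marking are omitted.

```python
KEPT_LEVELS = {"warning", "error", "critical"}  # assumed set of relevant levels

def clean(messages, max_len=200):
    """Filter by level, drop repeats of the same alarm, and truncate descriptions."""
    seen = set()
    kept = []
    for m in messages:
        if m.get("level") not in KEPT_LEVELS:
            continue  # filter: drop informational records
        key = (m.get("node"), m.get("namespace"), m.get("pod"), m.get("type"))
        if key in seen:
            continue  # deduplicate: same alarm on the same object
        seen.add(key)
        # truncate: copy the record with a bounded description
        kept.append({**m, "description": m.get("description", "")[:max_len]})
    return kept

raw = [
    {"node": "node-1", "namespace": "default", "pod": "web-0",
     "type": "PodCrashLooping", "level": "warning", "description": "x" * 500},
    {"node": "node-1", "namespace": "default", "pod": "web-0",
     "type": "PodCrashLooping", "level": "warning", "description": "repeat"},
    {"node": "node-1", "namespace": "default", "pod": "web-0",
     "type": "Heartbeat", "level": "info", "description": "ok"},
]
cleaned = clean(raw)
print(len(cleaned), len(cleaned[0]["description"]))
```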
Optionally, the first alarm information is a structured message body, and after the first alarm information of the Kubernetes cluster in the first target time period is acquired, it may be persistently stored in a storage system such as Redis.
The embodiment of the invention monitors the Kubernetes cluster based on Prometheus, acquires the alarm message of the Kubernetes cluster in the first target time period, and analyzes that alarm message to acquire the first alarm information of the Kubernetes cluster in the first target time period, so that the alarm information of the Kubernetes cluster can be acquired more quickly and comprehensively.
Based on the content of any of the above embodiments, monitoring the Kubernetes cluster based on Prometheus and acquiring the alarm message of the Kubernetes cluster in the first target time period specifically includes: monitoring the Kubernetes cluster based on each preset alarm rule to acquire an alarm message of the Kubernetes cluster in the first target time period; each alarm rule is constructed by nesting a plurality of PromQL expressions.
Specifically, the generation of the alarm message may be based on an alarm rule base accumulated in daily work.
The alarm rule base may include a plurality of alarm rules. Each alarm rule is constructed by nesting a plurality of PromQL expressions.
QL stands for query language. The query language of Prometheus is PromQL.
Prometheus can acquire the alarm message based on alarm rules constructed by nesting a plurality of PromQL expressions.
According to the embodiment of the invention, the alarm message of the Kubernetes cluster in the first target time period is obtained through alarm rules constructed by nesting a plurality of PromQL expressions, so that the alarm message can be obtained more quickly and comprehensively.
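To show what nesting PromQL expressions can look like, the following Python sketch composes an alarm condition from three small builders: a `rate` over a range selector, a `sum by` aggregation around it, and a threshold comparison around that. The helper names, the choice of the cAdvisor metric `container_cpu_usage_seconds_total` and the 0.9 threshold are assumptions for illustration; the embodiment does not list its actual rules.

```python
def rate(selector, window):
    """Wrap a metric selector in a PromQL rate() over a range window."""
    return f"rate({selector}[{window}])"

def sum_by(labels, expr):
    """Aggregate an expression with PromQL 'sum by (...)'."""
    return f"sum by ({', '.join(labels)}) ({expr})"

def threshold(expr, op, value):
    """Compare an expression against a constant, forming the alarm condition."""
    return f"{expr} {op} {value}"

# Nested expression: per-pod CPU usage rate over 5 minutes above 0.9 cores.
expr = threshold(
    sum_by(["namespace", "pod"], rate("container_cpu_usage_seconds_total", "5m")),
    ">",
    0.9,
)
print(expr)
```

The resulting string could be used as the `expr` field of a Prometheus alerting rule.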
Based on the content of any of the above embodiments, acquiring alarm information of the Kubernetes cluster in the first target time period based on the log of the Kubernetes cluster in the first target time period specifically includes: acquiring a log of the Kubernetes cluster in the first target time period based on the EFK log system and/or kube-eventer.
Specifically, the Kubernetes cluster may be deployed with components such as the EFK log system, kube-eventer, and Kafka.
The EFK log system includes Elasticsearch (ES), Filebeat, and Kibana. Elasticsearch is responsible for log storage and search, Filebeat is responsible for log collection, and Kibana provides the interface.
When alarm information is read from the ES library, a position variable can first be set to record the current read position and be stored in the ES library. The alarm information is then sorted in ascending order by its characteristic value, and the resulting sort value is used to judge whether a given piece of alarm information has already been read. The maximum sort value of the alarm information read so far is assigned to the position variable and stored in the ES library; the next read starts from alarm information whose sort value is greater than the position variable. Following this logic, the alarm information in the ES library is polled at intervals of 1 minute (1 minute is an example and is not limiting). Finally, the alarm information is formatted and sent to the designated Kafka topic.
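The position-variable logic can be sketched without a real Elasticsearch instance. In the Python sketch below, `AlarmStore` is an in-memory stand-in for the ES library, each record carries a numeric `sort` value, and `poll_once` performs one polling round; all of these names are hypothetical. A caller would invoke `poll_once` once per polling interval.

```python
class AlarmStore:
    """In-memory stand-in for the ES library (an assumption of this sketch)."""

    def __init__(self):
        self.records = []   # alarm records, each with a sortable "sort" value
        self.position = 0   # persisted read position (largest sort value read)

def poll_once(store):
    """Read the records that lie beyond the stored position, in ascending order."""
    unread = sorted(
        (r for r in store.records if r["sort"] > store.position),
        key=lambda r: r["sort"],
    )
    if unread:
        # Remember the largest sort value so the next poll starts after it.
        store.position = unread[-1]["sort"]
    return unread

store = AlarmStore()
store.records += [{"sort": 2, "msg": "b"}, {"sort": 1, "msg": "a"}]
first = poll_once(store)    # reads both records, in ascending order
store.records.append({"sort": 3, "msg": "c"})
second = poll_once(store)   # reads only the new record
print([r["msg"] for r in first], [r["msg"] for r in second])
```

Because the position is stored alongside the data, a restarted reader resumes after the last record it processed instead of re-sending older alarms.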
Multi-instance concurrent operation must also be reliable. Because the alarm analysis module is deployed as a cluster using a Deployment resource, multiple replicas may read and write the ES library at the same time, so concurrency control is needed to prevent conflicting or abnormal write operations. A writable variable is set for the ES library to record whether the ES library can currently be written. When the writable variable indicates writable, the read/write operation is performed directly; when it indicates non-writable, the current program sleeps for 3 seconds (3 seconds is an example and is not limiting) and then retries the read/write operation.
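The retry behaviour can be sketched as follows. The store is modelled as a plain dictionary with a `writable` flag, and the `max_attempts` bound and the injectable `sleep` parameter are additions of this sketch, made so that the loop terminates and can be exercised without waiting. Note that checking and then clearing a flag is not atomic across replicas; a real deployment would need the storage system to provide an atomic check-and-set, which this sketch does not attempt to show.

```python
import time

def write_with_retry(store, record, retry_delay=3.0, max_attempts=5, sleep=time.sleep):
    """Write `record` when the store is writable, otherwise sleep and retry."""
    for _ in range(max_attempts):
        if store["writable"]:
            store["writable"] = False      # mark the store as busy
            try:
                store["records"].append(record)
            finally:
                store["writable"] = True   # release the store again
            return True
        sleep(retry_delay)                 # wait before the next attempt
    return False

store = {"writable": True, "records": []}
print(write_with_retry(store, {"id": 1}))
```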
kube-eventer is an open-source component for Kubernetes used in event monitoring, alarm, and ChatOps scenarios.
And analyzing the log of the Kubernetes cluster in the first target time period to acquire second alarm information of the Kubernetes cluster in the first target time period.
Specifically, to analyze the log, time.NewTicker can be called to construct a timer. Every 60 seconds, log information in the ES library is queried in batches through an index and sorted, logs at the error and warning levels are screened out, and key abnormal information such as the node, namespace and pod, the cause, and the specific abnormal details is extracted from them. The result is a structured message body (namely, the second alarm information), which is sent to a specified Kafka topic as a JSON string. This reduces the impact on the cluster of reading and writing large amounts of data, and of reading large amounts of useless information.
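One round of this log analysis might look like the Python sketch below, which would be run once per timer tick. The input is assumed to be a batch of log records already fetched from the ES library, and the field names (`level`, `node`, `namespace`, `pod`, `reason`, `message`, `sort`) are hypothetical. The ticker itself, the ES query and the Kafka producer are left out.

```python
import json

def logs_to_alarms(hits):
    """Keep error and warning logs and emit structured alarms as JSON strings."""
    alarms = []
    for hit in sorted(hits, key=lambda h: h.get("sort", 0)):
        level = hit.get("level", "").lower()
        if level not in {"error", "warning"}:
            continue  # screen out logs below the warning level
        node = hit.get("node", "")
        namespace = hit.get("namespace", "")
        pod = hit.get("pod", "")
        alarms.append(json.dumps({
            "name": f"{node}_{namespace}_{pod}",
            "level": level,
            "reason": hit.get("reason", ""),
            "detail": hit.get("message", ""),
        }, sort_keys=True))
    return alarms

batch = [
    {"sort": 2, "level": "INFO", "node": "node-1", "namespace": "default",
     "pod": "web-0", "reason": "Started", "message": "container started"},
    {"sort": 1, "level": "ERROR", "node": "node-1", "namespace": "default",
     "pod": "web-0", "reason": "BackOff", "message": "back-off restarting container"},
]
for line in logs_to_alarms(batch):
    print(line)
```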
Optionally, the structured message body may name the alarm information in the node_namespace_pod format.
It should be noted that, before the log of the Kubernetes cluster in the first target time period is parsed, it may be filtered, deduplicated, compressed, truncated, and marked.
The comprehensively collected cluster logs and alarm messages contain a large amount of redundant and repeated data. This information can be further filtered and screened, after which key abnormal information is extracted and recombined in a uniform format to obtain a structured message body.
Filtering, screening and recombining the logs and alarm messages formats the abnormal information and removes redundant information, which improves troubleshooting efficiency.
Optionally, the second alarm information is a structured message body. After the second alarm information of the Kubernetes cluster in the first target time period is acquired, it may be persistently stored in a storage system such as Redis.
The embodiment of the invention monitors the Kubernetes cluster based on Prometheus, acquires the log of the Kubernetes cluster in the first target time period, parses the log, and acquires the second alarm information of the Kubernetes cluster in the first target time period, so that the alarm information of the Kubernetes cluster can be acquired more quickly and comprehensively.
Based on the content of any of the above embodiments, after the alarm information of the Kubernetes cluster in the first target time period is acquired based on the alarm message and/or the log of the Kubernetes cluster in the first target time period, the method further includes: acquiring historical alarm information of the Kubernetes cluster in a second target time period.
Specifically, when historical alarm information is persistently stored in a storage module such as Redis, the historical alarm information of a second target time period preceding the first target time period may also be acquired.
The second target time period is a preset time period of a certain duration. The specific duration of the second target time period is not specifically limited in the embodiments of the present invention.
For example, where the first target time period is the current day (say, the 21st of a certain month), the second target time period may be the previous day (the 20th of that month), the three days before the current day (the 18th to the 20th of that month), and so on; where the first target time period is 19:00-20:00, the second target time period may be 18:00-19:00, 17:00-19:00, and so on.
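The relationship between the two time periods in these examples can be sketched as a look-back window that ends where the first target time period begins. The function name `second_target_period` and the `lookback` parameter are assumptions made for this sketch, and the concrete date is arbitrary; the source leaves the duration open.

```python
from datetime import datetime, timedelta


def second_target_period(first_start, lookback):
    """Return (start, end) of the window of length `lookback` that ends
    at the start of the first target time period."""
    return first_start - lookback, first_start


# First target time period 19:00-20:00, looking back one hour.
start, end = second_target_period(datetime(2021, 12, 21, 19, 0), timedelta(hours=1))
print(start.strftime("%H:%M"), end.strftime("%H:%M"))  # 18:00 19:00
```

Passing `timedelta(days=3)` with a first period that starts at midnight on the 21st gives a window covering the 18th to the 20th.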
Based on the alarm root cause analysis rule, alarm root cause analysis is then performed on the alarm information of the Kubernetes cluster in the first target time period and the historical alarm information of the second target time period, and a second root cause fault corresponding to the alarm information of the first target time period is acquired.
Specifically, alarm root cause analysis may be performed on the alarm information of the Kubernetes cluster in the first target time period and the historical alarm information of the Kubernetes cluster in the second target time period based on the association relationship field in the alarm root cause analysis rule. This determines the causal relationships between the alarms of the Kubernetes cluster in the first target time period and the second target time period and yields the alarm root cause analysis result, that is, the second root cause fault corresponding to the alarm information of the Kubernetes cluster in the first target time period. The second root cause fault is an alarm in the first target time period or the second target time period. There may be one or more second root cause faults.
Optionally, the alarm root cause analysis result (the second root cause fault) may be returned to the specified Kafka topic in JSON form.
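A minimal sketch of this analysis follows, assuming the association relationship field can be read as a mapping from an alarm type to the alarm types that may cause it. The rule contents, the alarm type names, and the `root_causes` JSON key are hypothetical, and the sketch matches alarms by type only, ignoring which node or pod raised them.

```python
import json

# Hypothetical rules: alarm type -> alarm types that can cause it.
RULES = {
    "PodCrashLooping": ["NodeNotReady"],
    "NodeNotReady": ["NodeDiskPressure"],
}


def find_root_causes(current, history, rules):
    """Follow cause links upward from each current alarm until reaching
    an alarm for which none of the possible causes is present."""
    present = {a["type"] for a in current} | {a["type"] for a in history}
    roots = set()
    for alarm in current:
        stack, visited = [alarm["type"]], set()
        while stack:
            alarm_type = stack.pop()
            if alarm_type in visited:
                continue  # guards against cyclic rules
            visited.add(alarm_type)
            causes = [c for c in rules.get(alarm_type, []) if c in present]
            if causes:
                stack.extend(causes)
            else:
                roots.add(alarm_type)
    return sorted(roots)


current = [{"type": "PodCrashLooping"}, {"type": "NodeNotReady"}]
history = [{"type": "NodeDiskPressure"}]
result = json.dumps({"root_causes": find_root_causes(current, history, RULES)})
```

With the historical alarm present, the root cause lies in the second target time period (`NodeDiskPressure`); with an empty history, the same call reports `NodeNotReady` from the first target time period.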
The embodiment of the invention performs alarm root cause analysis on the alarm information of the Kubernetes cluster in the first target time period and the historical alarm information of the second target time period based on the alarm root cause analysis rule, and acquires the second root cause fault corresponding to the alarm information of the first target time period. It can quickly analyze the associations among all alarms, completely collect the various alarm information of the cluster environment, and quickly locate the root fault point that produced the alarms. This improves the efficiency of Kubernetes cluster alarm root cause analysis, minimizes the time consumed by cluster operation and maintenance personnel, and reduces the manual monitoring cost of the cluster environment, so that the cluster returns to its normal state more quickly and the impact of cluster abnormalities on services is reduced. The operation and maintenance side no longer needs to pay excessive attention to the running state of the cluster; its troubleshooting efficiency, cluster monitoring efficiency, and fault resolution rate are improved, service interruption time is shortened as far as possible, and operation and maintenance can develop in a more intelligent and efficient direction.
Based on the content of any of the above embodiments, after alarm root cause analysis is performed on the alarm information of the Kubernetes cluster in the first target time period based on the preset alarm root cause analysis rule and the first root cause fault corresponding to the alarm information of the first target time period is acquired, the method further includes: sending the alarm information of the Kubernetes cluster in the first target time period and the first root cause fault corresponding to that alarm information to the client, so that the client marks the Kubernetes nodes and pods that generated the alarms on the topology graph of the Kubernetes cluster.
Specifically, after acquiring the first root cause fault corresponding to the alarm information of the first target time period, the Kubernetes cluster alarm root cause analysis device may send the alarm information of the first target time period and the corresponding first root cause fault to the client located at the front end.
After receiving the alarm information of the first target time period and the corresponding first root cause fault, the client can automatically locate the cluster fault point and mark it in red in the cluster topology graph of the Web interface, giving a panoramic view of the current state of the cluster.
A user can log in to the cluster management and control Web interface to view the alarm information and the topology graph. An alarming node or pod is highlighted (in red, for example, although the color is not limited to red). The cluster topology graph is displayed as a panorama with the relevant abnormal nodes marked, so that the running state of the cluster is clear at a glance.
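The client-side marking can be sketched as follows. This is only an assumed data model: the topology is taken to be a mapping from node name to the pods on that node, and the `color` and `root_cause` attributes are hypothetical rendering hints, not part of the source.

```python
def mark_topology(topology, alarm_names, root_cause_names):
    """Return display attributes for every node and pod in the topology."""
    marks = {}
    for node, pods in topology.items():
        for element in (node, *pods):
            alarming = element in alarm_names or element in root_cause_names
            marks[element] = {
                # Alarming elements are highlighted; red is only one option.
                "color": "red" if alarming else "default",
                # Flag root cause faults so the UI can single them out.
                "root_cause": element in root_cause_names,
            }
    return marks


topology = {"node1": ["web-0", "web-1"], "node2": ["dns-0"]}
marks = mark_topology(topology, {"node1", "web-0"}, {"node1"})
```

Here `node1` and `web-0` would be drawn in red, with `node1` additionally flagged as the root cause, while `web-1`, `node2`, and `dns-0` keep the default style.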
It can be understood that, after the second root cause fault corresponding to the alarm information of the first target time period is acquired, the second root cause fault, together with the alarm information of the Kubernetes cluster in the first target time period and the historical alarm information of the second target time period, may also be sent to the client, so that the client marks the Kubernetes nodes and pods that generated the alarms on the topology graph of the Kubernetes cluster.
According to the embodiment of the invention, the alarm information of the Kubernetes cluster in the first target time period and the first root cause fault corresponding to that alarm information are sent to the client, so that the client marks the Kubernetes nodes and pods that generated the alarms on the topology graph of the Kubernetes cluster. The faulty nodes and pods of the cluster are clearly marked, which helps operation and maintenance personnel quickly locate the source of a fault. The operation and maintenance side no longer needs to pay excessive attention to the running state of the cluster; its troubleshooting efficiency, cluster monitoring efficiency, and fault resolution rate are improved, service interruption time is shortened as far as possible, and operation and maintenance can develop in a more intelligent and efficient direction.
By way of example, Fig. 2 shows the architecture of the Kubernetes cluster alarm root cause analysis method, Fig. 3 shows its time sequence, and Fig. 4 shows the Kubernetes cluster alarm root cause analysis method.
As shown in Figs. 2 to 4, Kafka may be used as the message middleware for receiving alarms and returning alarm association results; Redis may be used to store historical alarms; and MySQL may be used to store the alarm analysis rules. The analysis engine may use easy-rules as the rule engine and JEXL as the expression engine. The Kubernetes cluster alarm root cause analysis method runs as a microservice.
The OMC system reads the JSON strings of the alarm information from the specified Kafka topic at fixed time intervals and sends the alarm information to the alarm analysis module. The alarm analysis module acquires the alarm information from the specified Kafka topic and stores the currently received alarm JSON strings in the Redis database; queries the current and historical alarms from Redis; analyzes the causal association between alarms by querying the alarm rules in MySQL; and returns the result to the specified Kafka topic, from which the management and control interface acquires and displays the analysis result.
That is, the alarm analysis module consumes messages from the specified Kafka topic and analyzes the associations between the alarm information according to the alarm root cause analysis rule.
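The data flow of one such polling cycle can be sketched with in-memory stand-ins, so that the example runs without Kafka, Redis, or MySQL. Everything here is an assumption made for illustration: `InMemoryTopic` replaces a Kafka topic, a plain list replaces the Redis store, a dictionary replaces the rules held in MySQL, and `analyze` is whatever analysis function the caller supplies.

```python
import json
from collections import deque


class InMemoryTopic:
    """Stand-in for a Kafka topic: publish appends, poll drains."""

    def __init__(self):
        self._messages = deque()

    def publish(self, message):
        self._messages.append(message)

    def poll(self):
        batch = list(self._messages)
        self._messages.clear()
        return batch


def run_cycle(alarm_topic, result_topic, alarm_store, rules, analyze):
    """One polling cycle of the alarm analysis module."""
    # 1. Consume the alarm JSON strings received since the last cycle.
    current = [json.loads(m) for m in alarm_topic.poll()]
    # 2. Persist them; alarms already in the store count as historical.
    first_new = len(alarm_store)
    alarm_store.extend(current)
    history = alarm_store[:first_new]
    # 3. Analyze causal associations using the stored rules.
    roots = analyze(current, history, rules)
    # 4. Return the result as JSON to the result topic.
    result_topic.publish(json.dumps({"root_causes": roots}))
    return roots
```

Passing the analysis function in as a parameter keeps the message plumbing separate from the rule evaluation, which in the described architecture is delegated to the rule engine.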
The Kubernetes cluster alarm root cause analysis device provided by the present invention is described below. The device described below and the Kubernetes cluster alarm root cause analysis method described above may be referred to in correspondence with each other.
Fig. 5 is a schematic structural diagram of the Kubernetes cluster alarm root cause analysis device provided by the present invention. Based on the content of any of the above embodiments, as shown in Fig. 5, the device includes an information acquisition module 501 and an alarm analysis module 502, where:
the information acquisition module 501 is configured to acquire alarm information of a Kubernetes cluster in a first target time period based on an alarm message and/or a log of the Kubernetes cluster in the first target time period;
the alarm analysis module 502 is configured to perform alarm root cause analysis on the alarm information of the Kubernetes cluster in the first target time period based on a preset alarm root cause analysis rule, and acquire a first root cause fault corresponding to the alarm information of the first target time period.
Specifically, the information acquisition module 501 is electrically connected to the alarm analysis module 502.
The information acquisition module 501 may collect alarm information of the Kubernetes cluster from multiple sources and in multiple ways.
The alarm analysis module 502 may perform alarm root cause analysis on the alarm information of the Kubernetes cluster in the first target time period based on the association relationship field in the alarm root cause analysis rule, and determine the causal relationships between the alarms of the Kubernetes cluster in the first target time period. This yields the alarm root cause analysis result, that is, the first root cause fault corresponding to the alarm information of the Kubernetes cluster in the first target time period, where the first root cause fault is one or more alarms in the first target time period.
Optionally, the information acquisition module 501 may include:
a message acquisition unit, configured to monitor the Kubernetes cluster based on Prometheus and acquire an alarm message of the Kubernetes cluster in the first target time period;
and a message parsing unit, configured to parse the alarm message of the Kubernetes cluster in the first target time period and acquire first alarm information of the Kubernetes cluster in the first target time period.
Optionally, the information acquisition module 501 may include:
a log acquisition unit, configured to acquire a log of the Kubernetes cluster in the first target time period based on an EFK log system and/or kube-event;
and a log parsing unit, configured to parse the log of the Kubernetes cluster in the first target time period and acquire second alarm information of the Kubernetes cluster in the first target time period.
Optionally, the Kubernetes cluster alarm root cause analysis device may further include a historical alarm acquisition module, configured to acquire historical alarm information of the Kubernetes cluster in a second target time period;
the alarm analysis module 502 may be further configured to perform alarm root cause analysis on the alarm information of the Kubernetes cluster in the first target time period and the historical alarm information of the second target time period based on the alarm root cause analysis rule, and acquire a second root cause fault corresponding to the alarm information of the first target time period.
Optionally, the message acquisition unit may be specifically configured to monitor the Kubernetes cluster based on each preset alarm rule, and acquire the alarm message of the Kubernetes cluster in the first target time period;
the alarm rule is constructed by nesting a plurality of PromQL expressions.
Optionally, the Kubernetes cluster alarm root cause analysis device may further include a result sending module, configured to send the alarm information of the Kubernetes cluster in the first target time period and the first root cause fault corresponding to that alarm information to the client, so that the client marks the Kubernetes nodes and pods that generated the alarms on the topology graph of the Kubernetes cluster.
The Kubernetes cluster alarm root cause analysis device provided by the embodiment of the invention is used to execute the Kubernetes cluster alarm root cause analysis method provided by the invention. Its implementation is consistent with that of the method and achieves the same beneficial effects, which are not described again here.
The Kubernetes cluster alarm root cause analysis device is used for the Kubernetes cluster alarm root cause analysis method of each of the foregoing embodiments. Therefore, the descriptions and definitions given for the method in the foregoing embodiments may be used to understand each execution module in the embodiments of the present invention.
The embodiment of the invention acquires the alarm information of the Kubernetes cluster in the first target time period based on the alarm message and/or the log of the Kubernetes cluster in the first target time period, performs alarm root cause analysis on that alarm information based on the preset alarm root cause analysis rule, and acquires the first root cause fault corresponding to the alarm information of the first target time period. It can quickly analyze the associations among all alarms, completely collect the various alarm information of the cluster environment, and quickly locate the root fault point that produced the alarms. This improves the efficiency of Kubernetes cluster alarm root cause analysis, minimizes the time consumed by cluster operation and maintenance personnel, and reduces the manual monitoring cost of the cluster environment, so that the cluster returns to its normal state more quickly and the impact of cluster abnormalities on services is reduced. The operation and maintenance side no longer needs to pay excessive attention to the running state of the cluster; its troubleshooting efficiency, cluster monitoring efficiency, and fault resolution rate are improved, service interruption time is shortened as far as possible, and operation and maintenance can develop in a more intelligent and efficient direction.
Fig. 6 illustrates the physical structure of an electronic device. As shown in Fig. 6, the electronic device may include: a processor 610, a communications interface 620, a memory 630, and a communication bus 640, wherein the processor 610, the communications interface 620, and the memory 630 communicate with each other via the communication bus 640. The processor 610 may invoke logic instructions in the memory 630 to perform a Kubernetes cluster alarm root cause analysis method comprising: acquiring alarm information of the Kubernetes cluster in a first target time period based on an alarm message and/or a log of the Kubernetes cluster in the first target time period; and, based on a preset alarm root cause analysis rule, performing alarm root cause analysis on the alarm information of the Kubernetes cluster in the first target time period, and acquiring a first root cause fault corresponding to the alarm information of the first target time period.
In addition, the logic instructions in the memory 630 may be implemented in the form of software functional units and stored in a computer-readable storage medium when sold or used as an independent product. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
The processor 610 in the electronic device provided in the embodiment of the present application may invoke the logic instructions in the memory 630. Its implementation is consistent with that of the Kubernetes cluster alarm root cause analysis method provided in the present application and achieves the same beneficial effects, which are not described again here.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the Kubernetes cluster alarm root cause analysis method provided by the above method embodiments, the method comprising: acquiring alarm information of the Kubernetes cluster in a first target time period based on an alarm message and/or a log of the Kubernetes cluster in the first target time period; and, based on a preset alarm root cause analysis rule, performing alarm root cause analysis on the alarm information of the Kubernetes cluster in the first target time period, and acquiring a first root cause fault corresponding to the alarm information of the first target time period.
When the computer program product provided in the embodiment of the present application is executed, the Kubernetes cluster alarm root cause analysis method is implemented. Its specific implementation is consistent with that described in the foregoing method embodiments and achieves the same beneficial effects, which are not described again here.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the Kubernetes cluster alarm root cause analysis method provided by the above embodiments, the method comprising: acquiring alarm information of the Kubernetes cluster in a first target time period based on an alarm message and/or a log of the Kubernetes cluster in the first target time period; and, based on a preset alarm root cause analysis rule, performing alarm root cause analysis on the alarm information of the Kubernetes cluster in the first target time period, and acquiring a first root cause fault corresponding to the alarm information of the first target time period.
When the computer program stored on the non-transitory computer-readable storage medium provided in the embodiment of the present application is executed, the Kubernetes cluster alarm root cause analysis method is implemented. Its specific implementation is consistent with that described in the foregoing method embodiments and achieves the same beneficial effects, which are not described again here.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A Kubernetes cluster alarm root cause analysis method is characterized by comprising the following steps:
acquiring alarm information of a Kubernetes cluster in a first target time period based on an alarm message and/or a log of the Kubernetes cluster in the first target time period;
and based on a preset alarm root cause analysis rule, carrying out alarm root cause analysis on the alarm information of the Kubernetes cluster in the first target time period, and acquiring a first root cause fault corresponding to the alarm information in the first target time period.
2. The Kubernetes cluster alarm root cause analysis method according to claim 1, wherein acquiring the alarm information of the Kubernetes cluster in the first target time period based on the alarm message of the Kubernetes cluster in the first target time period specifically includes:
monitoring the Kubernetes cluster based on Prometheus, and acquiring an alarm message of the Kubernetes cluster in the first target time period;
and analyzing the alarm message of the Kubernetes cluster in the first target time period to acquire first alarm information of the Kubernetes cluster in the first target time period.
3. The Kubernetes cluster alarm root cause analysis method according to claim 1, wherein acquiring the alarm information of the Kubernetes cluster in the first target time period based on the log of the Kubernetes cluster in the first target time period specifically includes:
acquiring a log of the Kubernetes cluster in the first target time period based on an EFK log system and/or kube-event;
and analyzing the log of the Kubernetes cluster in the first target time period to acquire second alarm information of the Kubernetes cluster in the first target time period.
4. The Kubernetes cluster alarm root cause analysis method according to claim 1, wherein after acquiring the alarm information of the Kubernetes cluster in the first target time period based on the alarm message and/or the log of the Kubernetes cluster in the first target time period, the method further comprises:
acquiring historical alarm information of a second target time period of the Kubernetes cluster;
and performing alarm root cause analysis on the alarm information of the Kubernetes cluster in the first target time period and the historical alarm information of the second target time period based on the alarm root cause analysis rule, and acquiring a second root cause fault corresponding to the alarm information of the first target time period.
5. The Kubernetes cluster alarm root cause analysis method according to claim 2, wherein the monitoring of the Kubernetes cluster based on Prometheus to acquire the alarm message of the Kubernetes cluster in the first target time period specifically includes:
monitoring the Kubernetes cluster based on each preset alarm rule to acquire an alarm message of the Kubernetes cluster in a first target time period;
the alarm rule is constructed by nesting a plurality of PromQL expressions.
6. The Kubernetes cluster alarm root cause analysis method according to any one of claims 1 to 5, wherein after the performing alarm root cause analysis on the alarm information of the Kubernetes cluster in the first target time period based on a preset alarm root cause analysis rule and acquiring a first root cause fault corresponding to the alarm information of the first target time period, the method further includes:
and sending the alarm information of the Kubernetes cluster in the first target time period and the first root cause fault corresponding to the alarm information of the first target time period to a client, so that the client labels the Kubernetes nodes and pods generating alarms based on a topological graph of the Kubernetes cluster.
7. A Kubernetes cluster alarm root cause analysis device is characterized by comprising:
the information acquisition module is used for acquiring the alarm information of the Kubernetes cluster in the first target time period based on the alarm message and/or the log of the Kubernetes cluster in the first target time period;
and the alarm analysis module is used for carrying out alarm root cause analysis on the alarm information of the Kubernetes cluster in the first target time period based on a preset alarm root cause analysis rule, and acquiring a first root cause fault corresponding to the alarm information of the first target time period.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the Kubernetes cluster alarm root cause analysis method according to any one of claims 1 to 6 are implemented when the program is executed by the processor.
9. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, performs the steps of the Kubernetes cluster alarm root cause analysis method according to any one of claims 1 to 6.
10. A computer program product comprising a computer program, wherein the computer program when executed by a processor implements the steps of the Kubernetes cluster alarm root cause analysis method according to any one of claims 1 to 6.
CN202111620209.4A 2021-12-27 2021-12-27 Kubernetes cluster alarm root cause analysis method and device Pending CN114356499A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111620209.4A CN114356499A (en) 2021-12-27 2021-12-27 Kubernetes cluster alarm root cause analysis method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111620209.4A CN114356499A (en) 2021-12-27 2021-12-27 Kubernetes cluster alarm root cause analysis method and device

Publications (1)

Publication Number Publication Date
CN114356499A true CN114356499A (en) 2022-04-15

Family

ID=81103713

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111620209.4A Pending CN114356499A (en) 2021-12-27 2021-12-27 Kubernetes cluster alarm root cause analysis method and device

Country Status (1)

Country Link
CN (1) CN114356499A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114500249A (en) * 2022-04-18 2022-05-13 中国工商银行股份有限公司 Root cause positioning method and device
CN114500249B (en) * 2022-04-18 2022-07-08 中国工商银行股份有限公司 Root cause positioning method and device
CN115766402A (en) * 2023-01-09 2023-03-07 苏州浪潮智能科技有限公司 Method and device for filtering fault root cause of server, storage medium and electronic device
CN115766402B (en) * 2023-01-09 2023-04-28 苏州浪潮智能科技有限公司 Method and device for filtering server fault root cause, storage medium and electronic device
CN115827398A (en) * 2023-02-24 2023-03-21 天翼云科技有限公司 Method and device for calculating alarm information component value, electronic equipment and storage medium
CN115827398B (en) * 2023-02-24 2023-06-23 天翼云科技有限公司 Method and device for calculating component value of alarm information, electronic equipment and storage medium
CN116932148A (en) * 2023-09-19 2023-10-24 山东浪潮数据库技术有限公司 Problem diagnosis system and method based on AI
CN116932148B (en) * 2023-09-19 2024-01-19 山东浪潮数据库技术有限公司 Problem diagnosis system and method based on AI

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination