CN111782345B

CN111782345B - Container cloud platform log collection and analysis alarm method

Info

Publication number: CN111782345B
Application number: CN202010644149.9A
Authority: CN
Inventors: 岳志军; 尚尔卓; 岳鑫
Original assignee: Zhengzhou Dvelop Technology Co ltd
Current assignee: Zhengzhou Dvelop Technology Co ltd
Priority date: 2020-07-07
Filing date: 2020-07-07
Publication date: 2022-10-28
Anticipated expiration: 2040-07-07
Also published as: CN111782345A

Abstract

The invention relates to a container cloud platform log collection and analysis alarm method based on a Kubernetes container cloud platform. An Elasticsearch configuration file is written, Elasticsearch is deployed through a StatefulSet (stateful replica set), and a StorageClass is used to mount the log storage, so that cluster logs, container logs and application logs can be collected. The log data is processed and transmitted to the container cloud platform through a data interface; the container cloud platform performs correlation analysis on the container logs and the application logs according to an established log feature library and, when an alarm condition is met, presents report functions, alarm notifications and policy configuration on a front-end interface. The method can send alarm information to alarm recipients and users in time, thereby reducing the difficulty of log analysis, lowering system operation and maintenance costs, and making logs more readable for development and operation and maintenance personnel.

Description

Container cloud platform log collection and analysis alarm method

(I) Technical field:

The invention relates to a cloud platform log collection and analysis alarm method, and in particular to a container cloud platform log collection and analysis alarm method.

(II) Background art:

Kubernetes, K8s for short, has become the de facto standard for container orchestration. Both Docker (officially) and Mesos support Kubernetes, it has been put into production at a large number of enterprises, and heavyweight platform users such as GitHub, eBay and Bloomberg have announced the migration of their services to Kubernetes. With K8s so widely used, developing the corresponding supporting functions, such as application monitoring and log analysis, becomes increasingly important, because these functions affect how well each project lands in cloud computing.

With the rapid development of application containerization, a container platform generates a large amount of requirements analysis and log organization work in order to meet the needs of development and operation and maintenance personnel working with containerization technology. However, traditional management and operation and maintenance of a Kubernetes cluster is complex and tedious, and collecting and analyzing massive container cloud platform log data carries a certain time cost and technical threshold. Log analysis on traditional containerization platforms usually focuses only on displaying the container logs of project applications and often ignores the correlation between container logs and application logs; faced with massive log data, development and operation and maintenance personnel find it difficult to obtain useful data promptly in order to solve the various problems that arise in production and test environments.

(III) Summary of the invention:

The technical problem to be solved by the invention is to provide a container cloud platform log collection method that can collect cluster logs, container logs and application logs, and a container cloud platform log analysis and alarm method that can send alarm information to operation and maintenance managers, project developers, platform administrators and users in time, so that log analysis becomes less difficult, system operation and maintenance costs are reduced, and logs become more readable for development and operation and maintenance personnel.

The technical scheme of the invention is as follows:

A container cloud platform log collection method, based on a Kubernetes container cloud platform, comprises the following steps:

step 1.1, writing an Elasticsearch (a search server based on Lucene) configuration file, deploying Elasticsearch through a StatefulSet (stateful replica set), using a StorageClass to mount the log storage, the StorageClass generating a PVC (Persistent Volume Claim), and generating a domain name address for Elasticsearch for other components to reference;

step 1.2, collecting cluster logs: writing a Filebeat (a lightweight log collector implemented in Go) configuration file, deploying Filebeat as a DaemonSet (daemon set), and collecting the /var/log/messages log data of each node;

step 1.3, collecting container logs: first, writing an eventrouter (a library for routing and filtering events) container configuration file and mounting the /data/log/eventrouter directory; then starting a Filebeat container that mounts the /data/log/eventrouter directory, and using Filebeat to collect the log data under the /data/log/eventrouter directory;

step 1.4, collecting application logs: in the application's deployment, starting another Filebeat container that mounts the application log directory, and using Filebeat to collect the log data under the application project directory.

In step 1.1, the PVC is based on NFS, CephFS or the like.
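Step 1.1 can be illustrated with a manifest assembled in Python. This is only a minimal sketch under stated assumptions, not the configuration used by the invention: the resource name `elasticsearch`, the namespace `logging`, the StorageClass name `nfs-storage`, the image tag, the volume size and the `cluster.local` cluster domain are all hypothetical. It shows how a StatefulSet's `volumeClaimTemplates` names a StorageClass so that a PVC is generated for each replica, and how the governing Service yields a stable domain name.

```python
import json


def elasticsearch_statefulset(name="elasticsearch", namespace="logging",
                              storage_class="nfs-storage", size="10Gi",
                              replicas=1):
    """Build a StatefulSet manifest whose PVCs come from a StorageClass.

    All default values are hypothetical placeholders.
    """
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Governing (headless) Service that provides the DNS name.
            "serviceName": name,
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{
                    "name": name,
                    "image": "elasticsearch:7.8.0",  # hypothetical tag
                    "ports": [{"containerPort": 9200, "name": "http"}],
                    "volumeMounts": [{
                        "name": "es-data",
                        "mountPath": "/usr/share/elasticsearch/data",
                    }],
                }]},
            },
            # One PVC per replica, provisioned by the StorageClass
            # (backed by NFS, CephFS or similar).
            "volumeClaimTemplates": [{
                "metadata": {"name": "es-data"},
                "spec": {
                    "accessModes": ["ReadWriteOnce"],
                    "storageClassName": storage_class,
                    "resources": {"requests": {"storage": size}},
                },
            }],
        },
    }


def service_dns(name="elasticsearch", namespace="logging"):
    """Domain name other components can reference (assumes cluster.local)."""
    return f"{name}.{namespace}.svc.cluster.local"


if __name__ == "__main__":
    print(json.dumps(elasticsearch_statefulset(), indent=2))
    print(service_dns())
```

The printed JSON could be passed to a cluster client for creation; Filebeat and Logstash would then reference the domain name returned by `service_dns`.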

A container cloud platform log analysis and alarm method is characterized by firstly executing the container cloud platform log collection method and then carrying out the following steps:

step 2, processing the collected log data;

step 3, transmitting the processed log data to a container cloud platform through a data interface;

step 4, establishing a log feature library;

step 5, the container cloud platform performs correlation analysis on the container log and the application log according to the log feature library;

step 6, when the correlation analysis result meets an alarm condition, presenting report functions (such as generating, importing and exporting reports), alarm notifications and policy configuration on a front-end interface.

Step 2 comprises the following steps:

step 2.1, configuring a plurality of Logstash (a platform for transporting, processing, managing and searching application logs and events) nodes to run in parallel, and uploading the cluster logs, the container logs and the application logs to Logstash;

step 2.2, Logstash filtering the data of the cluster logs, the container logs and the application logs respectively and uploading the filtered log data to Elasticsearch, which indexes and stores it.

Step 2.2 specifically comprises the following steps:

step 2.2.1, writing a Logstash configuration file and cleaning the collected cluster log, container log and application log data respectively to form a standardized log record format; the standardized fields include log timestamp, pod IP address, pod name, associated label, type, message, error type and the like;

step 2.2.2, Elasticsearch exposing an HTTP API externally, so that users can search and query the cleaned log data as needed, including by setting keywords (such as associated label, time and error type), and quickly trace a log back to the application and system to which it belongs.
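The standardized record of step 2.2.1 can be pictured with a small Python function. In the method itself the cleaning is done by a Logstash configuration file; the sketch below only illustrates the target record shape. The input keys, the output key names and the rule for deriving the error type (the first FATAL, ERROR or WARN token in the message) are assumptions made for the example.

```python
import re
from datetime import datetime, timezone

# Assumed rule: the error type is the first severity token in the message.
LEVEL_RE = re.compile(r"\b(FATAL|ERROR|WARN)\b")


def normalize(raw, log_type):
    """Map a raw collected record onto the standardized fields.

    raw      -- dict as delivered by the collector (hypothetical shape)
    log_type -- "cluster", "container" or "application"
    """
    ts = raw.get("timestamp")
    if isinstance(ts, (int, float)):
        # Epoch seconds are converted to an ISO 8601 UTC string.
        ts = datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()
    message = str(raw.get("message", "")).strip()
    match = LEVEL_RE.search(message)
    return {
        "timestamp": ts,
        "pod_ip": raw.get("pod_ip"),
        "pod_name": raw.get("pod_name"),
        "label": raw.get("label"),
        "type": log_type,
        "message": message,
        "error_type": match.group(1) if match else None,
    }


if __name__ == "__main__":
    sample = {"timestamp": 0, "pod_name": "web-0",
              "message": " ERROR db connection refused\n"}
    print(normalize(sample, "application"))
```

Keeping every log source in one record shape is what allows a single keyword query (label, time, error type) to cover cluster, container and application logs.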

The data interface in step 3 is an HTTP API interface.

In step 4, in order to locate log types accurately, the log categories are classified by fault level and a log feature library is established, comprising a first-level fault log feature library, a second-level fault log feature library and a third-level fault log feature library;

the first-level fault logs mainly refer to Kubernetes component service fault logs and cluster fault logs, including: node state NotReady, and critical component Unhealthy;

the second-level fault logs mainly refer to pod state fault logs, including: stuck in Pending, stuck in Waiting, stuck in ContainerCreating, stuck in ImagePullBackOff, stuck in CrashLoopBackOff, in Error, stuck in Terminating, and in Unknown;

after an event alarm keyword is detected, the system outputs the alarm log data and gives the fault level, the specific cause and handling feedback in the alarm information column:

(1) The keyword %Pending% is involved: this means that the Pod's YAML file has been submitted to Kubernetes and the API object has been created and saved in etcd, but some containers in the Pod could not be created successfully for some reason, such as unsuccessful scheduling. Possible causes are: insufficient resources (no node in the cluster satisfies the CPU, memory, GPU or other resources requested by the Pod); or the HostPort is already occupied (it is usually recommended to use a Service to expose the service port externally).

(2) The keyword %Waiting% or %ContainerCreating% is involved: based on the log output data, the following causes are matched automatically:

a. Image pull failure, for example an incorrectly configured image address, an overseas image registry (gcr.io) that cannot be pulled from, an incorrectly configured private-image key, or an image so large that the pull times out (the kubelet options --image-pull-progress-deadline and --runtime-request-timeout can be adjusted appropriately).

b. CNI network error; the configuration of the CNI network plug-in generally needs to be checked, for example when the Pod network cannot be configured or an IP address cannot be assigned.

c. The container cannot be started; check whether the correct image was built and whether the correct container parameters were configured.

d. Failed create pod sandbox; check the kubelet log, as the cause may be an input/output error.

(3) The keyword %ImagePullBackOff% is involved: this is usually caused by an incorrectly configured image name or an incorrectly configured key for a private image.

(4) The keyword %CrashLoopBackOff% is involved: this means that the container did start but then exited abnormally. Possible causes are that the container process exited, or that a health check failed and the container was terminated.

(5) The keyword %Error% is involved: this indicates that an error occurred while the Pod was starting. Common causes are: a dependent ConfigMap, Secret or PV does not exist; the requested resources exceed limits set by the administrator, for example a LimitRange; the security policy of the cluster is violated, for example a PodSecurityPolicy; or the container has no permission to operate on resources in the cluster, for example after RBAC is enabled a role binding must be configured for the ServiceAccount.

(6) The keyword %Terminating% or %Unknown% is involved: this is typically because a Node has lost contact and the Pods running on it have not been deleted. The abnormal Unknown state means that kubelet can no longer report the Pod status to kube-apiserver, which is likely a communication problem between the master and the node (Master and Kubelet).
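The keyword matching described in items (1) to (6) can be sketched as a lookup table. The code below is an illustrative assumption about how such matching might be written, not the invention's implementation: the `%keyword%` notation is treated as a plain substring test, the cause strings are condensed from the items above, and the names `SECOND_LEVEL_FEATURES` and `diagnose` are hypothetical.

```python
# Ordered list: more specific keywords first, the generic "Error" last,
# because a substring test on "Error" would otherwise shadow other entries.
SECOND_LEVEL_FEATURES = [
    ("ImagePullBackOff",
     "image name or private-image key configured incorrectly"),
    ("CrashLoopBackOff",
     "container started but exited abnormally "
     "(process exit or failed health check)"),
    ("ContainerCreating",
     "image pull failure, CNI network error, wrong image or container "
     "parameters, or pod sandbox creation failure"),
    ("Waiting",
     "image pull failure, CNI network error, wrong image or container "
     "parameters, or pod sandbox creation failure"),
    ("Pending",
     "scheduling failed: insufficient resources or HostPort occupied"),
    ("Terminating",
     "node lost contact and its pods were not deleted"),
    ("Unknown",
     "kubelet cannot report pod status to kube-apiserver"),
    ("Error",
     "missing ConfigMap/Secret/PV, limit exceeded, security policy "
     "violation, or missing role binding under RBAC"),
]


def diagnose(status_text):
    """Return the matched keyword and likely cause, or None if no match."""
    for keyword, cause in SECOND_LEVEL_FEATURES:
        if keyword in status_text:
            return {"level": 2, "keyword": keyword, "cause": cause}
    return None


if __name__ == "__main__":
    print(diagnose("pod web-0 is in CrashLoopBackOff"))
    print(diagnose("pod web-0 is Running"))
```

The returned dictionary corresponds to the fault level, specific cause and handling feedback that the alarm information column is said to show.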

Third-level fault logs: these mainly concern service logs. When the number of ERROR entries in a service log exceeds a preset value, the third-level fault alarm mechanism is triggered; an ERROR state generally means that a serious bug has occurred in an online service and must be resolved by developers.
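The three fault levels can be summarized in a short classifier. This is a sketch under assumptions: keywords are matched as substrings, an ERROR entry is any line containing the token `ERROR`, and the default threshold of 20 is the typical value the description mentions for third-level alarms; the function names are hypothetical.

```python
FIRST_LEVEL = ("NotReady", "Unhealthy")
SECOND_LEVEL = ("Pending", "Waiting", "ContainerCreating",
                "ImagePullBackOff", "CrashLoopBackOff", "Error",
                "Terminating", "Unknown")


def classify_event(text):
    """Return 1 for cluster/component faults, 2 for pod state faults."""
    if any(keyword in text for keyword in FIRST_LEVEL):
        return 1
    if any(keyword in text for keyword in SECOND_LEVEL):
        return 2
    return None


def third_level_triggered(service_log_lines, threshold=20):
    """True when the number of ERROR lines exceeds the preset value."""
    errors = sum(1 for line in service_log_lines if "ERROR" in line)
    return errors > threshold


if __name__ == "__main__":
    print(classify_event("node-1 NotReady"))
    print(classify_event("pod web-0 ImagePullBackOff"))
    print(third_level_triggered(["ERROR timeout"] * 21))
```

First-level keywords are checked before second-level ones so that a cluster-wide fault is never reported at the lower level.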

After the container logs and application logs are screened using the log feature library, a fault list can be generated automatically. The container cloud platform sends alarm information to the alarm recipients and users according to the fault level, and an alarm recipient can run a full-text search on the platform by timestamp or exact value to obtain the detailed log information.
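Retrieval by timestamp or exact value could be expressed as an Elasticsearch query body. The sketch below builds such a body with the standard `bool`, `range` and `term` query clauses; the field names `timestamp` and `pod_name` are assumptions about the index mapping, and a `term` clause presumes the field is indexed as an exact (keyword) value. The body would be sent to the index's `_search` endpoint over the HTTP API.

```python
import json


def build_search_body(start=None, end=None, field=None, value=None, size=50):
    """Build a query body filtering by time range and/or one exact value."""
    filters = []
    if start or end:
        bounds = {}
        if start:
            bounds["gte"] = start
        if end:
            bounds["lte"] = end
        # Field name "timestamp" is an assumed mapping.
        filters.append({"range": {"timestamp": bounds}})
    if field is not None and value is not None:
        filters.append({"term": {field: value}})
    return {
        "size": size,
        "sort": [{"timestamp": {"order": "desc"}}],  # newest first
        "query": {"bool": {"filter": filters}},
    }


if __name__ == "__main__":
    body = build_search_body(start="2020-07-07T00:00:00Z",
                             field="pod_name", value="web-0")
    print(json.dumps(body, indent=2))
```

Using filter clauses rather than scored clauses fits this use: the recipient wants every matching log line in time order, not a relevance ranking.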

Alarm recipients are divided into three categories: operation and maintenance managers, project developers and platform administrators. The platform administrator has the authority to create (name, telephone number and mailbox), modify and delete alarm recipients, to assign the alarm recipients corresponding to each alarm level, and to receive copies of alarm tasks.

When a first-level fault alarm is triggered, the container cloud platform notifies the operation and maintenance manager by short message or email and sends a copy to the platform administrator;

when a second-level fault alarm is triggered, the container cloud platform notifies the project developer by short message or email and sends copies to the operation and maintenance manager and the platform administrator;

when a third-level fault alarm is triggered, because the online service log is updated continuously, alarming and receiving are handled by an alarm script executed in the server background: usually, when the number of ERROR entries exceeds 20, the container cloud platform sends the first alarm immediately, notifying the project developer by short message or email and sending a copy to the platform administrator; the alarm frequency is then adjusted through the sleep interval in the script to keep monitoring alarms timely.
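The notification rules and the background alarm script can be sketched together. This is an illustration under assumptions, not the script used by the invention: the role identifiers, the `count_errors` and `send` callables, the 60-second interval and the bounded number of rounds are all hypothetical, and the sleep function is injectable so the loop can be exercised without waiting.

```python
import time

# Who is notified and who receives a copy, per fault level.
ROUTING = {
    1: {"notify": "ops_manager", "cc": ["platform_admin"]},
    2: {"notify": "project_developer",
        "cc": ["ops_manager", "platform_admin"]},
    3: {"notify": "project_developer", "cc": ["platform_admin"]},
}


def route_alarm(level, detail):
    """Build the alarm message for a fault level."""
    rule = ROUTING[level]
    return {"level": level, "to": rule["notify"],
            "cc": list(rule["cc"]), "detail": detail}


def watch_errors(count_errors, send, threshold=20, interval=60,
                 rounds=3, sleep=time.sleep):
    """Poll the ERROR count; alarm at once when exceeded, then sleep.

    The sleep interval throttles how often repeated alarms are sent.
    Returns the number of alarms sent.
    """
    sent = 0
    for _ in range(rounds):
        n = count_errors()
        if n > threshold:
            send(route_alarm(3, f"{n} ERROR entries in service log"))
            sent += 1
        sleep(interval)
    return sent


if __name__ == "__main__":
    counts = iter([5, 25, 30])
    watch_errors(lambda: next(counts), print, sleep=lambda s: None)
```

A real background script would loop indefinitely and deliver the message by short message or email; the bounded `rounds` parameter exists only to keep the example finite.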

Users can view error logs in real time on the preview interface of the container cloud platform, which makes it easy to identify the failure mode of a service promptly, arrive at a solution more quickly, and keep online services stable.

The invention has the beneficial effects that:

1. The invention collects cluster logs, container logs and application logs; the container cloud platform performs correlation analysis on and displays the container logs and application logs according to the log feature library and sends alarm information to operation and maintenance managers, project developers, platform administrators and users in time according to the fault level, so that development and operation and maintenance personnel can obtain useful data promptly and solve the various problems in production and test environments in time.

2. The invention establishes a log feature library and classifies log categories by fault level, which realizes semi-automatic operation, maintenance and management of the container cloud platform, reduces the difficulty of log analysis, and lowers the cost of log data analysis and system operation and maintenance.

3. The invention comprehensively collects cluster log, container log and application log data, and the container cloud platform performs correlation analysis on and displays the logs according to the log feature library, which makes logs more readable for development and operation and maintenance personnel.

(IV) Specific embodiment:

The container cloud platform log collection method is based on a Kubernetes container cloud platform and comprises the following steps:

step 1.3, collecting container logs: first, writing an eventrouter (a library for routing and filtering events) container configuration file and mounting the /data/log/eventrouter directory; then starting a Filebeat container that mounts the /data/log/eventrouter directory, and using Filebeat to collect the log data under the /data/log/eventrouter directory;

In step 1.1, the PVC is based on NFS, CephFS or the like.
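Step 1.3 of this embodiment, in which eventrouter writes to a directory that a Filebeat container also mounts, can be sketched as a pod specification built in Python. The sketch assumes the two containers run in one pod and share an `emptyDir` volume; the source only states that both mount the /data/log/eventrouter directory, so the volume type, the volume name and the image names are hypothetical.

```python
def eventrouter_pod_spec(log_dir="/data/log/eventrouter"):
    """Pod spec in which eventrouter and Filebeat share one log directory."""
    volume_name = "eventrouter-logs"  # hypothetical volume name
    return {
        "containers": [
            {
                # Writes routed Kubernetes events under log_dir.
                "name": "eventrouter",
                "image": "eventrouter:example",  # hypothetical image
                "volumeMounts": [
                    {"name": volume_name, "mountPath": log_dir},
                ],
            },
            {
                # Reads the same directory and ships the log data.
                "name": "filebeat",
                "image": "filebeat:example",  # hypothetical image
                "volumeMounts": [
                    {"name": volume_name, "mountPath": log_dir,
                     "readOnly": True},
                ],
            },
        ],
        # Assumption: an emptyDir shared by both containers.
        "volumes": [{"name": volume_name, "emptyDir": {}}],
    }


if __name__ == "__main__":
    spec = eventrouter_pod_spec()
    for container in spec["containers"]:
        print(container["name"], container["volumeMounts"][0]["mountPath"])
```

Mounting the directory read-only in the Filebeat container reflects its role as a pure collector; the same pattern applies to the application log directory in step 1.4.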

The container cloud platform log analysis and alarm method comprises the following steps:

step 1, executing the container cloud platform log collection method;

step 2, processing the collected log data;

step 4, establishing a log feature library;

step 6, when the correlation analysis result meets an alarm condition, presenting report functions (such as generating, importing and exporting reports), alarm notifications and policy configuration on a front-end interface.

The step 2 comprises the following steps:

The step 2.2 specifically comprises the following steps:

The data interface in step 3 is an HTTP API interface.

the second-level fault logs mainly refer to pod state fault logs, including: stuck in Pending, stuck in Waiting, stuck in ContainerCreating, stuck in ImagePullBackOff, stuck in CrashLoopBackOff, in Error, stuck in Terminating, and in Unknown;

(1) The keyword %Pending% is involved: this means that the Pod's YAML file has been submitted to Kubernetes and the API object has been created and saved in etcd, but some containers in the Pod could not be created successfully for some reason, such as unsuccessful scheduling. Possible causes are: insufficient resources (no node in the cluster satisfies the CPU, memory, GPU or other resources requested by the Pod); or the HostPort is already occupied (it is usually recommended to use a Service to expose the service port externally).

(2) The keyword %Waiting% or %ContainerCreating% is involved: based on the log output data, the following causes are matched automatically:

CNI network error; the configuration of the CNI network plug-in generally needs to be checked, for example when the Pod network cannot be configured or an IP address cannot be assigned.

The container cannot be started; check whether the correct image was built and whether the correct container parameters were configured.

Failed create pod sandbox; check the kubelet log, as the cause may be an input/output error.

(4) The keyword %CrashLoopBackOff% is involved: this means that the container did start but then exited abnormally. Possible causes are that the container process exited, or that a health check failed and the container was terminated.

(5) The keyword %Error% is involved: this indicates that an error occurred while the Pod was starting. Common causes are: a dependent ConfigMap, Secret or PV does not exist; the requested resources exceed limits set by the administrator, for example a LimitRange; the security policy of the cluster is violated, for example a PodSecurityPolicy; or the container has no permission to operate on resources in the cluster, for example after RBAC is enabled a role binding must be configured for the ServiceAccount.

Alarm recipients are divided into three categories: operation and maintenance managers, project developers and platform administrators. The platform administrator has the authority to create (name, telephone number and mailbox), modify and delete alarm recipients, to assign the alarm recipients corresponding to each alarm level, and to receive copies of alarm tasks.

When a first-level fault alarm is triggered, the container cloud platform notifies the operation and maintenance manager by short message or email and sends a copy to the platform administrator;

Claims

1. A container cloud platform log analysis and alarm method, characterized in that the method is based on a Kubernetes container cloud platform and comprises the following steps:

step 1.1, writing an Elasticsearch configuration file, deploying Elasticsearch through a StatefulSet (stateful replica set), using a StorageClass to mount the log storage, the StorageClass generating a PVC, and generating a domain name address for Elasticsearch;

step 1.2, collecting cluster logs: writing a Filebeat configuration file, deploying Filebeat as a DaemonSet (daemon set), and collecting the /var/log/messages log data of each node;

step 1.3, collecting container logs: first, writing an eventrouter container configuration file and mounting the /data/log/eventrouter directory; then starting a Filebeat container that mounts the /data/log/eventrouter directory, and using Filebeat to collect the log data under the /data/log/eventrouter directory;

step 1.4, collecting application logs: in the application's deployment, starting another Filebeat container that mounts the application log directory, and using Filebeat to collect the log data under the application project directory;

step 2, processing the collected log data;

step 4, establishing a log feature library;

step 6, when the correlation analysis result reaches the alarm condition, displaying the report function, the alarm notice and the strategy configuration on a front-end interface;

in step 4, the log categories are classified by fault level and a log feature library is established, the log feature library comprising a first-level fault log feature library, a second-level fault log feature library and a third-level fault log feature library;

the first-level fault logs refer to Kubernetes component service fault logs and cluster fault logs;

the second-level fault logs are pod state fault logs;

third-level fault logs: when the number of ERROR entries in a service log exceeds a preset value, the third-level fault alarm mechanism is triggered;

after the container logs and application logs are screened using the log feature library, a fault list is generated automatically, the container cloud platform sends alarm information to the alarm recipients and users according to the fault level, and an alarm recipient runs a full-text search on the platform by timestamp or exact value to obtain the detailed log information.

2. The container cloud platform log analysis and alarm method according to claim 1, wherein: the step 2 comprises the following steps:

step 2.1, configuring a plurality of Logstash nodes to run in parallel, and uploading the cluster logs, the container logs and the application logs to Logstash;

3. The container cloud platform log analysis and alarm method according to claim 2, wherein: the step 2.2 specifically comprises the following steps:

step 2.2.1, writing a Logstash configuration file and cleaning the collected cluster log, container log and application log data respectively to form a standardized log record format;

step 2.2.2, Elasticsearch exposing an HTTP API externally, so that users search and query the cleaned log data as needed and quickly trace a log back to the application and system to which it belongs.

4. The container cloud platform log analysis and alarm method according to claim 1, wherein: the data interface in step 3 is an HTTP API interface.

5. The container cloud platform log analysis and alarm method according to claim 1, wherein: the alarm recipients are divided into three categories: operation and maintenance managers, project developers and platform administrators;

when a third-level fault alarm is triggered, alarming and receiving are handled by an alarm script executed in the server background; when the number of ERROR entries exceeds 20, the container cloud platform sends the first alarm immediately, notifying the project developer by short message or email and sending a copy to the platform administrator.

6. The container cloud platform log analysis and alarm method according to claim 1, wherein: in step 1.1, the PVC is a PVC based on NFS and CephFS.