CN111782345A

CN111782345A - Container cloud platform log collection and analysis alarm method

Info

Publication number: CN111782345A
Application number: CN202010644149.9A
Authority: CN
Inventors: 岳志军; 尚尔卓; 岳鑫
Original assignee: Zhengzhou Dvelop Technology Co ltd
Current assignee: Zhengzhou Dvelop Technology Co ltd
Priority date: 2020-07-07
Filing date: 2020-07-07
Publication date: 2020-10-16
Anticipated expiration: 2040-07-07
Also published as: CN111782345B

Abstract

The invention relates to a container cloud platform log collection and analysis alarm method. Based on a Kubernetes container cloud platform, the method writes an Elasticsearch configuration file, deploys Elasticsearch through a StatefulSet (stateful replica set), and uses a StorageClass to mount log storage, so as to collect cluster logs, container logs and application logs. The log data is processed and transmitted to the container cloud platform through a data interface; the platform performs correlation analysis on the container logs and application logs according to an established log feature library and, when an alarm condition is met, displays the report functions, alarm notification and policy configuration on a front-end interface. The method sends alarm information to alarm recipients and users in a timely manner, reduces the difficulty of log analysis, lowers system operation and maintenance costs, and makes the logs more readable for development and operations personnel.

Description

Container cloud platform log collection and analysis alarm method

(I) technical field:

The invention relates to a cloud platform log collection and analysis alarm method, in particular to a log collection and analysis alarm method for a container cloud platform.

(II) background art:

Kubernetes, abbreviated K8s, has become the de facto standard for container scheduling. Both Docker and Mesos officially support Kubernetes, it has been put into production in a large number of enterprises, and heavyweight platform customers such as GitHub, eBay and Bloomberg have announced that they are migrating services to Kubernetes. As K8s is adopted more widely, supporting functions such as application monitoring and log analysis become increasingly important, and these functions affect whether cloud computing projects can actually be put into production.

With the rapid development of containerization, container platforms generate a large amount of requirement analysis and log organization work in order to meet the needs of development and operations personnel who work with container technology. However, traditional management and operation of a Kubernetes cluster is complex and tedious, and collecting and analyzing massive container cloud platform log data carries a real time cost and technical threshold. Log analysis on traditional container platforms usually only displays the container logs of the project application and often ignores the correlation between container logs and application logs. Faced with massive log data, development and operations personnel find it difficult to obtain useful data promptly and so to solve the problems that arise in production and test environments.

(III) the invention content:

The technical problem to be solved by the invention is to provide a container cloud platform log collection method that collects cluster logs, container logs and application logs, and a container cloud platform log analysis and alarm method that sends alarm information to operation and maintenance managers, project developers, platform administrators and users in a timely manner, thereby reducing the difficulty of log analysis, lowering system operation and maintenance costs, and making the logs more readable for development and operations personnel.

The technical scheme of the invention is as follows:

A container cloud platform log collection method is based on a Kubernetes container cloud platform and comprises the following steps:

step 1.1, writing an Elasticsearch (a search server based on Lucene) configuration file, deploying Elasticsearch through a StatefulSet (stateful replica set), using a StorageClass to mount log storage, the StorageClass generating a PVC (Persistent Volume Claim), and generating a domain name address by which Elasticsearch can be referenced;

step 1.2, collection of cluster logs: writing a configuration file for Filebeat (a lightweight log collector implemented in Go), deploying it as a DaemonSet (daemon process set), and collecting the /var/log/messages log data of each node;

step 1.3, collection of container logs: first writing an eventrouter (a library for routing and filtering events) container configuration file and mounting the /data/log/eventrouter directory; then starting a Filebeat container that mounts the /data/log/eventrouter directory, and using Filebeat to collect the log data under that directory;

step 1.4, collection of application logs: in the application deployment container, starting another Filebeat container that mounts the application log directory, and using Filebeat to collect the log data under the application project directory.

In step 1.1, the PVC is backed by NFS, CephFS or similar storage.
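Step 1.1 can be illustrated with a minimal sketch that builds a StatefulSet manifest as a plain Python dictionary. The names `elasticsearch`, `nfs-logs`, the image tag, the replica count and the storage size are assumptions chosen for illustration and do not come from the patent; the DNS name format assumes the default `cluster.local` cluster domain and a headless Service named after `serviceName`.

```python
import json


def elasticsearch_statefulset(name="elasticsearch", replicas=3,
                              storage_class="nfs-logs", storage_size="50Gi",
                              image="elasticsearch:7.8.0"):
    """Build a StatefulSet manifest (all default values are hypothetical).

    The volumeClaimTemplates entry asks the named StorageClass to
    provision one PVC per replica for log storage.
    """
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": name},
        "spec": {
            # Headless Service that gives each pod a stable domain name.
            "serviceName": name,
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{
                    "name": name,
                    "image": image,
                    "ports": [{"containerPort": 9200, "name": "http"}],
                    "volumeMounts": [{
                        "name": "data",
                        "mountPath": "/usr/share/elasticsearch/data",
                    }],
                }]},
            },
            "volumeClaimTemplates": [{
                "metadata": {"name": "data"},
                "spec": {
                    "accessModes": ["ReadWriteOnce"],
                    "storageClassName": storage_class,
                    "resources": {"requests": {"storage": storage_size}},
                },
            }],
        },
    }


def pod_dns_name(manifest, ordinal, namespace="default"):
    """Domain name under which one Elasticsearch replica can be referenced."""
    name = manifest["metadata"]["name"]
    service = manifest["spec"]["serviceName"]
    return f"{name}-{ordinal}.{service}.{namespace}.svc.cluster.local"


if __name__ == "__main__":
    manifest = elasticsearch_statefulset()
    print(json.dumps(manifest, indent=2))
    print(pod_dns_name(manifest, 0))
```

The printed JSON is accepted by `kubectl apply -f` in the same way as YAML, so the sketch can serve as a starting point for the configuration file mentioned in step 1.1.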

A container cloud platform log analysis and alarm method first executes the container cloud platform log collection method described above and then carries out the following steps:

step 2, processing the collected log data;

step 3, transmitting the processed log data to a container cloud platform through a data interface;

step 4, establishing a log feature library;

step 5, the container cloud platform performs correlation analysis on the container log and the application log according to the log feature library;

step 6, when the correlation analysis result meets an alarm condition, displaying the report functions (such as generating, importing and exporting reports), the alarm notification and the policy configuration on a front-end interface.

Step 2 comprises the following steps:

step 2.1, configuring a plurality of Logstash nodes to run in parallel (Logstash is a platform for transporting, processing, managing and searching application logs and events), and uploading the cluster logs, the container logs and the application logs to Logstash;

step 2.2, Logstash filtering the data of the cluster logs, the container logs and the application logs separately and uploading the filtered log data to Elasticsearch, which indexes and stores it.

Step 2.2 specifically comprises the following steps:

step 2.2.1, writing a Logstash configuration file and cleaning the collected cluster log, container log and application log data separately to form a standardized log record format; the standardized fields include the log timestamp, pod IP address, pod name, associated label, type, message, error type and the like;

step 2.2.2, Elasticsearch exposing an HTTP API externally, through which users can search and query the cleaned log data as needed, including by setting keywords (such as the associated label, time or error type) to quickly locate the application and system to which a log belongs.
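The standardized record format of step 2.2.1 can be sketched as follows. The patent performs this cleaning in a Logstash configuration file; the Python function below is only an illustration of the resulting record shape, and the input key names (`@timestamp`, `pod_ip`, `pod_name`, `label`) as well as the rule for deriving `error_type` are assumptions.

```python
from datetime import datetime, timezone

# Standardized fields listed in step 2.2.1 (key spellings are hypothetical).
STANDARD_FIELDS = ("timestamp", "pod_ip", "pod_name", "label",
                   "type", "message", "error_type")


def normalize_record(raw, log_type):
    """Map one raw collector event onto the standardized record format.

    raw      -- dict produced by the collector (assumed key names)
    log_type -- "cluster", "container" or "application"
    """
    message = str(raw.get("message", "")).strip()
    timestamp = raw.get("@timestamp")
    if timestamp is None:
        # Fall back to the processing time when the event carries no timestamp.
        timestamp = datetime.now(timezone.utc).isoformat()
    # Assumption: a record counts as an error when its message contains "ERROR".
    error_type = "ERROR" if "ERROR" in message else None
    return {
        "timestamp": timestamp,
        "pod_ip": raw.get("pod_ip"),
        "pod_name": raw.get("pod_name"),
        "label": raw.get("label"),
        "type": log_type,
        "message": message,
        "error_type": error_type,
    }


if __name__ == "__main__":
    event = {
        "@timestamp": "2020-07-07T00:00:00Z",
        "message": "  ERROR database connection timed out \n",
        "pod_ip": "10.0.0.5",
        "pod_name": "web-0",
        "label": "app=web",
    }
    print(normalize_record(event, "application"))
```

Keeping every log type in one record shape is what lets a single keyword query span cluster, container and application logs.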

The data interface in step 3 is an HTTP API interface.

In step 4, in order to locate log types accurately, the log types are classified by fault and a log feature library is established, which comprises a primary fault log feature library, a secondary fault log feature library and a tertiary fault log feature library;

the primary fault log mainly refers to Kubernetes component service fault logs and cluster fault logs, including: node state NotReady and critical component Unhealthy;

the secondary fault log mainly refers to pod state fault logs, including pods that remain in the Pending, Waiting, ContainerCreating, ImagePullBackOff, CrashLoopBackOff, Error, Terminating or Unknown state;

After detecting an event alarm keyword, the system outputs the alarm log data and gives the fault level, the specific cause and handling feedback in the alarm information column:

(1) Keyword %Pending% is matched: the YAML file of the Pod has been submitted to Kubernetes, and the API object has been created and saved in etcd, but some containers in the Pod cannot be created for some reason, for example because scheduling failed. Possible causes: insufficient resources (no node in the cluster satisfies the CPU, memory, GPU or other resources requested by the Pod); the HostPort is already occupied (it is usually recommended to expose the service port through a Service instead).

(2) Keyword %Waiting% or %ContainerCreating% is matched: according to the log output, the following causes are matched automatically:

a. Image pull failure, for example a misconfigured image address, a foreign image registry (gcr.io) that cannot be pulled from, a misconfigured private image key, or an image so large that the pull times out (the --image-pull-progress-deadline and --runtime-request-timeout options of the kubelet can be adjusted appropriately).

b. CNI network error: the configuration of the CNI network plug-in generally needs to be checked, for example when the Pod network cannot be configured or an IP address cannot be assigned.

c. The container cannot be started: check whether the correct image was packaged and whether the correct container parameters are configured.

d. Failed create pod sandbox: check the kubelet log; the cause may be an input/output error.

(3) Keyword %ImagePullBackOff% is matched: usually caused by a misconfigured image name or a misconfigured private image key.

(4) Keyword %CrashLoopBackOff% is matched: the container did start but then exited abnormally. Possible causes include the container process exiting or a failed health check.

(5) Keyword %Error% is matched: an error occurred while the Pod was starting. Common causes: a dependent ConfigMap, Secret or PV does not exist; the requested resources exceed a limit set by the administrator, such as a LimitRange; a cluster security policy such as PodSecurityPolicy is violated; the container is not authorized to operate on resources in the cluster, for example once RBAC is enabled a role binding must be configured for the ServiceAccount.

(6) Keyword %Terminating% or %Unknown% is matched: usually the Node has lost contact and the Pods running on it have not been deleted. The Unknown state means that the kubelet can no longer report the Pod status to kube-apiserver, which is most likely a communication problem between the master and worker nodes (master and kubelet).
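The keyword matching described above can be sketched as a small lookup over a feature library. The table layout, the function name and the abbreviated hint texts are assumptions made for illustration; the `%keyword%` notation is read here as a case-sensitive substring match, which is also an assumption.

```python
# Hypothetical in-memory feature library: (fault level, keyword, hint).
# Level 1 = primary (component/cluster faults), level 2 = secondary (pod state faults).
FEATURE_LIBRARY = [
    (1, "NotReady", "node state NotReady"),
    (1, "Unhealthy", "critical component Unhealthy"),
    (2, "Pending", "insufficient resources or HostPort already occupied"),
    (2, "Waiting", "image pull failure, CNI error, or bad container parameters"),
    (2, "ContainerCreating", "image pull failure, CNI error, or sandbox creation failed"),
    (2, "ImagePullBackOff", "image name or private image key misconfigured"),
    (2, "CrashLoopBackOff", "container exited abnormally or health check failed"),
    (2, "Error", "missing ConfigMap/Secret/PV, limits exceeded, policy or RBAC issue"),
    (2, "Terminating", "node lost contact; pod not deleted"),
    (2, "Unknown", "kubelet cannot report pod status to kube-apiserver"),
]


def match_faults(message):
    """Return every feature-library entry whose keyword occurs in the message."""
    hits = []
    for level, keyword, hint in FEATURE_LIBRARY:
        if keyword in message:  # substring match, mirroring %keyword%
            hits.append({"level": level, "keyword": keyword, "hint": hint})
    return hits


if __name__ == "__main__":
    for line in ("pod web-0 status CrashLoopBackOff",
                 "node node-1 is NotReady",
                 "request served in 12ms"):
        print(line, "->", match_faults(line))
```

Each hit carries the fault level and a cause hint, which corresponds to the information the alarm column is described as showing.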

The tertiary fault log mainly concerns the service log: when the number of ERROR entries in the service log exceeds a preset value, the tertiary fault alarm mechanism is triggered. An ERROR state generally means that a serious bug has appeared in the online service and needs to be resolved by the developers.
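A minimal sketch of the tertiary trigger, assuming that an "ERROR entry" is any service log line containing the token `ERROR`. The function names are hypothetical, and the default threshold of 20 is taken from the typical value the description gives for the tertiary alarm.

```python
def count_errors(lines):
    """Count service log lines that contain the token ERROR."""
    return sum(1 for line in lines if "ERROR" in line)


def tertiary_alarm_triggered(lines, threshold=20):
    """True when the ERROR count exceeds (not merely reaches) the preset value."""
    return count_errors(lines) > threshold


if __name__ == "__main__":
    service_log = ["INFO request ok"] * 100 + ["ERROR upstream timeout"] * 21
    print(count_errors(service_log), tertiary_alarm_triggered(service_log))
```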

After the container logs and application logs have been screened against the log feature library, a fault list is generated automatically. The container cloud platform sends alarm information to the alarm recipients and users according to the fault level, and an alarm recipient can run a full-text search on the platform by timestamp or exact value to obtain the detailed log information.
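Retrieval by timestamp or exact value could be expressed as an Elasticsearch query body like the one built below. This is a sketch only: the field names (`timestamp`, `message`, `pod_name`) and the function name are assumptions, and the patent does not specify the query the platform actually issues.

```python
import json


def build_search_body(start, end, exact=None, text=None, size=50):
    """Build an Elasticsearch Query DSL body for log retrieval.

    start, end -- time range bounds for the (assumed) 'timestamp' field
    exact      -- dict of field -> exact value, matched with term filters
    text       -- optional full-text query against the (assumed) 'message' field
    """
    filters = [{"range": {"timestamp": {"gte": start, "lte": end}}}]
    for field, value in (exact or {}).items():
        filters.append({"term": {field: value}})
    query = {"bool": {"filter": filters}}
    if text:
        query["bool"]["must"] = [{"match": {"message": text}}]
    return {
        "query": query,
        "size": size,
        "sort": [{"timestamp": {"order": "desc"}}],
    }


if __name__ == "__main__":
    body = build_search_body(
        "2020-07-07T00:00:00Z", "2020-07-07T01:00:00Z",
        exact={"pod_name": "web-0"}, text="CrashLoopBackOff",
    )
    print(json.dumps(body, indent=2))
```

Such a body would be sent with a POST request to the `_search` endpoint of the relevant index over the HTTP API that Elasticsearch exposes.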

Alarm recipients are divided into three categories: operation and maintenance managers, project developers and platform administrators. The platform administrator has the authority to create (name, telephone number, email address), modify and delete alarm recipients, to assign them to alarm levels, and to receive a copy of each alarm task.

When a primary fault alarm is triggered, the container cloud platform notifies the operation and maintenance manager by short message or email and copies the message to the platform administrator;

when a secondary fault alarm is triggered, the container cloud platform notifies the project developer by short message or email and copies the message to the operation and maintenance manager and the platform administrator;

when a tertiary fault alarm is triggered, because the online service log is updated continuously, alarming and receiving are handled by an alarm script executed in the server background. Typically, once the number of ERROR entries exceeds 20, the container cloud platform immediately sends a first alarm, notifying the project developer by short message or email and copying the message to the platform administrator; the alarm frequency is then adjusted through the sleep interval in the script to keep the monitoring alarm timely.
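The routing rules and the sleep-paced background script can be sketched together. The role identifiers, the callback-based `send` interface, the 60-second default interval and the `rounds` parameter are assumptions for illustration; the patent states only who is notified, who is copied, the threshold of 20 and that sleep controls the alarm frequency.

```python
import time

# Recipient routing by fault level, following the three rules above.
ROUTING = {
    1: {"notify": "operations_manager", "cc": ["platform_administrator"]},
    2: {"notify": "project_developer",
        "cc": ["operations_manager", "platform_administrator"]},
    3: {"notify": "project_developer", "cc": ["platform_administrator"]},
}


def route_alarm(level):
    """Return (primary recipient, list of copied recipients) for a fault level."""
    if level not in ROUTING:
        raise ValueError(f"unknown fault level: {level}")
    rule = ROUTING[level]
    return rule["notify"], list(rule["cc"])


def run_tertiary_alarm(count_errors, send, threshold=20,
                       interval_seconds=60, rounds=1, sleep=time.sleep):
    """Poll the ERROR count and alarm whenever it exceeds the threshold.

    count_errors -- callable returning the current ERROR count
    send         -- callable(to, cc) that delivers the short message or email
    rounds       -- number of polling rounds (a real script might loop forever)
    Returns the number of alarms sent.
    """
    sent = 0
    for i in range(rounds):
        if count_errors() > threshold:
            to, cc = route_alarm(3)
            send(to, cc)
            sent += 1
        if i < rounds - 1:
            sleep(interval_seconds)  # the sleep interval sets the alarm frequency
    return sent


if __name__ == "__main__":
    run_tertiary_alarm(lambda: 25, lambda to, cc: print("alarm ->", to, "cc", cc))
```

Passing `sleep` as a parameter keeps the loop testable without waiting; shortening `interval_seconds` raises the alarm frequency.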

Users can view error logs in real time on a preview interface of the container cloud platform, which helps them identify the form of a service fault promptly, work out a solution more quickly, and keep the online service stable.

The invention has the beneficial effects that:

1. The invention collects cluster logs, container logs and application logs; the container cloud platform performs correlation analysis on the container logs and application logs according to the log feature library, displays the result, and sends alarm information to operation and maintenance managers, project developers, platform administrators and users in a timely manner according to the fault level, so that development and operations personnel can obtain useful data promptly and solve problems in production and test environments in time.

2. The invention establishes a log feature library and classifies log types by fault, which achieves semi-automatic operation, maintenance and management of the container cloud platform, reduces the difficulty of log analysis, and lowers the cost of log data analysis and system operation and maintenance.

3. The invention comprehensively collects cluster log, container log and application log data, and the container cloud platform performs correlation analysis on the logs according to the log feature library and displays the result, making the logs more readable for development and operations personnel.

(IV) specific embodiment:

The container cloud platform log collection method is based on a Kubernetes container cloud platform and comprises steps 1.1 to 1.4 as set out in section (III). In step 1.1, the PVC is backed by NFS, CephFS or similar storage.

The container cloud platform log analysis and alarm method comprises the following steps:

step 1, executing the container cloud platform log collection method;

steps 2 to 6, carried out as set out in section (III), including sub-steps 2.1, 2.2, 2.2.1 and 2.2.2 of step 2, the HTTP API data interface of step 3, the primary, secondary and tertiary fault log feature libraries of step 4, and the alarm keyword handling and recipient notification rules described there.

Claims

1. A container cloud platform log collection method, characterized in that it is based on a Kubernetes container cloud platform and comprises the following steps:

step 1.1, writing an Elasticsearch configuration file, deploying Elasticsearch through a StatefulSet (stateful replica set), using a StorageClass to mount log storage, the StorageClass generating a PVC (Persistent Volume Claim), and generating a domain name address for Elasticsearch;

step 1.2, collection of cluster logs: writing a configuration file for Filebeat, deploying it as a DaemonSet (daemon process set), and collecting the /var/log/messages log data of each node;

step 1.3, collection of container logs: first writing an eventrouter container configuration file and mounting the /data/log/eventrouter directory; then starting a Filebeat container that mounts the /data/log/eventrouter directory, and using Filebeat to collect the log data under that directory;

2. The container cloud platform log collection method of claim 1, wherein: in step 1.1, the PVC is a PVC based on NFS or CephFS.

3. A container cloud platform log analysis and alarm method comprising the container cloud platform log collection method according to claim 1 or 2, characterized in that: firstly, executing the container cloud platform log collection method, and then performing the following steps:

step 2, processing the collected log data;

step 4, establishing a log feature library;

step 6, when the correlation analysis result meets an alarm condition, displaying the report functions, the alarm notification and the policy configuration on a front-end interface.

4. The container cloud platform log analysis and alarm method according to claim 3, wherein: the step 2 comprises the following steps:

step 2.1, configuring a plurality of Logstash nodes to be parallel, and uploading the cluster logs, the container logs and the application logs to the Logstash;

5. The container cloud platform log analysis and alarm method according to claim 4, wherein: step 2.2 specifically comprises the following steps:

step 2.2.1, writing a Logstash configuration file and cleaning the collected cluster log, container log and application log data separately to form a standardized log record format;

step 2.2.2, Elasticsearch exposing an HTTP API externally, through which the user searches and queries the cleaned log data as needed and quickly locates the application and system to which a log belongs.

6. The container cloud platform log analysis and alarm method according to claim 3, wherein: the data interface in step 3 is an HTTP API interface.

7. The container cloud platform log analysis and alarm method according to claim 3, wherein: in the step 4, fault classification is carried out on the log categories, and a log feature library is established, wherein the log feature library comprises a primary fault log feature library, a secondary fault log feature library and a tertiary fault log feature library;

the primary fault log refers to Kubernetes component service fault logs and cluster fault logs;

the secondary fault log refers to pod state fault logs;

the tertiary fault log: when the number of ERROR entries in the service log exceeds a preset value, the tertiary fault alarm mechanism is triggered.

8. The container cloud platform log analysis alarm method according to claim 7, wherein: after container logs and application logs are screened by using the log feature library, a fault list is automatically generated, the container cloud platform sends alarm information to an alarm receiver and a user according to the fault level, and the alarm receiver performs full-text retrieval on the platform through a timestamp or an accurate value to acquire detailed log information.

9. The container cloud platform log analysis alarm method according to claim 8, wherein: the alarm recipients are divided into three categories: operation and maintenance managers, project developers and platform administrators;

when the tertiary fault alarm is triggered, alarming and receiving are handled by an alarm script executed in the server background; when the number of ERROR entries exceeds 20, the container cloud platform immediately sends a first alarm, notifies the project developer by short message or email, and copies the alarm to the platform administrator.