CN111782345A - Container cloud platform log collection and analysis alarm method - Google Patents


Publication number
CN111782345A
Authority
CN
China
Prior art keywords
log
container
alarm
cloud platform
fault
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010644149.9A
Other languages
Chinese (zh)
Other versions
CN111782345B (en)
Inventor
岳志军
尚尔卓
岳鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou Dvelop Technology Co ltd
Original Assignee
Zhengzhou Dvelop Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou Dvelop Technology Co ltd filed Critical Zhengzhou Dvelop Technology Co ltd
Priority to CN202010644149.9A priority Critical patent/CN111782345B/en
Publication of CN111782345A publication Critical patent/CN111782345A/en
Publication of CN111782345B publication Critical patent/CN111782345B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/32 Monitoring with visual or acoustical indication of the functioning of the machine
    • G06F11/324 Display of status information
    • G06F11/327 Alarm or error message display
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G06F2009/45595 Network integration; Enabling network access in virtual machine instances

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention relates to a log collection, analysis and alarm method for a container cloud platform. Based on a Kubernetes container cloud platform, the method writes an Elasticsearch configuration file, deploys Elasticsearch through a StatefulSet, and mounts log storage using a StorageClass, so as to collect cluster logs, container logs and application logs. The log data is processed and transmitted to the container cloud platform through a data interface; the container cloud platform performs correlation analysis on the container logs and application logs according to an established log feature library and, when an alarm condition is met, presents the report functions, alarm notifications and policy configuration on a front-end interface. The method can send alarm information to alarm recipients and users in a timely manner, reduce the difficulty of log analysis, lower system operation and maintenance costs, and make logs more readable for development and operation and maintenance personnel.

Description

Container cloud platform log collection and analysis alarm method
(I) Technical field:
The invention relates to a cloud platform log collection and analysis alarm method, and in particular to a container cloud platform log collection and analysis alarm method.
(II) Background art:
Kubernetes, abbreviated K8s, has become the de facto standard for container orchestration. Docker itself and Mesos both support Kubernetes, it has been put into production by a large number of enterprises, and heavyweight platform users such as GitHub, eBay and penbo have announced the migration of their services to Kubernetes. The wide adoption of K8s makes supporting functions such as application monitoring and log analysis increasingly important, and these functions affect how well each project lands in cloud computing.
With the rapid development of containerization, container platforms must handle a large volume of requirement analysis and log organization to serve the development and operation and maintenance personnel who work with container technology. However, managing and operating a Kubernetes cluster in the traditional way is complex and tedious, and collecting and analyzing massive container cloud platform log data carries a real time cost and technical threshold. Log analysis on traditional container platforms usually only displays the container logs of the project application and often ignores the correlation between container logs and application logs, so that, faced with massive log data, development and operation and maintenance personnel find it difficult to obtain useful data promptly and solve the problems that arise in production and test environments.
(III) Summary of the invention:
The technical problem to be solved by the invention is to provide a container cloud platform log collection method and a container cloud platform log analysis and alarm method. The collection method can collect cluster logs, container logs and application logs; the analysis and alarm method can send alarm information to operation and maintenance managers, project developers, platform administrators and users in a timely manner, which reduces the difficulty of log analysis, lowers system operation and maintenance costs, and makes logs more readable for development and operation and maintenance personnel.
The technical scheme of the invention is as follows:
A container cloud platform log collection method, based on a Kubernetes container cloud platform, comprises the following steps:
Step 1.1: write an Elasticsearch (a Lucene-based search server) configuration file, deploy Elasticsearch through a StatefulSet, and mount log storage using a StorageClass; the StorageClass generates a PVC (Persistent Volume Claim), and a domain name address is generated for Elasticsearch so that other components can reference it;
Step 1.2, collection of cluster logs: write a configuration file for Filebeat (a lightweight log collector implemented in Go), deploy Filebeat as a DaemonSet, and collect the /var/log/messages log data of each node;
Step 1.3, collection of container logs: first write an eventrouter (a component that routes and filters events) container configuration file and mount the /data/log/eventrouter directory; then start a Filebeat container that mounts the /data/log/eventrouter directory, and use Filebeat to collect the log data under that directory;
Step 1.4, collection of application logs: in the application deployment, start another Filebeat container that mounts the application log directory, and use Filebeat to collect the log data under the application project directory.
In step 1.1, the PVC is backed by NFS, CephFS or similar storage.
A container cloud platform log analysis and alarm method first executes the container cloud platform log collection method described above and then performs the following steps:
Step 2: process the collected log data;
Step 3: transmit the processed log data to the container cloud platform through a data interface;
Step 4: establish a log feature library;
Step 5: the container cloud platform performs correlation analysis on the container logs and application logs according to the log feature library;
Step 6: when the correlation analysis result meets an alarm condition, present the report functions (such as generating, importing and exporting reports), alarm notifications and policy configuration on a front-end interface.
Step 2 comprises the following steps:
Step 2.1: configure multiple Logstash (a platform for transporting, processing, managing and searching application logs and events) nodes to run in parallel, and upload the cluster logs, container logs and application logs to Logstash;
Step 2.2: Logstash filters the cluster log, container log and application log data separately and uploads the filtered log data to Elasticsearch, which indexes and stores it.
Step 2.2 specifically comprises the following steps:
Step 2.2.1: write a Logstash configuration file and clean the collected cluster log, container log and application log data separately to form a standardized log record format; the standardized fields include the log timestamp, pod IP address, pod name, associated label, type, message, error type and the like;
Step 2.2.2: Elasticsearch exposes an HTTP API externally, and users can search and query the cleaned log data as needed, including by setting keywords (such as associated label, time and error type), to quickly locate the application and system to which a log belongs.
The data interface in step 3 is an HTTP API.
In step 4, to locate log types accurately, the log categories are classified by fault level and a log feature library is established; the log feature library comprises a primary fault log feature library, a secondary fault log feature library and a tertiary fault log feature library;
Primary fault logs mainly refer to Kubernetes component service fault and cluster fault logs, including: node state NotReady and key component Unhealthy;
Secondary fault logs mainly refer to pod state fault logs, including pods that remain in the Pending, Waiting, ContainerCreating, ImagePullBackOff or Terminating state, or that are in the CrashLoopBackOff, Error or Unknown state;
After detecting an event alarm keyword, the system outputs the alarm log data and gives the fault level, specific cause and handling feedback in an alarm information column:
(1) Keyword %Pending%: the YAML file of the Pod has been submitted to Kubernetes and the API object has been created and saved in etcd, but some containers in the Pod cannot be created successfully for some reason, such as unsuccessful scheduling. Possible causes: insufficient resources (no node in the cluster satisfies the CPU, memory, GPU or other resources requested by the Pod); the HostPort is already occupied (it is usually recommended to expose the service port through a Service instead).
(2) Keyword %Waiting% or %ContainerCreating%: based on the log output, the following causes are matched automatically:
a. Image pull failure, for example an incorrectly configured image address, inability to pull from overseas image registries (gcr.io), an incorrectly configured key for a private image, or an image so large that the pull times out (the --image-pull-progress-deadline and --runtime-request-timeout options of the kubelet can be adjusted as appropriate).
b. CNI network error: the configuration of the CNI network plug-in generally needs to be checked, for example when the Pod network cannot be configured or an IP address cannot be assigned.
c. The container cannot be started: check whether the correct image was packaged and whether the correct container parameters are configured.
d. Failed create pod sandbox: check the kubelet log; the cause may be an input/output error.
(3) Keyword %ImagePullBackOff%: usually caused by an incorrectly configured image name or an incorrectly configured key for a private image.
(4) Keyword %CrashLoopBackOff%: the container did start but then exited abnormally. Possible causes include the container process exiting and the container being terminated after a failed health check.
(5) Keyword %Error%: an error occurred while the Pod was starting. Common causes: a ConfigMap, Secret or PV that the Pod depends on does not exist; the requested resources exceed limits set by the administrator, for example a LimitRange; a cluster security policy such as PodSecurityPolicy is violated; the container has no permission to operate on resources in the cluster, for example once RBAC is enabled a role binding must be configured for the ServiceAccount.
(6) Keyword %Terminating% or %Unknown%: typically the Node has lost contact, so the Pods running on it are not deleted. The Unknown state means that the kubelet can no longer report the Pod's status to kube-apiserver, most likely because of a communication problem between the Master and the node (kubelet).
Tertiary fault logs mainly concern service logs: when the number of ERROR entries in a service log exceeds a preset value, the tertiary fault alarm mechanism is triggered. The ERROR state generally means that a serious bug has appeared in an online service and must be resolved by developers.
After the container logs and application logs are screened against the log feature library, a fault list can be generated automatically. The container cloud platform sends alarm information to the alarm recipients and users according to the fault level, and an alarm recipient can run a full-text search on the platform by timestamp or exact value to obtain detailed log information.
Alarm recipients are divided into three categories: operation and maintenance managers, project developers and platform administrators. The platform administrator has the authority to create (name, telephone number, mailbox), modify and delete alarm recipients, to assign alarm recipients to alarm levels, and to receive copies of alarm tasks.
When a primary fault alarm is triggered, the container cloud platform notifies the operation and maintenance manager by short message or email and copies the platform administrator;
When a secondary fault alarm is triggered, the container cloud platform notifies the project developer by short message or email and copies the operation and maintenance manager and the platform administrator;
When a tertiary fault alarm is triggered, because online service logs are updated continuously, alarming and receiving are handled by an alarm script executed in the server background. Typically, when the number of ERROR entries exceeds 20, the container cloud platform immediately sends the first alarm, notifying the project developer by short message or email and copying the platform administrator; the alarm frequency is then adjusted through the sleep interval in the script to keep monitoring alarms timely.
Users can view error logs in real time on a preview interface of the container cloud platform, which helps them identify the form of a service fault promptly, arrive at a solution faster, and keep online services stable.
The invention has the following beneficial effects:
1. The invention collects cluster logs, container logs and application logs; the container cloud platform performs correlation analysis on and displays the container logs and application logs according to the log feature library and sends alarm information to operation and maintenance managers, project developers, platform administrators and users in a timely manner according to the fault level, so that development and operation and maintenance personnel can obtain useful data promptly and solve problems in production and test environments in time.
2. The invention establishes a log feature library and classifies log categories by fault level, which achieves semi-automatic operation, maintenance and management of the container cloud platform, reduces the difficulty of log analysis, and lowers the cost of log data analysis and system operation and maintenance.
3. The invention comprehensively collects cluster log, container log and application log data, and the container cloud platform performs correlation analysis on and displays the logs according to the log feature library, making logs more readable for development and operation and maintenance personnel.
(IV) Detailed description of the embodiments:
The container cloud platform log collection method is based on a Kubernetes container cloud platform and comprises the following steps:
Step 1.1: write an Elasticsearch (a Lucene-based search server) configuration file, deploy Elasticsearch through a StatefulSet, and mount log storage using a StorageClass; the StorageClass generates a PVC (Persistent Volume Claim), and a domain name address is generated for Elasticsearch so that other components can reference it;
Step 1.2, collection of cluster logs: write a configuration file for Filebeat (a lightweight log collector implemented in Go), deploy Filebeat as a DaemonSet, and collect the /var/log/messages log data of each node;
Step 1.3, collection of container logs: first write an eventrouter (a component that routes and filters events) container configuration file and mount the /data/log/eventrouter directory; then start a Filebeat container that mounts the /data/log/eventrouter directory, and use Filebeat to collect the log data under that directory;
Step 1.4, collection of application logs: in the application deployment, start another Filebeat container that mounts the application log directory, and use Filebeat to collect the log data under the application project directory.
In step 1.1, the PVC is backed by NFS, CephFS or similar storage.
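As a rough illustration of step 1.1, the sketch below builds a StatefulSet manifest as a plain Python dictionary and derives the stable per-pod domain names that Kubernetes assigns through a headless Service. The names `elasticsearch`, `logging` and `nfs-client`, the image tag and the storage size are assumptions chosen for the example; none of them come from the patent.

```python
import json


def elasticsearch_statefulset(name="elasticsearch", namespace="logging",
                              replicas=3, storage_class="nfs-client",
                              storage="50Gi"):
    """Return a minimal StatefulSet manifest; every default is an assumed value."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Name of the headless Service that gives each pod a stable DNS entry.
            "serviceName": name,
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{
                    "name": name,
                    # Assumed image; the patent does not name a version.
                    "image": "docker.elastic.co/elasticsearch/elasticsearch:7.8.0",
                    "volumeMounts": [{
                        "name": "data",
                        "mountPath": "/usr/share/elasticsearch/data",
                    }],
                }]},
            },
            # The StatefulSet controller creates one PVC per replica from this
            # template; the named StorageClass provisions the backing volume.
            "volumeClaimTemplates": [{
                "metadata": {"name": "data"},
                "spec": {
                    "accessModes": ["ReadWriteOnce"],
                    "storageClassName": storage_class,
                    "resources": {"requests": {"storage": storage}},
                },
            }],
        },
    }


def pod_dns_names(manifest, cluster_domain="cluster.local"):
    """List the per-pod domain names of a StatefulSet behind its headless Service."""
    meta, spec = manifest["metadata"], manifest["spec"]
    return [
        f"{meta['name']}-{i}.{spec['serviceName']}.{meta['namespace']}.svc.{cluster_domain}"
        for i in range(spec["replicas"])
    ]


manifest = elasticsearch_statefulset()
print(json.dumps(manifest["spec"]["volumeClaimTemplates"], indent=2))
print(pod_dns_names(manifest))
```

Filebeat and Logstash could then be pointed at one of these names, which is one plausible reading of the "domain name address" generated for Elasticsearch.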
The container cloud platform log analysis and alarm method comprises the following steps:
Step 1: execute the container cloud platform log collection method described above;
Step 2: process the collected log data;
Step 3: transmit the processed log data to the container cloud platform through a data interface;
Step 4: establish a log feature library;
Step 5: the container cloud platform performs correlation analysis on the container logs and application logs according to the log feature library;
Step 6: when the correlation analysis result meets an alarm condition, present the report functions (such as generating, importing and exporting reports), alarm notifications and policy configuration on a front-end interface.
Step 2 comprises the following steps:
Step 2.1: configure multiple Logstash (a platform for transporting, processing, managing and searching application logs and events) nodes to run in parallel, and upload the cluster logs, container logs and application logs to Logstash;
Step 2.2: Logstash filters the cluster log, container log and application log data separately and uploads the filtered log data to Elasticsearch, which indexes and stores it.
Step 2.2 specifically comprises the following steps:
Step 2.2.1: write a Logstash configuration file and clean the collected cluster log, container log and application log data separately to form a standardized log record format; the standardized fields include the log timestamp, pod IP address, pod name, associated label, type, message, error type and the like;
Step 2.2.2: Elasticsearch exposes an HTTP API externally, and users can search and query the cleaned log data as needed, including by setting keywords (such as associated label, time and error type), to quickly locate the application and system to which a log belongs.
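To make steps 2.2.1 and 2.2.2 more concrete, here is a minimal sketch of a standardized record and of a search request body written in Elasticsearch's query DSL. The field spellings (`timestamp`, `pod_ip`, `pod_name`, `label`, `type`, `message`, `error_type`) and both helper names are assumptions made for the example. The patent performs the cleaning in Logstash; Python is used here only to show the shape of the data.

```python
import json

# Assumed spellings of the standardized fields named in step 2.2.1.
STANDARD_FIELDS = ("timestamp", "pod_ip", "pod_name", "label",
                   "type", "message", "error_type")


def normalize(raw, log_type):
    """Project a raw record onto the standardized fields; absent fields become None."""
    record = {field: raw.get(field) for field in STANDARD_FIELDS}
    record["type"] = log_type  # "cluster", "container" or "application"
    return record


def build_search_body(label=None, error_type=None, start=None, end=None, size=50):
    """Build a bool/filter query body that could be sent to an index's _search endpoint."""
    filters = []
    if label is not None:
        filters.append({"term": {"label": label}})
    if error_type is not None:
        filters.append({"term": {"error_type": error_type}})
    time_range = {}
    if start is not None:
        time_range["gte"] = start
    if end is not None:
        time_range["lte"] = end
    if time_range:
        filters.append({"range": {"timestamp": time_range}})
    return {
        "size": size,
        "query": {"bool": {"filter": filters}},
        "sort": [{"timestamp": {"order": "desc"}}],
    }


record = normalize(
    {"timestamp": "2020-07-07T08:00:00Z", "pod_name": "shop-0",
     "message": "Back-off restarting failed container",
     "error_type": "CrashLoopBackOff", "stream": "stderr"},
    log_type="container",
)
body = build_search_body(error_type="CrashLoopBackOff",
                         start="2020-07-07T00:00:00Z")
print(json.dumps(record, indent=2))
print(json.dumps(body, indent=2))
```

The `term` filters assume that `label` and `error_type` are indexed as exact-value (keyword) fields; analyzed text fields would need a `match` query instead.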
The data interface in step 3 is an HTTP API.
In step 4, to locate log types accurately, the log categories are classified by fault level and a log feature library is established; the log feature library comprises a primary fault log feature library, a secondary fault log feature library and a tertiary fault log feature library;
Primary fault logs mainly refer to Kubernetes component service fault and cluster fault logs, including: node state NotReady and key component Unhealthy;
Secondary fault logs mainly refer to pod state fault logs, including pods that remain in the Pending, Waiting, ContainerCreating, ImagePullBackOff or Terminating state, or that are in the CrashLoopBackOff, Error or Unknown state;
After detecting an event alarm keyword, the system outputs the alarm log data and gives the fault level, specific cause and handling feedback in an alarm information column:
(1) Keyword %Pending%: the YAML file of the Pod has been submitted to Kubernetes and the API object has been created and saved in etcd, but some containers in the Pod cannot be created successfully for some reason, such as unsuccessful scheduling. Possible causes: insufficient resources (no node in the cluster satisfies the CPU, memory, GPU or other resources requested by the Pod); the HostPort is already occupied (it is usually recommended to expose the service port through a Service instead).
(2) Keyword %Waiting% or %ContainerCreating%: based on the log output, the following causes are matched automatically:
a. Image pull failure, for example an incorrectly configured image address, inability to pull from overseas image registries (gcr.io), an incorrectly configured key for a private image, or an image so large that the pull times out (the --image-pull-progress-deadline and --runtime-request-timeout options of the kubelet can be adjusted as appropriate).
b. CNI network error: the configuration of the CNI network plug-in generally needs to be checked, for example when the Pod network cannot be configured or an IP address cannot be assigned.
c. The container cannot be started: check whether the correct image was packaged and whether the correct container parameters are configured.
d. Failed create pod sandbox: check the kubelet log; the cause may be an input/output error.
(3) Keyword %ImagePullBackOff%: usually caused by an incorrectly configured image name or an incorrectly configured key for a private image.
(4) Keyword %CrashLoopBackOff%: the container did start but then exited abnormally. Possible causes include the container process exiting and the container being terminated after a failed health check.
(5) Keyword %Error%: an error occurred while the Pod was starting. Common causes: a ConfigMap, Secret or PV that the Pod depends on does not exist; the requested resources exceed limits set by the administrator, for example a LimitRange; a cluster security policy such as PodSecurityPolicy is violated; the container has no permission to operate on resources in the cluster, for example once RBAC is enabled a role binding must be configured for the ServiceAccount.
(6) Keyword %Terminating% or %Unknown%: typically the Node has lost contact, so the Pods running on it are not deleted. The Unknown state means that the kubelet can no longer report the Pod's status to kube-apiserver, most likely because of a communication problem between the Master and the node (kubelet).
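The primary and secondary feature libraries described above amount to a keyword table. The sketch below reads the `%keyword%` notation as a plain substring match, which is an assumption, and the table layout and function names are hypothetical. The causes are abbreviated from the list above.

```python
# (level, keyword, abbreviated cause). More specific keywords come first so that
# a generic keyword such as "Error" only acts as a fallback.
FEATURE_LIBRARY = [
    (1, "NotReady", "node state NotReady"),
    (1, "Unhealthy", "key component Unhealthy"),
    (2, "ImagePullBackOff", "wrong image name or private image key"),
    (2, "CrashLoopBackOff", "container started, then exited abnormally"),
    (2, "ContainerCreating", "image pull, CNI, start-up or sandbox failure"),
    (2, "Waiting", "image pull, CNI, start-up or sandbox failure"),
    (2, "Pending", "insufficient resources or occupied HostPort"),
    (2, "Terminating", "node lost contact"),
    (2, "Unknown", "kubelet cannot report pod status"),
    (2, "Error", "missing dependency, limit exceeded, policy or RBAC"),
]


def classify(message):
    """Return (level, keyword, cause) for the first matching keyword, else None."""
    for level, keyword, cause in FEATURE_LIBRARY:
        if keyword in message:
            return level, keyword, cause
    return None


def fault_list(messages):
    """Screen messages against the library and keep only matches, most severe first."""
    hits = [(classify(m), m) for m in messages]
    return sorted(
        ({"level": c[0], "keyword": c[1], "cause": c[2], "message": m}
         for c, m in hits if c is not None),
        key=lambda fault: fault["level"],
    )


for fault in fault_list([
    "pod shop-0 is in CrashLoopBackOff",
    "heartbeat received",
    "node worker-2 NotReady",
]):
    print(fault)
```

Ordering the table from specific to generic is a design choice of this sketch: it keeps a message that mentions both `Error` and `ImagePullBackOff` from being reported under the vaguer keyword.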
Tertiary fault logs mainly concern service logs: when the number of ERROR entries in a service log exceeds a preset value, the tertiary fault alarm mechanism is triggered. The ERROR state generally means that a serious bug has appeared in an online service and must be resolved by developers.
After the container logs and application logs are screened against the log feature library, a fault list can be generated automatically. The container cloud platform sends alarm information to the alarm recipients and users according to the fault level, and an alarm recipient can run a full-text search on the platform by timestamp or exact value to obtain detailed log information.
Alarm recipients are divided into three categories: operation and maintenance managers, project developers and platform administrators. The platform administrator has the authority to create (name, telephone number, mailbox), modify and delete alarm recipients, to assign alarm recipients to alarm levels, and to receive copies of alarm tasks.
When a primary fault alarm is triggered, the container cloud platform notifies the operation and maintenance manager by short message or email and copies the platform administrator;
When a secondary fault alarm is triggered, the container cloud platform notifies the project developer by short message or email and copies the operation and maintenance manager and the platform administrator;
When a tertiary fault alarm is triggered, because online service logs are updated continuously, alarming and receiving are handled by an alarm script executed in the server background. Typically, when the number of ERROR entries exceeds 20, the container cloud platform immediately sends the first alarm, notifying the project developer by short message or email and copying the platform administrator; the alarm frequency is then adjusted through the sleep interval in the script to keep monitoring alarms timely.
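The routing rules and the tertiary alarm script can be sketched as follows. The threshold of 20 comes from the text; the role names, the `watch` function, its parameters and the 60-second interval are assumptions, and `send` stands in for whatever short-message or mail gateway the platform actually uses.

```python
import time

# Fault level -> who is notified and who is copied, as described above.
ROUTING = {
    1: {"notify": "ops_manager", "cc": ["platform_admin"]},
    2: {"notify": "project_developer", "cc": ["ops_manager", "platform_admin"]},
    3: {"notify": "project_developer", "cc": ["platform_admin"]},
}


def count_errors(lines):
    """Count service-log lines that contain the ERROR marker."""
    return sum(1 for line in lines if "ERROR" in line)


def watch(read_lines, send, threshold=20, interval=60, rounds=3, sleep=time.sleep):
    """Poll the service log `rounds` times and alarm when ERROR count exceeds threshold.

    The first alarm goes out on the first poll that exceeds the threshold; the
    sleep interval then limits how often follow-up alarms can repeat.
    """
    alarms = 0
    for i in range(rounds):
        errors = count_errors(read_lines())
        if errors > threshold:
            route = ROUTING[3]
            send(route["notify"], route["cc"], f"{errors} ERROR entries in service log")
            alarms += 1
        if i < rounds - 1:
            sleep(interval)
    return alarms


# Demo with an in-memory log and no real waiting.
sample = ["2020-07-07 ERROR payment failed"] * 21 + ["2020-07-07 INFO ok"]
sent = []
watch(lambda: sample, lambda to, cc, msg: sent.append((to, cc, msg)),
      rounds=2, sleep=lambda seconds: None)
print(sent)
```

Passing `sleep` and `send` in as parameters keeps the loop testable without real delays or a real notification channel; a production script would more likely run indefinitely than for a fixed number of rounds.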
Users can view error logs in real time on a preview interface of the container cloud platform, which helps them identify the form of a service fault promptly, arrive at a solution faster, and keep online services stable.

Claims (9)

1. A container cloud platform log collection method, characterized in that it is based on a Kubernetes container cloud platform and comprises the following steps:
Step 1.1: write an Elasticsearch configuration file, deploy Elasticsearch through a StatefulSet, and mount log storage using a StorageClass; the StorageClass generates a PVC (Persistent Volume Claim), and a domain name address is generated for Elasticsearch;
Step 1.2, collection of cluster logs: write a Filebeat configuration file, deploy Filebeat as a DaemonSet, and collect the /var/log/messages log data of each node;
Step 1.3, collection of container logs: first write an eventrouter container configuration file and mount the /data/log/eventrouter directory; then start a Filebeat container that mounts the /data/log/eventrouter directory, and use Filebeat to collect the log data under that directory;
Step 1.4, collection of application logs: in the application deployment, start another Filebeat container that mounts the application log directory, and use Filebeat to collect the log data under the application project directory.
2. The container cloud platform log collection method of claim 1, wherein in step 1.1 the PVC is a PVC based on NFS and CephFS.
3. A container cloud platform log analysis and alarm method comprising the container cloud platform log collection method according to claim 1 or 2, characterized in that the container cloud platform log collection method is executed first and the following steps are then performed:
Step 2: process the collected log data;
Step 3: transmit the processed log data to the container cloud platform through a data interface;
Step 4: establish a log feature library;
Step 5: the container cloud platform performs correlation analysis on the container logs and application logs according to the log feature library;
Step 6: when the correlation analysis result meets an alarm condition, present the report functions, alarm notifications and policy configuration on a front-end interface.
4. The container cloud platform log analysis and alarm method according to claim 3, wherein step 2 comprises the following steps:
Step 2.1: configure multiple Logstash nodes to run in parallel, and upload the cluster logs, container logs and application logs to Logstash;
Step 2.2: Logstash filters the cluster log, container log and application log data separately and uploads the filtered log data to Elasticsearch, which indexes and stores it.
5. The container cloud platform log analysis and alarm method according to claim 4, wherein step 2.2 specifically comprises the following steps:
Step 2.2.1: write a Logstash configuration file and clean the collected cluster log, container log and application log data separately to form a standardized log record format;
Step 2.2.2: Elasticsearch exposes an HTTP API externally, and users search and query the cleaned log data as needed to quickly locate the application and system to which a log belongs.
6. The container cloud platform log analysis and alarm method according to claim 3, wherein the data interface in step 3 is an HTTP API.
7. The container cloud platform log analysis and alarm method according to claim 3, wherein: in the step 4, fault classification is carried out on the log categories, and a log feature library is established, wherein the log feature library comprises a primary fault log feature library, a secondary fault log feature library and a tertiary fault log feature library;
the primary fault log refers to Kubernetes component service fault logs and cluster fault logs;
the secondary fault log is a pod state fault log;
for the tertiary fault log, when the number of ERROR entries in the service log exceeds a preset value, a tertiary fault alarm mechanism is triggered.
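As a non-limiting illustration of the three-tier classification, the following Python sketch screens a batch of log records against a small feature library. The regular expressions in `FEATURE_LIBRARY` are hypothetical placeholders, since the claims do not list the actual features, and the record fields `level` and `message` are likewise assumed.

```python
import re

# Hypothetical feature library: fault level -> patterns that identify it.
FEATURE_LIBRARY = {
    1: [re.compile(p, re.I) for p in (          # Kubernetes component / cluster faults
        r"kube-apiserver.*(failed|unreachable)",
        r"etcd.*unavailable",
        r"node\S*\s+.*notready",
    )],
    2: [re.compile(p, re.I) for p in (          # pod state faults
        r"CrashLoopBackOff",
        r"ImagePullBackOff",
        r"OOMKilled",
    )],
}


def classify_message(message):
    """Return 1 or 2 if the message matches that level's features, else None."""
    for level in (1, 2):
        if any(p.search(message) for p in FEATURE_LIBRARY[level]):
            return level
    return None


def classify_batch(records, error_threshold):
    """Return the sorted fault levels triggered by a batch of records."""
    levels = set()
    error_count = 0
    for record in records:
        level = classify_message(record["message"])
        if level is not None:
            levels.add(level)
        if record.get("level") == "ERROR":
            error_count += 1
    # Tertiary fault: the ERROR count must strictly exceed the preset value.
    if error_count > error_threshold:
        levels.add(3)
    return sorted(levels)


if __name__ == "__main__":
    batch = [{"level": "WARN", "message": "pod web-1 is in CrashLoopBackOff"}]
    print(classify_batch(batch, error_threshold=20))
```

The preset value is passed in as a parameter because the claim leaves it configurable.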
8. The container cloud platform log analysis and alarm method according to claim 7, wherein: after the container logs and the application logs are screened using the log feature library, a fault list is automatically generated, the container cloud platform sends alarm information to an alarm receiver and a user according to the fault level, and the alarm receiver performs full-text retrieval on the platform by timestamp or exact value to obtain detailed log information.
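As a non-limiting illustration of retrieval by timestamp or exact value, the following Python sketch builds a request body in the Elasticsearch query DSL, combining a `range` filter on the timestamp, `term` filters for exact values and an optional full-text `match`. The field names `@timestamp`, `message` and `app` are assumptions, and `term` filters behave as exact matches only if the target fields are mapped as keyword fields. Sending the body to the cluster's search endpoint is omitted.

```python
import json


def build_search_body(start=None, end=None, exact=None, text=None, size=50):
    """Build a search body filtered by time range and/or exact field values."""
    filters = []
    if start is not None or end is not None:
        bounds = {}
        if start is not None:
            bounds["gte"] = start
        if end is not None:
            bounds["lte"] = end
        filters.append({"range": {"@timestamp": bounds}})
    # Exact-value lookups, e.g. {"app": "billing"} (hypothetical field name).
    for field, value in (exact or {}).items():
        filters.append({"term": {field: value}})
    # Full-text search on the message, or match everything if no text is given.
    must = [{"match": {"message": text}}] if text else [{"match_all": {}}]
    return {
        "size": size,
        "sort": [{"@timestamp": {"order": "desc"}}],
        "query": {"bool": {"must": must, "filter": filters}},
    }


if __name__ == "__main__":
    body = build_search_body(start="2020-07-07T08:00:00Z", exact={"app": "billing"})
    print(json.dumps(body, indent=2))
```

Placing the time range and exact values in the `filter` clause keeps them out of relevance scoring, so only the free-text part affects ranking.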
9. The container cloud platform log analysis and alarm method according to claim 8, wherein: the alarm receivers are divided into three categories: operation and maintenance managers, project developers and platform administrators;
when the primary fault alarm is triggered, the container cloud platform notifies an operation and maintenance manager by short message or email and copies the message to the platform administrator;
when the secondary fault alarm is triggered, the container cloud platform notifies a project developer by short message or email and copies the message to an operation and maintenance manager and a platform administrator;
when the tertiary fault alarm is triggered, the alarm is sent and received by a server background process executing an alarm script: when the number of ERROR entries exceeds 20, the container cloud platform immediately sends a first alarm, notifies a project developer by short message or email, and copies the alarm to a platform administrator.
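As a non-limiting illustration of the routing in this claim, the following Python sketch maps each fault level to its primary and copied recipients and applies the ERROR threshold of 20 for the tertiary alarm. The role identifiers and the `route_alarm` and `check_error_count` helpers are hypothetical names, and the actual delivery of short messages or emails is not shown.

```python
# Hypothetical role identifiers for the three categories of alarm receivers.
OPS_MANAGER = "ops_manager"
PROJECT_DEVELOPER = "project_developer"
PLATFORM_ADMIN = "platform_admin"

# Fault level -> (notified roles, copied roles), as laid out in the claim.
ROUTING = {
    1: ((OPS_MANAGER,), (PLATFORM_ADMIN,)),
    2: ((PROJECT_DEVELOPER,), (OPS_MANAGER, PLATFORM_ADMIN)),
    3: ((PROJECT_DEVELOPER,), (PLATFORM_ADMIN,)),
}

ERROR_THRESHOLD = 20  # value stated in the claim for the tertiary alarm


def route_alarm(level):
    """Return who is notified and who is copied for a given fault level."""
    to, cc = ROUTING[level]
    # The claim allows either channel, so both are listed as options.
    return {"level": level, "to": list(to), "cc": list(cc), "channels": ["sms", "email"]}


def check_error_count(error_count):
    """Return a tertiary alarm routing if the ERROR count exceeds 20, else None."""
    if error_count > ERROR_THRESHOLD:
        return route_alarm(3)
    return None


if __name__ == "__main__":
    print(route_alarm(1))
    print(check_error_count(21))
```

A background script could call `check_error_count` periodically with the current ERROR count and hand the result to whatever notification service the platform uses.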
CN202010644149.9A 2020-07-07 2020-07-07 Container cloud platform log collection and analysis alarm method Active CN111782345B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010644149.9A CN111782345B (en) 2020-07-07 2020-07-07 Container cloud platform log collection and analysis alarm method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010644149.9A CN111782345B (en) 2020-07-07 2020-07-07 Container cloud platform log collection and analysis alarm method

Publications (2)

Publication Number Publication Date
CN111782345A (en) 2020-10-16
CN111782345B (en) 2022-10-28

Family

ID=72757862

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010644149.9A Active CN111782345B (en) 2020-07-07 2020-07-07 Container cloud platform log collection and analysis alarm method

Country Status (1)

Country Link
CN (1) CN111782345B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009054847A1 (en) * 2007-10-23 2009-04-30 Qualcomm Incorporated Management of failures in wireless field devices
US20170111241A1 (en) * 2015-10-19 2017-04-20 Draios Inc. Automated service-oriented performance management
CN108572907A (en) * 2018-01-25 2018-09-25 北京金山云网络技术有限公司 A kind of alarm method, device, electronic equipment and computer readable storage medium
CN109245931A (en) * 2018-09-19 2019-01-18 四川长虹电器股份有限公司 The log management of container cloud platform based on kubernetes and the implementation method of monitoring alarm
CN110545195A (en) * 2018-05-29 2019-12-06 华为技术有限公司 network fault analysis method and device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
D V UDAY: "An Analysis of Health System Log Files using ELK Stack", 《2019 4TH INTERNATIONAL CONFERENCE ON RECENT TRENDS ON ELECTRONICS, INFORMATION, COMMUNICATION & TECHNOLOGY (RTEICT)》, 2 March 2020 (2020-03-02) *
罗学贯: "基于ELK的Web日志分析系统的设计与实现", 《中国优秀硕士学位论文全文数据库信息科技辑》, vol. 2019, no. 5, 15 May 2019 (2019-05-15) *
翟雅荣: "基于Filebeat自动收集Kubernetes日志的分析系统", 《计算机系统应用》, vol. 27, no. 9, 16 August 2018 (2018-08-16), pages 81 - 86 *
阮晓龙等: "基于ELK+Kafka的智慧运维大数据分析平台研究与实现", 《软件导刊》, no. 06, 15 June 2020 (2020-06-15) *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112527459A (en) * 2020-12-16 2021-03-19 新浪网技术(中国)有限公司 Log analysis method and device based on Kubernetes cluster
CN112527459B (en) * 2020-12-16 2024-03-26 新浪技术(中国)有限公司 Log analysis method and device based on Kubernetes cluster
CN113535519A (en) * 2021-07-27 2021-10-22 浪潮软件科技有限公司 Monitoring and alarming method
CN113535519B (en) * 2021-07-27 2024-01-30 浪潮软件科技有限公司 Monitoring alarm method
US11860752B2 (en) * 2021-12-15 2024-01-02 Bionic Stork Ltd. Agentless system and method for discovering and inspecting applications and services in compute environments
CN114500249A (en) * 2022-04-18 2022-05-13 中国工商银行股份有限公司 Root cause positioning method and device
CN114500249B (en) * 2022-04-18 2022-07-08 中国工商银行股份有限公司 Root cause positioning method and device

Also Published As

Publication number Publication date
CN111782345B (en) 2022-10-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant