CN114443433A

CN114443433A - Method and system for realizing distributed automatic alarm processing on cloud computing platform

Info

Publication number: CN114443433A
Application number: CN202210093267.4A
Authority: CN
Inventors: 刘伟; 江燕; 张勇; 石光银; 蔡卫卫; 高传集
Original assignee: Inspur Cloud Information Technology Co Ltd
Current assignee: Inspur Cloud Information Technology Co Ltd
Priority date: 2022-01-26
Filing date: 2022-01-26
Publication date: 2022-05-06

Abstract

The invention discloses a method and a system for distributed automatic alarm processing on a cloud computing platform, belonging to the technical field of automated operation and maintenance of cloud computing platforms. The method is implemented based on cloud native technology and comprises an alarm subscription part and an alarm processing part. The alarm subscription part implements an alarm subscription resource controller, manages alarm subscription resources, and declares in those resources the alarm information to be processed. The alarm processing part comprises an alarm processing server, an alarm processing distributor and an alarm processing agent; it receives alarms, matches them against the alarm subscription resources, issues pending alarm processing tasks and executes the specific alarm processing logic, thereby processing alarms automatically. The invention can solve operation and maintenance problems in various scenarios, process alarms automatically, reduce manual errors, improve operation and maintenance accuracy, handle massive alarm information, improve efficiency, reduce labor cost, support fault tolerance, and offer strong extensibility.

Description

Method and system for realizing distributed automatic alarm processing on cloud computing platform

Technical Field

The invention relates to the technical field of cloud computing platform automation operation and maintenance, in particular to a method and a system for realizing distributed automatic alarm processing by a cloud computing platform.

Background

With the large-scale adoption of cloud computing platforms, traditional applications are migrating to the cloud and using cloud computing technology to provide application microservices. As the number of applications supported by a cloud computing platform and the number of nodes it manages keep growing, and as the operating systems and CPU architectures of the managed physical machines differ, the multi-cluster, large-scale application scenarios of cloud computing place higher demands on operation and maintenance. On a cloud computing platform, providers of cloud computing resources and container services need to accurately monitor the running and service states of those resources and services as well as the state of the underlying clusters. When an alarm rule is triggered, the operation and maintenance side receives alarm information, so how to process massive alarm information quickly and effectively becomes an important problem. A method is needed that responds in time, locates faults accurately, resolves them quickly, and handles a large number of alarms.

The traditional operation and maintenance mode has several shortcomings when processing large-scale alarm information. First, important alarms cannot be screened accurately while the business volume increases sharply. When a major fault occurs and a large number of messages and alarms arrive at the same time, it is difficult to pick out the useful alarms from thousands of messages and to filter out duplicates. Second, IT operation and maintenance pressure is high and efficiency is low: the huge data flow sharply increases the workload of operation and maintenance personnel, and even 7×24-hour monitoring may fail to find the root cause of a problem, which affects service growth and user experience. Third, a unified management platform is lacking: a business system involves numerous servers, services, applications, databases and network devices, and without a unified, comprehensive operation and maintenance management and control platform, operation and maintenance is difficult and costly. In addition, the operation and maintenance process contains many work items that are fixed, routine and complex in their operation steps, and handling them by manually writing scripts and executing commands is inefficient, risky and slow. Reducing operation and maintenance cost and risk while improving operation and maintenance efficiency and service satisfaction has become an urgent problem.

Disclosure of Invention

The technical task of the invention is to address the above shortcomings by providing a method and a system for distributed automatic alarm processing on a cloud computing platform, which can solve operation and maintenance problems in various scenarios, process alarms automatically, reduce manual errors, improve operation and maintenance accuracy, handle massive alarm information, improve efficiency, reduce labor cost, support fault tolerance, and offer strong extensibility.

The technical scheme adopted by the invention for solving the technical problems is as follows:

a method for realizing distributed automatic alarm processing by a cloud computing platform is realized based on a cloud native technology and comprises an alarm subscription part and an alarm processing part,

the alarm subscription part realizes an alarm subscription resource controller, manages the alarm subscription resources and declares alarm information to be processed in the alarm subscription resources;

the alarm processing part comprises an alarm processing server, an alarm processing distributor and an alarm processing agent, receives the alarm and matches with the alarm subscription resource, issues an alarm to-be-processed task and executes a specific alarm processing logic, and finally realizes the automatic processing of the alarm.

The method supports multiple CPU architectures such as x86 and arm64, defines alarm subscription resources by combining the cloud native technology and the automatic operation and maintenance characteristics, and realizes the capacity of automatically processing alarms; the method operates in a cloud native environment, defines alarm subscription resources in a declarative manner, and supports the expansion of the alarm subscription resources.

Preferably, the alarm subscription part implements the alarm subscription resource controller based on a Kubernetes Operator and defines the alarm subscription resource as a custom resource, wherein:

the alarm subscription resource controller manages the alarm subscription resources in a hot-pluggable manner and supports creating, deleting, modifying and querying alarm subscription resources;

extension is supported, so that alarm resources are added on demand, and adding one type of alarm to be processed only requires adding one alarm subscription resource;

and fault tolerance is supported: the information in an alarm subscription resource can update the processing logic synchronously according to the needs of the alarm scenario, and the content of an alarm subscription resource can be modified online in real time.

When an alarm processing resource is added or updated, the alarm processing service is not interrupted, which improves the flexibility and extensibility of alarm processing.

Preferably, the alarm information to be processed includes an alarm name, a description of the alarm, a mode of obtaining a node where the alarm is located, a resource version number, and an alarm consumer.

Further, the alarm processing logic is defined in the consumer, and the alarm consumer mainly comprises the following alarm processing information:

1) type: the alarm consumer type; alarm processing supports several types, including shell, python, api, job and exec,

wherein shell uses a shell script to process the alarm; python uses a python script to process the alarm; api calls an implemented API interface to process the alarm; job uses a Kubernetes Job resource to process the alarm, and the container image required by the Job is also defined in the alarm consumer; exec runs an executable binary to process the alarm; the alarm subscription resources are defined declaratively, which is simple, convenient and highly extensible;

2) runtime: the run mode of the invoked alarm agent; the run modes supported by the alarm agent are service and container. service means that, during alarm processing, the invoked alarm agent runs as a service on each node of the Kubernetes cluster; container means that an alarm agent is started in a container by creating a Kubernetes Job resource, and the container and the Job are cleaned up automatically after alarm processing completes; the run mode of the alarm agent is selected according to the needs of the scenario;

3) command: defines the processing logic, which may be a shell or python script or an exec-type executable statement;

4) configmap: if the shell or python alarm processing logic is complex and the script is long, the script can be placed in a Kubernetes ConfigMap, and the ConfigMap name is declared in this field;

5) url: for alarm consumers of the shell and python types, url defines the entry point of the alarm processing script; for alarm consumers of the api type, url defines the API interface, and the concrete alarm processing logic is implemented by the service party behind that interface;

6) node: the node on which the alarm processing logic is executed; there are three types: single, any and all; single means on a specified node, any means on any node, and all means on all nodes.
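As an illustration only, the declared fields above could be modelled as the following data structure. This is a minimal sketch and not the patent's implementation: the class names, the Python attribute names (for example node_source and resource_version), the default values, and the specific validation rules for the shell, python and exec types are assumptions; only the field names type, runtime, command, configmap, url, node and the image of the job type come from the text.

```python
from dataclasses import dataclass
from typing import List, Optional

# Allowed values taken from the description of the alarm consumer fields.
CONSUMER_TYPES = ("shell", "python", "api", "job", "exec")
RUNTIMES = ("service", "container")
NODE_MODES = ("single", "any", "all")


@dataclass
class AlarmConsumer:
    type: str                        # shell | python | api | job | exec
    runtime: str = "service"         # default is an assumption
    command: Optional[str] = None    # inline script or executable statement
    configmap: Optional[str] = None  # name of a ConfigMap holding a long script
    url: Optional[str] = None        # script entry point or API interface
    node: str = "any"                # default is an assumption
    image: Optional[str] = None      # container image, used by the job type

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the consumer is usable."""
        errors = []
        if self.type not in CONSUMER_TYPES:
            errors.append(f"unknown type: {self.type}")
        if self.runtime not in RUNTIMES:
            errors.append(f"unknown runtime: {self.runtime}")
        if self.node not in NODE_MODES:
            errors.append(f"unknown node mode: {self.node}")
        if self.type == "job" and not self.image:
            errors.append("job consumer requires an image")
        if self.type == "api" and not self.url:
            errors.append("api consumer requires a url")
        # Assumed rule: a script consumer needs at least one script source.
        if self.type in ("shell", "python") and not (
            self.command or self.configmap or self.url
        ):
            errors.append("script consumer requires command, configmap or url")
        # Assumed rule: an exec consumer needs the statement to run.
        if self.type == "exec" and not self.command:
            errors.append("exec consumer requires a command")
        return errors


@dataclass
class AlarmSubscription:
    alarm_name: str          # name of the alarm to be processed
    description: str         # description of the alarm
    node_source: str         # how the node where the alarm is located is obtained
    resource_version: str    # resource version number
    consumer: AlarmConsumer  # where the processing logic is defined
```

Validating a resource before it is accepted is one way to keep a malformed subscription from reaching the distributor; whether the described controller performs such checks is not stated in the text.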

Preferably, the alarm processing server runs in a Kubernetes management-side cluster, centrally receives all alarm information pushed by alarm sources, records the pending alarm processing tasks, stores the alarm information in a database, and provides RESTful APIs, for example for querying alarm processing results;

the alarm processing distributor runs in a Kubernetes working cluster and is responsible for distributing an alarm processing task to a specific node;

the alarm processing agent is responsible for executing the specific alarm processing logic. The alarm agent runs on each working node of the Kubernetes working cluster as a systemctl-managed service; if the runtime defined by the alarm consumer matched with the alarm is of the service type, the alarm agent service on the node is called to process the alarm; if the runtime defined by the alarm consumer matched with the alarm is of the container type, a pod of an alarm agent is started as a Kubernetes Job to process the alarm.

Therefore, the method is suitable for a multi-cluster management scene, and can realize centralized management and distributed processing of alarms on multiple clusters.
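The choice between the two agent run modes can be sketched as follows. This is a hedged illustration under stated assumptions, not the patented implementation: the function names, the image name, the environment variable ALARM_TASK_ID and the 60-second clean-up delay are all hypothetical. The manifest uses standard fields of the Kubernetes batch/v1 Job API; ttlSecondsAfterFinished is one way to obtain the automatic clean-up of the container and the Job that the text describes.

```python
import re

# Hypothetical image name; the text does not name the agent image.
AGENT_IMAGE = "registry.example.com/alarm-agent:latest"


def job_name_for(alarm_name: str, task_id: int) -> str:
    """Derive a lowercase, DNS-label-style Job name from the alarm name."""
    slug = re.sub(r"[^a-z0-9-]+", "-", alarm_name.lower()).strip("-")
    return f"alarm-agent-{slug}-{task_id}"[:63].rstrip("-")


def plan_agent_invocation(runtime: str, node: str,
                          alarm_name: str, task_id: int) -> dict:
    """Describe how the agent should be invoked for one task on one node."""
    if runtime == "service":
        # The agent already runs as a service on the node; just address it.
        return {"mode": "service", "node": node, "task_id": task_id}
    if runtime == "container":
        manifest = {
            "apiVersion": "batch/v1",
            "kind": "Job",
            "metadata": {"name": job_name_for(alarm_name, task_id)},
            "spec": {
                # Let Kubernetes delete the finished Job and its pod.
                "ttlSecondsAfterFinished": 60,  # assumed delay
                "backoffLimit": 0,
                "template": {
                    "spec": {
                        "nodeName": node,  # pin the pod to the target node
                        "restartPolicy": "Never",
                        "containers": [{
                            "name": "alarm-agent",
                            "image": AGENT_IMAGE,
                            # Hypothetical way of passing the task to the agent.
                            "env": [{"name": "ALARM_TASK_ID",
                                     "value": str(task_id)}],
                        }],
                    }
                },
            },
        }
        return {"mode": "container", "manifest": manifest}
    raise ValueError(f"unsupported runtime: {runtime}")
```

The function only builds a plan; submitting the manifest to a cluster or calling the node's agent service is left out on purpose.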

Preferably, the alarm processing steps are as follows:

1) deploying the alarm subscription resource:

alarm subscription resources are deployed through an alarm subscription resource controller, and alarm names and specific alarm consumers to be processed are declared in the alarm subscription resources;

2) receiving an alarm:

the alarm processing server receives the alarm information, records the task to be processed of the alarm and stores the task to the database;

3) matching the alarm:

the alarm processing distributor polls the alarm processing server for alarm information; once alarm information is obtained, it matches the alarm subscription resources according to information such as the alarm name, the namespace and the node where the alarm is located, selects the alarm subscription resources associated with the alarm, and obtains the alarm consumer information from the matched alarm subscription resources;

4) distributing alarm processing information:

the alarm distributor distributes alarm processing tasks according to the matched alarm consumer information, and distribution falls into three cases:

a) for alarm consumers processed on a specified node (single), the alarm processing logic and the program are distributed to the specified node for execution;

b) for alarm consumers that do not specify a node (any), the alarm processing logic and the program are distributed to a randomly chosen node for execution;

c) for the alarm consumers processed at all nodes (all), distributing alarm processing logic and programs to all working nodes for execution;

5) processing the alarm:

the alarm processing agent is deployed on each node and executes specific alarm processing logic:

a) executing the alarm processing logic of the shell script for the alarm consumers of the shell type;

b) executing alarm processing logic of the python script for the alarm consumers of the python type;

c) for the alarm consumers of the api type, calling a designated api interface to process the alarm;

d) for job type alarm consumers, starting a Job resource on the Kubernetes cluster to process the alarm, the specific alarm logic being contained in the container image;

e) for exec type alarm consumers, running the corresponding executable binary to process the alarm;

and returning the result after executing the alarm processing logic, and finally storing the alarm result in the database through the alarm server.
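Steps 3) and 4) above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the method's actual code: alarms and subscriptions are plain dictionaries with hypothetical keys (name, namespace, node, alarmName, consumer), a subscription without a namespace is assumed to match any namespace, and the single case is assumed to target the node on which the alarm was raised.

```python
import random
from typing import Dict, List, Optional


def match_subscriptions(alarm: Dict, subscriptions: List[Dict]) -> List[Dict]:
    """Select the subscriptions associated with an alarm (step 3)."""
    matched = []
    for sub in subscriptions:
        if sub["alarmName"] != alarm["name"]:
            continue
        # Assumption: a subscription without a namespace matches any namespace.
        namespace = sub.get("namespace")
        if namespace is not None and namespace != alarm.get("namespace"):
            continue
        matched.append(sub)
    return matched


def select_nodes(mode: str, alarm_node: Optional[str], workers: List[str],
                 rng: Optional[random.Random] = None) -> List[str]:
    """Resolve the node field (single, any, all) to concrete nodes (step 4)."""
    if mode == "single":
        if alarm_node is None:
            raise ValueError("single mode needs the node of the alarm")
        return [alarm_node]
    if mode == "any":
        if not workers:
            raise ValueError("no worker nodes available")
        return [(rng or random).choice(workers)]
    if mode == "all":
        return list(workers)
    raise ValueError(f"unknown node mode: {mode}")


def distribute(alarm: Dict, subscriptions: List[Dict], workers: List[str],
               rng: Optional[random.Random] = None) -> List[Dict]:
    """Build one processing task per (matched consumer, target node) pair."""
    tasks = []
    for sub in match_subscriptions(alarm, subscriptions):
        consumer = sub["consumer"]
        nodes = select_nodes(consumer["node"], alarm.get("node"), workers, rng)
        for node in nodes:
            tasks.append({"alarm": alarm["name"], "node": node,
                          "type": consumer["type"]})
    return tasks
```

Passing a seeded random.Random instance as rng makes the any case reproducible, which is convenient when testing a distributor.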

Preferably, the method also provides alarm processing result query, so that pending alarm processing tasks and alarm processing results can be queried in real time; the method extends the ways in which alarm processing can be queried, and a user can query the alarm processing result through the API of the alarm processing system or in real time through a front-end web interface.

The invention also claims a system for realizing distributed automatic alarm processing by a cloud computing platform, which comprises an alarm subscription module and an alarm processing module,

the system realizes the method for realizing the distributed automatic alarm processing by the cloud computing platform.

The invention also claims a device for realizing distributed automatic alarm processing by the cloud computing platform, which comprises: at least one memory and at least one processor;

the at least one memory to store a machine readable program;

the at least one processor is used for calling the machine readable program and executing the method for realizing the distributed automatic alarm processing by the cloud computing platform.

The invention also claims a computer readable medium, on which computer instructions are stored, which when executed by a processor, cause the processor to execute the above-mentioned method for the cloud computing platform to implement distributed automatic alarm processing.

Compared with the prior art, the method and the system for realizing distributed automatic alarm processing by the cloud computing platform have the following beneficial effects:

the method realizes automatic alarm resolution and automated operation and maintenance, and can eventually evolve into intelligent operation and maintenance; it supports multiple CPU architectures and operation and maintenance in various scenarios, and greatly facilitates the developers and operation and maintenance personnel of cloud vendors who handle alarm processing services for public cloud, private cloud and hybrid cloud products.

The alarm subscription resources are managed in a hot-pluggable manner, are highly extensible, support fault tolerance, and are flexible and convenient to use;

the multi-CPU architecture and heterogeneous processing are supported, different components run on different cluster sides, and the multi-cluster scene can be adapted;

the method supports a multi-dimensional alarm processing mode, developers of alarm processing logic can flexibly select the alarm processing mode adaptive to own services, and full-time operation and maintenance personnel do not need to pay attention to the specific mode of the alarm processing logic;

the alarm is automatically processed, so that the labor cost and the production cost are reduced;

and massive alarms are processed in a distributed manner, so that the operation and maintenance efficiency is improved.

Drawings

FIG. 1 is a diagram of a method for implementing distributed automatic alarm processing by a cloud computing platform according to an embodiment of the present invention;

FIG. 2 is a flow diagram of automated alarm handling provided by an embodiment of the present invention;

FIG. 3 is an exemplary diagram of an automated processing alarm result query interface according to an embodiment of the present invention.

Detailed Description

The invention is further described with reference to the following figures and specific examples.

The embodiment of the invention provides a method for realizing distributed automatic alarm processing on a cloud computing platform, which supports multiple CPU architectures such as x86 and arm64, defines alarm subscription resources by combining cloud native technology and automatic operation and maintenance characteristics, and realizes the capacity of automatically processing alarms. The system comprises an alarm subscription part and an alarm processing part.

The alarm subscription part realizes an alarm subscription resource controller and manages the alarm subscription resources; the alarm processing part comprises an alarm processing server, an alarm processing distributor and an alarm processing agent, receives the alarm and matches with the alarm subscription resource, issues an alarm to-be-processed task and executes a specific alarm processing logic, and finally realizes the automatic processing of the alarm. Fig. 1 is a schematic diagram of the operation of the method.

The alarm subscription controller is implemented based on the Kubernetes Operator pattern. It defines alarm subscription resources declaratively, declares in those resources the alarms to be processed and the alarm consumers, and defines the alarm processing logic in the alarm consumers. The alarm subscription resources are custom resources; the alarm subscription controller supports creating, deleting, modifying and querying them, which gives strong extensibility. The alarm subscription resources can synchronize their processing logic in real time according to the needs of the alarm scenario, can be modified at any time, and support fault tolerance.

When a Kubernetes developer needs to extend the capabilities of Kubernetes, the developer can provide an abstraction of the capability to be extended, together with a controller that implements the concrete logic of that abstraction. The former is called a CRD (Custom Resource Definition), and the latter is called a Controller. The Operator pattern is the pattern that achieves Kubernetes extensibility in this way. The method adopts the Kubernetes Operator pattern and implements the alarm subscription resource controller with it; extensibility is strong because the alarm service party only needs to provide the corresponding alarm subscription resources.
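The controller half of this pattern can be illustrated with a toy, in-memory sketch. It assumes nothing about the actual controller: no Kubernetes client library is used, the class and method names are hypothetical, and the resource layout (metadata.name, spec.alarmName) is an assumed shape for the custom resource. The event types ADDED, MODIFIED and DELETED are the ones the Kubernetes watch API reports.

```python
from typing import Dict, Optional


class SubscriptionController:
    """Toy controller that keeps a registry in sync with watch events."""

    def __init__(self) -> None:
        # Maps alarm name -> spec of the subscription that handles it.
        self._by_alarm: Dict[str, dict] = {}
        # Maps resource name -> alarm name, so updates and deletes can be undone.
        self._alarm_of: Dict[str, str] = {}

    def handle_event(self, event_type: str, resource: dict) -> None:
        """Apply one watch event without restarting anything (hot plug)."""
        name = resource["metadata"]["name"]
        if event_type in ("ADDED", "MODIFIED"):
            # Drop the previous mapping first in case alarmName changed.
            old_alarm = self._alarm_of.pop(name, None)
            if old_alarm is not None:
                self._by_alarm.pop(old_alarm, None)
            spec = resource["spec"]
            self._by_alarm[spec["alarmName"]] = spec
            self._alarm_of[name] = spec["alarmName"]
        elif event_type == "DELETED":
            old_alarm = self._alarm_of.pop(name, None)
            if old_alarm is not None:
                self._by_alarm.pop(old_alarm, None)
        else:
            raise ValueError(f"unexpected event type: {event_type}")

    def lookup(self, alarm_name: str) -> Optional[dict]:
        """Return the subscription spec for an alarm, or None if unsubscribed."""
        return self._by_alarm.get(alarm_name)
```

Because each event only touches the registry entry of the resource it names, adding or changing one subscription leaves the handling of every other alarm untouched, which is the property the text calls hot plugging.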

The alarm subscription resource controller manages the alarm subscription resources in a hot-pluggable manner and supports creating, deleting, modifying and querying them. Extension is supported, so that alarm resources are added on demand, and adding one type of alarm to be processed only requires adding one alarm subscription resource. Fault tolerance is supported: the information in an alarm subscription resource can update the processing logic synchronously according to the needs of the alarm scenario, and the content of an alarm subscription resource can be modified online in real time. When an alarm processing resource is added or updated, the alarm processing service is not interrupted, which improves the flexibility and extensibility of alarm processing.

The alarm subscription resource declares information such as the name of the alarm to be processed, the alarm description, the way of obtaining the node where the alarm is located, the resource version number and the alarm consumer. The alarm processing logic is defined in the consumer, and the alarm consumer mainly comprises the following alarm processing information:

3) command: defines the processing logic, which may be a shell or python script or an exec-type executable statement;

The alarm subscription resource declares information such as the names of the alarms to be processed and the alarm consumers, and defines the alarm processing information in the alarm consumers; the alarm subscription resource supports fault tolerance and can be created, deleted, modified and queried online in real time. Alarm consumers are divided into different types, namely the shell, python, api, job and exec types, so that alarm processing is supported along multiple dimensions; the alarm agent can run in two different modes, service or container, to suit various alarm processing scenarios; the alarm processing components can selectively run in the management-side cluster or in a working cluster, which supports alarm processing in multi-cluster management scenarios, supports heterogeneous management of the alarm components, and realizes distributed alarm processing.
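How an agent might turn each consumer type into a concrete action can be sketched as below. This is an assumption-laden illustration, not the agent's real logic: the function name and the returned dictionary layout are hypothetical, using sh and python3 as interpreters is an assumption, and so is the choice of POST for the api type, since the text does not state an HTTP method.

```python
import shlex


def plan_execution(consumer: dict) -> dict:
    """Map an alarm consumer to the action an agent would carry out."""
    kind = consumer["type"]
    if kind == "shell":
        # Run the inline script through a POSIX shell (assumed interpreter).
        return {"action": "run", "argv": ["sh", "-c", consumer["command"]]}
    if kind == "python":
        # Run the inline script through python3 (assumed interpreter).
        return {"action": "run", "argv": ["python3", "-c", consumer["command"]]}
    if kind == "exec":
        # Split the executable statement into program and arguments.
        return {"action": "run", "argv": shlex.split(consumer["command"])}
    if kind == "api":
        # HTTP method is an assumption; the text only names the url field.
        return {"action": "http", "method": "POST", "url": consumer["url"]}
    if kind == "job":
        # The processing logic lives inside the declared container image.
        return {"action": "create_job", "image": consumer["image"]}
    raise ValueError(f"unsupported consumer type: {kind}")
```

Returning a plan instead of executing it keeps the type dispatch separate from side effects; the argv lists are in the form accepted by subprocess.run.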

Alarm processing part:

the alarm processing server runs in a Kubernetes management-side cluster, centrally receives all alarm information pushed by alarm sources, records the pending alarm processing tasks, stores the alarm information in a database, and provides RESTful APIs, for example for querying alarm processing results;

the alarm processing distributor runs in a Kubernetes working cluster and is responsible for distributing an alarm processing task to a specific node;

The method also provides alarm processing result query, can query the alarm to-be-processed task and the alarm processing result in real time, and is convenient for users to know the alarm processing condition.
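The record-then-query behaviour described for the alarm processing server can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the text does not name the database, and the table name, column names and status values (pending, processed) are hypothetical.

```python
import sqlite3
from typing import List, Optional

COLUMNS = ("id", "alarm_name", "node", "status", "result")


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the task store and create the (assumed) task table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS alarm_task (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               alarm_name TEXT NOT NULL,
               node TEXT,
               status TEXT NOT NULL DEFAULT 'pending',
               result TEXT)"""
    )
    return conn


def record_alarm(conn: sqlite3.Connection, alarm_name: str, node: str) -> int:
    """Record a received alarm as a pending task and return its id."""
    cur = conn.execute(
        "INSERT INTO alarm_task (alarm_name, node) VALUES (?, ?)",
        (alarm_name, node),
    )
    conn.commit()
    return cur.lastrowid


def store_result(conn: sqlite3.Connection, task_id: int, result: str) -> None:
    """Save the result returned by the agent and mark the task processed."""
    conn.execute(
        "UPDATE alarm_task SET status = 'processed', result = ? WHERE id = ?",
        (result, task_id),
    )
    conn.commit()


def query_tasks(conn: sqlite3.Connection,
                status: Optional[str] = None) -> List[dict]:
    """Return tasks as dictionaries, optionally filtered by status."""
    sql = "SELECT id, alarm_name, node, status, result FROM alarm_task"
    params: tuple = ()
    if status is not None:
        sql += " WHERE status = ?"
        params = (status,)
    sql += " ORDER BY id"
    return [dict(zip(COLUMNS, row)) for row in conn.execute(sql, params)]
```

A query endpoint of a RESTful API or a web front end could both be thin wrappers around a function like query_tasks, which is one way to serve the two query paths the method mentions from the same stored data.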

As shown in fig. 2, the alarm processing steps are as follows:

1) deploying the alarm subscription resource:

2) receiving an alarm:

3) matching the alarm:

4) distributing alarm processing information:

b) for alarm consumers that do not specify a node (any), the alarm processing logic and the program are distributed to a randomly chosen node for execution;

5) processing the alarm:

The method extends the ways in which alarm processing can be queried: a user can query the alarm processing result through the API of the alarm processing system or in real time through a front-end web interface. Fig. 3 is a diagram illustrating an example of the query alarm processing result.

The alarm subscription controller is implemented based on the Kubernetes Operator pattern. It defines alarm subscription resources declaratively, declares in those resources the alarms to be processed and the alarm consumers, and defines the alarm processing logic in the alarm consumers. The alarm controller supports creating, deleting, modifying and querying the alarm subscription resources and is highly extensible; the alarm subscription resources can synchronize their processing logic in real time according to the needs of the alarm scenario, can be modified at any time, and support fault tolerance. The alarm processing part comprises an alarm processing server, an alarm processing distributor and an alarm processing agent. After the alarm processing server receives alarm information, the alarm subscription resource corresponding to the alarm is matched to obtain the corresponding alarm consumer; the alarm processing distributor then distributes the alarm processing information to the alarm processing agent on the corresponding node, which executes the specific alarm processing logic, and the alarm processing result is stored and can be checked in real time.

The method is operated in a cloud native environment, the alarm subscription resources are defined in a declarative mode, and the expansion of the alarm subscription resources is supported. The method can accurately filter important alarm information from massive alarms, accurately execute alarm processing logic, and has the advantages of quick alarm response, timely alarm processing and high alarm processing efficiency.

The embodiment of the invention also provides a system for realizing distributed automatic alarm processing by the cloud computing platform, which comprises an alarm subscription module and an alarm processing module,

the system realizes the method for realizing distributed automatic alarm processing by the cloud computing platform in the embodiment of the invention.

The system is combined with cloud computing technology and solves operation and maintenance problems in various scenarios, including but not limited to scenarios that provide underlying services such as public clouds, private clouds and hybrid clouds; alarms are processed automatically, which reduces manual errors and improves operation and maintenance accuracy; massive alarm information is processed, which improves efficiency; labor is freed up and production cost is reduced; fault tolerance is supported, and extensibility is strong.

The embodiment of the invention also provides a device for realizing distributed automatic alarm processing by a cloud computing platform, which comprises: at least one memory and at least one processor;

the at least one memory to store a machine readable program;

the at least one processor is configured to invoke the machine-readable program to execute the method for implementing distributed automatic alarm processing by the cloud computing platform according to the foregoing embodiment of the present invention.

The embodiment of the present invention further provides a computer readable medium, where a computer instruction is stored on the computer readable medium, and when the computer instruction is executed by a processor, the processor is enabled to execute the method for implementing distributed automatic alarm processing by a cloud computing platform according to the above embodiment of the present invention. Specifically, a system or an apparatus equipped with a storage medium on which software program codes that realize the functions of any of the above-described embodiments are stored may be provided, and a computer (or a CPU or MPU) of the system or the apparatus is caused to read out and execute the program codes stored in the storage medium.

In this case, the program code itself read from the storage medium can realize the functions of any of the above-described embodiments, and thus the program code and the storage medium storing the program code constitute a part of the present invention.

Examples of the storage medium for supplying the program code include a floppy disk, a hard disk, a magneto-optical disk, an optical disk (e.g., CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, DVD + RW), a magnetic tape, a nonvolatile memory card, and a ROM. Alternatively, the program code may be downloaded from a server computer via a communications network.

Further, it should be clear that the functions of any one of the above-described embodiments may be implemented not only by executing the program code read out by the computer, but also by causing an operating system or the like operating on the computer to perform a part or all of the actual operations based on instructions of the program code.

Further, it is to be understood that the program code read out from the storage medium is written to a memory provided in an expansion board inserted into the computer or to a memory provided in an expansion unit connected to the computer, and then causes a CPU or the like mounted on the expansion board or the expansion unit to perform part or all of the actual operations based on instructions of the program code, thereby realizing the functions of any of the above-described embodiments.

While the invention has been shown and described in detail in the drawings and in the preferred embodiments, it is not intended to limit the invention to the embodiments disclosed, and it will be apparent to those skilled in the art that various combinations of the code auditing means in the various embodiments described above may be used to obtain further embodiments of the invention, which are also within the scope of the invention.

Claims

1. A method for realizing distributed automatic alarm processing by a cloud computing platform is characterized by being realized based on a cloud native technology and comprising an alarm subscription part and an alarm processing part,

2. The method for realizing distributed automatic alarm processing by a cloud computing platform according to claim 1, wherein the alarm subscription part realizes an alarm subscription resource controller based on a Kubernetes Operator, and defines the alarm subscription resource; comprises that

3. The method for implementing distributed automatic alarm processing by the cloud computing platform according to claim 1 or 2, wherein the alarm information to be processed includes an alarm name, a description of the alarm, a mode of obtaining a node where the alarm is located, a resource version number, and an alarm consumer.

4. The method of claim 3, wherein the alarm processing logic is defined in a consumer, and the alarm consumer mainly comprises the following alarm processing information:

wherein shell uses a shell script to process the alarm; python uses a python script to process the alarm; api calls an implemented API interface to process the alarm; job uses a Kubernetes Job resource to process the alarm, and the container image required by the Job is also defined in the alarm consumer; exec runs an executable binary to process the alarm;

2) runtime: the run mode of the invoked alarm agent; the run modes supported by the alarm agent are service and container. service means that, during alarm processing, the invoked alarm agent runs as a service on each node of the Kubernetes cluster; container means that an alarm agent is started in a container by creating a Kubernetes Job resource, and the container and the Job are cleaned up automatically after alarm processing completes;

5. The method for realizing distributed automatic alarm processing by the cloud computing platform according to claim 1 or 2, wherein the alarm processing server runs in a Kubernetes management-side cluster, centrally receives all alarm information pushed by alarm sources, records the pending alarm processing tasks, stores the alarm information in a database, and provides RESTful APIs;

the alarm agent runs on each working node of the Kubernetes working cluster as a systemctl-managed service, and if the runtime defined by the alarm consumer matched with the alarm is of the service type, the alarm agent service on the node is called to process the alarm; if the runtime defined by the alarm consumer matched with the alarm is of the container type, a pod of an alarm agent is started as a Kubernetes Job to process the alarm.

6. The method for realizing distributed automatic alarm processing by the cloud computing platform according to claim 5, wherein the alarm processing steps are as follows:

1) deploying the alarm subscription resource:

2) receiving an alarm:

3) matching the alarm:

the alarm processing distributor polls the alarm processing server for alarm information; once alarm information is obtained, it matches the alarm subscription resources, selects the alarm subscription resources associated with the alarm, and obtains the alarm consumer information from the matched alarm subscription resources;

4) distributing alarm processing information:

a) for alarm consumers processed on a specified node, the alarm processing logic and the program are distributed to the specified node for execution;

b) for alarm consumers that do not specify a node, the alarm processing logic and the program are distributed to a randomly chosen node for execution;

c) for alarm consumers processed on all nodes, the alarm processing logic and the program are distributed to all working nodes for execution;

5) processing the alarm:

c) for an api type alarm consumer, calling a designated api interface to process an alarm;

7. The method for realizing distributed automatic alarm processing by the cloud computing platform according to claim 1 or 2, wherein the method further provides alarm processing result query, so that pending alarm processing tasks and alarm processing results can be queried in real time; the method extends the ways in which alarm processing can be queried, and a user can query the alarm processing result through the API of the alarm processing system or in real time through a front-end web interface.

8. A system for realizing distributed automatic alarm processing by a cloud computing platform is characterized by comprising an alarm subscription module and an alarm processing module,

the system realizes the method for automatically processing the alarm in a distributed mode by the cloud computing platform according to any one of claims 1 to 7.

9. An apparatus for realizing distributed automatic alarm processing by a cloud computing platform is characterized by comprising: at least one memory and at least one processor;

the at least one memory to store a machine readable program;

the at least one processor is used for calling the machine readable program and executing the method for realizing the distributed automatic alarm processing by the cloud computing platform according to any one of claims 1 to 7.

10. A computer readable medium having stored thereon computer instructions, which when executed by a processor, cause the processor to perform the method of the cloud computing platform of any of claims 1 to 7 for distributed automated processing of alerts.