CN114398221A - Operation and maintenance processing method and device for container cloud platform and processor - Google Patents

Publication number
CN114398221A
Authority
CN
China
Prior art keywords
fault
cloud platform
container cloud
monitoring data
processing method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111537711.9A
Other languages
Chinese (zh)
Inventor
吕红垒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Bank Corp
Original Assignee
China Construction Bank Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Bank Corp filed Critical China Construction Bank Corp
Priority to CN202111537711.9A priority Critical patent/CN114398221A/en
Publication of CN114398221A publication Critical patent/CN114398221A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3089 Monitoring arrangements determined by the means or processing involved in sensing the monitored data, e.g. interfaces, connectors, sensors, probes, agents
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3065 Monitoring arrangements determined by the means or processing involved in reporting the monitored data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G06F2009/45591 Monitoring or debugging support

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Debugging And Monitoring (AREA)

Abstract

Embodiments of the present application provide an operation and maintenance processing method, device, processor and storage medium for a container cloud platform. The method comprises: acquiring monitoring data and alarm data of the container cloud platform; transmitting the monitoring data and the alarm data to a fault handling engine; analyzing the monitoring data and the alarm data through the fault handling engine to determine the rule plug-in that performs the corresponding processing; and calling the execution engine corresponding to the rule plug-in, so that the execution engine processes, as a Pipeline, the fault corresponding to the monitoring data and the alarm data. With this technical solution, faults can be discovered promptly and effectively and analyzed in a targeted manner, fault handling is automated, and the efficiency and accuracy of fault handling are thereby improved. In addition, because fault handling is extended through plug-ins, a fault that recurs later can be resolved quickly and effectively.

Description

Operation and maintenance processing method and device for container cloud platform and processor
Technical Field
The present application relates to the field of platform operation and maintenance, and in particular to an operation and maintenance processing method, device, processor, machine-readable storage medium and computer program product for a container cloud platform.
Background
Because a complete container cloud platform is widely used and highly complex, a failure of the platform causes numerous system and application problems, the root cause of the failure is difficult to determine accurately, and a remedy cannot be provided in time.
In a traditional data center today, the container cloud platform is mainly operated and maintained as follows: operation and maintenance personnel handle faults of the container cloud platform according to alarm prompts sent by the monitoring system or based on their own findings. However, first-line operation and maintenance personnel have limited problem-solving ability and need second-line operation and maintenance personnel or developers to intervene, so execution efficiency is low and problems cannot be solved in time. Moreover, because fault analysis is performed manually, its accuracy is low and it is not timely. No intelligent fault handling solution is formed after a fault is resolved, so the same problem must be analyzed and handled again when it recurs, which incurs a high time cost.
Disclosure of Invention
An object of the embodiment of the present application is to provide an operation and maintenance processing method, an apparatus, a processor, a machine-readable storage medium and a computer program product for a container cloud platform.
In order to achieve the above object, a first aspect of the present application provides an operation and maintenance processing method for a container cloud platform, including: acquiring monitoring data and alarm data of the container cloud platform; transmitting the monitoring data and the alarm data to a fault handling engine; analyzing the monitoring data and the alarm data through the fault handling engine to determine the rule plug-in that performs the corresponding processing; and calling the execution engine corresponding to the rule plug-in, so that the execution engine processes, as a Pipeline, the fault corresponding to the monitoring data and the alarm data, wherein the Pipeline comprises a plurality of tasks that are executed in parallel or in series.
In an embodiment of the present application, acquiring the monitoring data and the alarm data of the container cloud platform includes: acquiring monitoring data collected by Prometheus; and acquiring alarm data collected through alarm rules defined by the AlertManager.
In an embodiment of the present application, analyzing the monitoring data and the alarm data through the fault handling engine to determine the rule plug-in that performs the corresponding processing includes: analyzing the monitoring data and the alarm data through the fault handling engine to determine the event characteristics of the fault to which the monitoring data and the alarm data correspond; and traversing all plug-ins in a chain-of-responsibility manner according to the event characteristics to determine the rule plug-in that handles the fault, wherein each plug-in corresponds to one fault and its solution.
In an embodiment of the present application, the operation and maintenance processing method further includes: acquiring and storing the execution result of the Pipeline, and recording the processing information for the fault.
In an embodiment of the present application, each task comprises a plurality of execution steps, and the execution steps are executed serially.
In an embodiment of the present application, the operation and maintenance processing method further includes: while the fault is being handled, sending a local command to a Node agent, wherein the Node agent includes a Node; and executing the local command through the Node.
In an embodiment of the present application, the operation and maintenance processing method further includes: when the Node cannot be logged in to or there is no permission to call the Node, calling a proxy interface to send the local command to the Node agent, so that the local command is executed through the Node.
In the embodiment of the application, each plug-in has a corresponding container image, script and configuration file.
A second aspect of the present application provides a processor configured to execute the operation and maintenance processing method for a container cloud platform.
The third aspect of the application provides an operation and maintenance processing device for a container cloud platform, which comprises the processor.
A fourth aspect of the present application provides a machine-readable storage medium having stored thereon instructions, which when executed by a processor, cause the processor to be configured to perform the above-mentioned operation and maintenance processing method for a container cloud platform.
A fifth aspect of the present application provides a computer program product comprising a computer program which, when executed by a processor, implements any one of the above-described operation and maintenance processing methods for a container cloud platform.
With this technical solution, faults can be discovered promptly and effectively and analyzed in a targeted manner, fault handling is automated, and the efficiency and accuracy of fault handling are thereby improved. In addition, because fault handling is extended through plug-ins, a fault that recurs later can be resolved quickly and effectively.
Additional features and advantages of embodiments of the present application will be described in detail in the detailed description which follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the embodiments of the disclosure, but are not intended to limit the embodiments of the disclosure. In the drawings:
Fig. 1 schematically shows a flowchart of an operation and maintenance processing method for a container cloud platform according to an embodiment of the present application;
Fig. 2 schematically shows a schematic diagram of an operation and maintenance processing method for a container cloud platform according to an embodiment of the present application;
Fig. 3 schematically shows an internal structure diagram of a computer device according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it should be understood that the specific embodiments described herein are only used for illustrating and explaining the embodiments of the present application and are not used for limiting the embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 schematically shows a flowchart of an operation and maintenance processing method for a container cloud platform according to an embodiment of the present application. As shown in fig. 1, in an embodiment of the present application, there is provided an operation and maintenance processing method for a container cloud platform, including the following steps:
Step 101, acquiring monitoring data and alarm data of a container cloud platform.
Step 102, transmitting the monitoring data and the alarm data to a fault handling engine.
Step 103, analyzing the monitoring data and the alarm data through the fault handling engine to determine the rule plug-in that performs the corresponding processing.
Step 104, calling the execution engine corresponding to the rule plug-in, so that the execution engine processes, as a Pipeline, the fault corresponding to the monitoring data and the alarm data, wherein the Pipeline comprises a plurality of tasks that are executed in parallel or in series.
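The four steps can be read as a small orchestration function. The following is only a minimal sketch under stated assumptions: `run_flow` and the three `fake_*` stand-ins are hypothetical names that do not appear in the application, and the real data source, fault handling engine and execution engine are replaced by plain functions.

```python
from typing import Any, Callable, Dict, List

Data = Dict[str, Any]


def run_flow(
    acquire: Callable[[], Data],                # step 101
    select_plugin: Callable[[Data], str],       # steps 102 and 103
    execute: Callable[[str, Data], List[str]],  # step 104
) -> List[str]:
    """Run steps 101 to 104 once and return what the execution engine reports."""
    data = acquire()
    plugin_name = select_plugin(data)
    return execute(plugin_name, data)


# Hypothetical stand-ins for the data source, the fault handling engine
# and the execution engine; none of these names come from the application.
def fake_acquire() -> Data:
    return {"target": "container", "memory_utilization": 0.95}


def fake_select(data: Data) -> str:
    return "restart-container" if data["memory_utilization"] > 0.9 else "noop"


def fake_execute(plugin_name: str, data: Data) -> List[str]:
    return [f"{plugin_name} on {data['target']}"]


result = run_flow(fake_acquire, fake_select, fake_execute)
```

Passing the stages in as callables keeps the sketch independent of any particular monitoring system or workflow engine.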
The container cloud platform may refer to a container cluster management platform implemented with Kubernetes, which can provide resource management, cluster scheduling, elastic scaling, service discovery, and the like. Kubernetes may refer to a portable, extensible, open source platform. When handling an operation and maintenance fault of the container cloud platform, the processor may first acquire the monitoring data and alarm data of the container cloud platform. The monitoring data may include resource information, system information, service information, and the like of the container cloud platform. The alarm data may include the alarm target, memory utilization, and CPU utilization. The alarm target may be a virtual machine, a container, or the like. Pipeline may refer to a fault handling pipeline, which may include multiple tasks that are executed in parallel or in series.
In one embodiment, acquiring the monitoring data and the alarm data of the container cloud platform comprises: acquiring monitoring data collected by Prometheus; and acquiring alarm data collected through alarm rules defined by the AlertManager.
Specifically, the processor may acquire monitoring data collected by Prometheus. Prometheus may refer to an open source monitoring system commonly used with container cloud platforms, which can periodically collect monitoring metric data from monitoring targets according to configured tasks. Prometheus can store the continuously collected monitoring data as time series along the time dimension. The processor may acquire alarm data collected through alarm rules defined by the AlertManager. The AlertManager may refer to an alert notification system that supports alert prompts over various communication channels such as e-mail and short message. The AlertManager may manage alarm data in groups so as to send different types of alarm information to different users or systems. The alarm rules defined by the AlertManager can be customized according to actual requirements. For example, an alarm rule of the AlertManager may be to issue an alert if the memory utilization in the alarm data is greater than 90% and/or the CPU utilization is greater than 90%.
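The example rule can be written out as a predicate. This is a plain-Python stand-in, not Prometheus or AlertManager rule syntax; `Sample` and `should_alert` are hypothetical names, and only the 90% threshold comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One observation of an alarm target (hypothetical structure)."""
    target: str
    memory_utilization: float  # fraction between 0 and 1
    cpu_utilization: float     # fraction between 0 and 1


THRESHOLD = 0.90  # the 90% figure used in the example rule


def should_alert(sample: Sample, threshold: float = THRESHOLD) -> bool:
    """Alert when memory and/or CPU utilization is strictly above the threshold."""
    return (
        sample.memory_utilization > threshold
        or sample.cpu_utilization > threshold
    )
```

The comparison is strict, so a target sitting at exactly 90% does not trigger the alert.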
After acquiring the monitoring data and the alarm data of the container cloud platform, the processor may transmit them to the fault handling engine. Specifically, Prometheus exposes an application program interface, and the processor may transmit the monitoring data to the fault handling engine through that interface. The AlertManager supports a customized notification interface, and the processor may transmit the alarm data to the fault handling engine through that customized notification interface.
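One way to realize the customized notification interface is an Alertmanager webhook receiver, which posts a JSON body containing an `alerts` array whose entries carry `status`, `labels` and `annotations`. The sketch below assumes that mechanism (the application does not name it) and shows only the parsing side; `extract_firing_alerts` and the sample label values are hypothetical.

```python
import json
from typing import Dict, List


def extract_firing_alerts(body: str) -> List[Dict[str, str]]:
    """Reduce a webhook body to the fields a fault handling engine might need."""
    payload = json.loads(body)
    events = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved alerts need no fault handling
        labels = alert.get("labels", {})
        events.append({
            "alertname": labels.get("alertname", ""),
            "target": labels.get("instance", ""),
        })
    return events


# Hypothetical sample body with one firing and one resolved alert.
sample_body = json.dumps({
    "status": "firing",
    "alerts": [
        {"status": "firing",
         "labels": {"alertname": "HighMemory", "instance": "vm-1"}},
        {"status": "resolved",
         "labels": {"alertname": "HighCPU", "instance": "vm-2"}},
    ],
})

events = extract_firing_alerts(sample_body)
```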
Once the monitoring data and the alarm data have been transmitted to the fault handling engine, the processor may analyze them through the fault handling engine to determine the rule plug-in that performs the corresponding processing. In one embodiment, analyzing the monitoring data and the alarm data through the fault handling engine to determine the rule plug-in that performs the corresponding processing comprises: analyzing the monitoring data and the alarm data through the fault handling engine to determine the event characteristics of the fault to which the monitoring data and the alarm data correspond; and traversing all plug-ins in a chain-of-responsibility manner according to the event characteristics to determine the rule plug-in that handles the fault, wherein each plug-in corresponds to one fault and its solution.
The processor may analyze the monitoring data and the alarm data through the fault handling engine to determine the event characteristics of the corresponding fault. Event characteristics may refer to the alarm target and the condition information that triggered the alarm. The alarm target may be a virtual machine, a container, or the like. The condition information that triggered the alarm may be, for example, that the memory utilization of the container cloud platform is greater than 90% and/or the CPU utilization is greater than 90%.
The fault handling engine may contain a number of rule plug-ins that can handle faults. A rule plug-in defines a fault and the tasks to be performed to resolve it; that is, each plug-in corresponds to one fault and its solution. In one embodiment, each plug-in has a corresponding container image, script, and configuration file, which define what the plug-in needs to do at each step of performing a task. Because each plug-in defines a different fault and a different solution, once the event characteristics of the fault have been determined, the processor can traverse all plug-ins in a chain-of-responsibility manner according to the event characteristics to determine the rule plug-in that handles the fault. Traversing all plug-ins in a chain-of-responsibility manner means comparing, in a preset order, the event characteristics of the current fault with the event characteristics of the fault defined by each plug-in. When the current plug-in can resolve the fault, that is, when it is confirmed to meet the processing requirement, it is determined to be the rule plug-in that handles the fault.
For example, suppose the fault handling engine includes plug-ins A, B and C. The event characteristics of the fault defined by plug-in A are that the memory utilization of a virtual machine is 92% and its CPU utilization is 93%. The event characteristics of the fault defined by plug-in B are that the memory utilization of a virtual machine is 92% and its CPU utilization is 95%. The event characteristics of the fault defined by plug-in C are that the memory utilization of a container is 92% and its CPU utilization is 93%. The event characteristics of the current fault are that the memory utilization of a virtual machine is 92% and its CPU utilization is 93%. Thus, when the processor traverses plug-ins A, B and C in a chain-of-responsibility manner according to the event characteristics, the rule plug-in that handles the fault is determined to be A.
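The A, B, C example can be traced with a small chain-of-responsibility sketch. It assumes, as the example does, that a plug-in matches when its defined event characteristics equal those of the current fault; a real plug-in might instead match on ranges or thresholds. `EventFeatures`, `RulePlugin` and `build_chain` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class EventFeatures:
    target: str      # alarm target, e.g. "virtual_machine" or "container"
    memory_pct: int  # memory utilization in percent
    cpu_pct: int     # CPU utilization in percent


@dataclass
class RulePlugin:
    name: str
    handles: EventFeatures
    next_plugin: Optional["RulePlugin"] = None

    def resolve(self, event: EventFeatures) -> Optional[str]:
        """Handle the event here, or pass it to the next plug-in in the chain."""
        if event == self.handles:
            return self.name
        if self.next_plugin is not None:
            return self.next_plugin.resolve(event)
        return None  # no plug-in in the chain accepts this event


def build_chain(plugins: List[RulePlugin]) -> RulePlugin:
    """Link the plug-ins in the preset order and return the head of the chain."""
    for current, following in zip(plugins, plugins[1:]):
        current.next_plugin = following
    return plugins[0]


chain = build_chain([
    RulePlugin("A", EventFeatures("virtual_machine", 92, 93)),
    RulePlugin("B", EventFeatures("virtual_machine", 92, 95)),
    RulePlugin("C", EventFeatures("container", 92, 93)),
])
selected = chain.resolve(EventFeatures("virtual_machine", 92, 93))
```

Adding support for a new fault then amounts to appending one more plug-in to the list, without touching the existing ones.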
Once the rule plug-in that performs the corresponding processing has been determined, the processor can call the execution engine corresponding to that rule plug-in, so that the execution engine processes, as a Pipeline, the fault corresponding to the monitoring data and the alarm data. The execution engine may be Argo. Argo may refer to a Kubernetes-based, cloud-native pipeline open source project, which provides container-native workflows for Kubernetes and implements each step in a workflow as a container. Pipeline may refer to a fault handling pipeline, which may include multiple tasks that are executed in parallel or in series. For example, when preconditions are checked, the tasks can be executed in parallel to improve checking efficiency. In one embodiment, each task includes multiple execution steps, and the execution steps are executed serially. That is, each task in the Pipeline may consist of multiple execution steps, and the steps are performed serially to ensure that the logic within the same execution unit runs in order. Handling faults through a Pipeline composed of tasks and execution steps makes it easier to extend and reuse tasks and execution steps, so that later Pipelines can reuse earlier tasks or execution steps.
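The task and step semantics described above can be sketched without Argo. This is only an illustration using Python threads rather than containers; `run_task`, `run_stage`, `make_step` and the step names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

Step = Callable[[], str]


def run_task(steps: List[Step]) -> List[str]:
    """Run the execution steps of one task strictly in order."""
    return [step() for step in steps]


def run_stage(tasks: List[List[Step]], parallel: bool) -> List[List[str]]:
    """Run a non-empty group of tasks in parallel or serially, keeping task order."""
    if parallel:
        with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
            return list(pool.map(run_task, tasks))
    return [run_task(task) for task in tasks]


def make_step(name: str) -> Step:
    """Hypothetical step that only reports its own name."""
    return lambda: name


# Precondition checks run in parallel; the repair task runs its steps serially.
precheck = [[make_step("check-disk")], [make_step("check-network")]]
repair = [[make_step("drain"), make_step("restart"), make_step("verify")]]

results = run_stage(precheck, parallel=True) + run_stage(repair, parallel=False)
```

Because `pool.map` returns results in input order, the output is deterministic even though the precondition checks run concurrently.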
In one embodiment, the operation and maintenance processing method further includes: acquiring and storing the execution result of the Pipeline, and recording the processing information for the fault.
After the execution engine has processed, as a Pipeline, the fault corresponding to the monitoring data and the alarm data, the processor can acquire and store the execution result of the Pipeline and record the processing information for the fault, so that fault handling can be improved later.
In one embodiment, the operation and maintenance processing method further includes: while the fault is being handled, sending a local command to a Node agent, wherein the Node agent includes a Node; and executing the local command through the Node.
If the Node on the Node agent can be logged in to or called at that time, the processor can execute the local command through the Node. The local command may be an ip command, an lvm command, a docker command, or the like. The Node agent may refer to an agent for nodes. The Node agent may include Nodes; all Nodes can be operated on through a Kubernetes resource controller, and fault information on the Nodes can be collected. The Node agent can also receive tasks dispatched through its interface, so that this component provides the fault repair function for the Node.
In one embodiment, the operation and maintenance processing method further includes: when the Node cannot be logged in to or there is no permission to call the Node, calling a proxy interface to send the local command to the Node agent, so that the local command is executed through the Node.
If a local command needs to be executed while the fault is being handled, but the Node on the Node agent cannot be logged in to or there is no permission to call it, the processor can call the proxy interface to send the local command to the Node agent. The Node agent can then encapsulate the local command and trigger its Node, so that the local command is executed through the Node. The proxy interface may refer to the application program interface of the Node agent.
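The two paths for running a local command, directly on the Node or through the proxy interface of the Node agent, can be sketched as a fallback. Everything here is hypothetical: `NodeUnreachable`, `run_local_command` and both runner functions stand in for mechanisms the application does not specify.

```python
from typing import Callable, List

Runner = Callable[[List[str]], str]


class NodeUnreachable(Exception):
    """Hypothetical error: the Node cannot be logged in to or called."""


def run_local_command(command: List[str], direct: Runner, via_proxy: Runner) -> str:
    """Try the Node directly; fall back to the Node agent's proxy interface."""
    try:
        return direct(command)
    except NodeUnreachable:
        return via_proxy(command)


# Stand-ins: direct access is denied, the agent proxy accepts the command.
def denied_login(command: List[str]) -> str:
    raise NodeUnreachable("no permission to call the Node")


def agent_proxy(command: List[str]) -> str:
    return "agent ran: " + " ".join(command)


output = run_local_command(["ip", "addr", "show"], denied_login, agent_proxy)
```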
In one embodiment, as shown in FIG. 2, a schematic illustration of an operation and maintenance processing method for a container cloud platform is provided.
Prometheus can store the continuously collected monitoring data as time series along the time dimension. The AlertManager can collect alarm data according to the defined alarm rules. Prometheus and the AlertManager may transmit the collected monitoring data and alarm data to the RuleEngine, i.e., the fault handling engine. The RuleEngine may contain a plurality of rule plug-ins, namely RulePlugin 1 through RulePlugin n, as shown in Fig. 2. The Pipeline job is the execution process of a Pipeline task.
The RuleEngine may analyze the monitoring data and the alarm data to determine the event characteristics of the corresponding fault. Once the event characteristics of the fault have been determined, the RuleEngine can traverse all plug-ins in a chain-of-responsibility manner according to the event characteristics to determine the rule plug-in that handles the fault. After the rule plug-in is determined, it initiates a fault handling Pipeline in the form of a CRD, that is, the Pipeline shown in Fig. 2. The Pipeline may include multiple tasks that execute in parallel or in series. Each task includes multiple steps, which may be performed serially. Once the Pipeline has been initiated, the fault can be handled by calling the Argo instance corresponding to the rule plug-in. Argo may refer to the execution engine.
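If the Pipeline CRD is taken to be an Argo Workflows `Workflow` resource (an assumption; the application only says "in the form of a CRD"), a rule plug-in could assemble a manifest like the one built below. In Argo's `steps` syntax the outer list runs group after group and the entries inside one group run in parallel. The function name, image name and commands are hypothetical.

```python
from typing import Any, Dict, List


def build_workflow(name: str, image: str, stages: List[List[str]]) -> Dict[str, Any]:
    """Build a Workflow manifest as a plain dict.

    `stages` is a list of groups: groups run one after another, and the
    shell commands inside one group run in parallel.
    """
    steps = [
        [
            {
                "name": f"s{i}-{j}",
                "template": "run",
                "arguments": {"parameters": [{"name": "cmd", "value": cmd}]},
            }
            for j, cmd in enumerate(group)
        ]
        for i, group in enumerate(stages)
    ]
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": f"{name}-"},
        "spec": {
            "entrypoint": "main",
            "templates": [
                {"name": "main", "steps": steps},
                {
                    "name": "run",
                    "inputs": {"parameters": [{"name": "cmd"}]},
                    "container": {
                        "image": image,  # hypothetical plug-in image
                        "command": ["sh", "-c"],
                        "args": ["{{inputs.parameters.cmd}}"],
                    },
                },
            ],
        },
    }


manifest = build_workflow(
    "fix-high-memory",
    "example.local/rule-plugin-a:latest",
    [["check-disk", "check-network"], ["restart-service"]],
)
```

The dict can be serialized to YAML or JSON and submitted to the cluster by whatever client the platform already uses; that submission step is left out of the sketch.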
The Node agent shown in Fig. 2 may include Nodes. When a local command needs to be executed while Argo is being called to handle the fault, but the Node on the Node agent cannot be logged in to or there is no permission to call it, Argo can first call the API of the Node agent to send the local command to the Node agent. The Node agent can then encapsulate the local command and trigger its Node, so that the local command is executed through the Node.
With this technical solution, faults can be discovered promptly and effectively and analyzed in a targeted manner, fault handling is automated, and the efficiency and accuracy of fault handling are thereby improved. In addition, because fault handling is extended through plug-ins, a fault that recurs later can be resolved quickly and effectively.
Fig. 1 is a schematic flowchart of an operation and maintenance processing method for a container cloud platform in one embodiment. It should be understood that, although the steps in the flowchart of Fig. 1 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the order of the steps is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in Fig. 1 may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
The embodiment of the application provides a processor, wherein the processor is used for running a program, and the operation and maintenance processing method for the container cloud platform is executed when the program runs.
An embodiment of the present application provides an operation and maintenance processing device for a container cloud platform, which comprises the processor. The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels may be provided, and the operation and maintenance processing method for the container cloud platform is implemented by adjusting kernel parameters.
The embodiment of the application provides a storage medium, wherein a program is stored on the storage medium, and when the program is executed by a processor, the operation and maintenance processing method for the container cloud platform is realized.
In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in Fig. 3. The computer device includes a processor a01, a network interface a02, a memory (not shown), and a database (not shown) connected by a system bus. The processor a01 of the computer device provides computing and control capabilities. The memory of the computer device comprises an internal memory a03 and a non-volatile storage medium a04. The non-volatile storage medium a04 stores an operating system B01, a computer program B02, and a database (not shown in the figure). The internal memory a03 provides an environment for running the operating system B01 and the computer program B02 stored in the non-volatile storage medium a04. The database of the computer device is used for storing data such as monitoring data and alarm data. The network interface a02 of the computer device is used for communicating with an external terminal through a network connection. The computer program B02, when executed by the processor a01, implements an operation and maintenance processing method for a container cloud platform.
Those skilled in the art will appreciate that the architecture shown in fig. 3 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
The embodiment of the application provides equipment, the equipment comprises a processor, a memory and a program which is stored on the memory and can run on the processor, and the following steps are realized when the processor executes the program: acquiring monitoring data and alarm data of a container cloud platform; transmitting the monitoring data and the alarm data to a fault processing engine; analyzing the monitoring data and the alarm data through a fault processing engine to determine rule plug-ins correspondingly processed; and calling an execution engine corresponding to the rule plug-in to process the fault problems corresponding to the monitoring data and the alarm data in a Pipeline mode through the execution engine, wherein the Pipeline comprises a plurality of tasks which are executed in parallel or in series.
In one embodiment, obtaining the monitoring data and the alarm data of the container cloud platform comprises: acquiring monitoring data acquired by Prometheus; and acquiring alarm data acquired through an alarm rule defined by the AlertManager.
In one embodiment, analyzing the monitoring data and the alarm data by the fault handling engine to determine rule plug-ins for corresponding processing comprises: analyzing the monitoring data and the alarm data through a fault processing engine to determine event characteristics corresponding to fault problems of the monitoring data and the alarm data; and traversing all plug-ins in a chain of responsibility mode according to the event characteristics to determine regular plug-ins for processing the fault problems, wherein each plug-in corresponds to one fault problem and a corresponding solution.
In one embodiment, the operation and maintenance processing method further includes: acquiring and storing an execution result of the Pipeline, and recording processing information for the fault problem.
In one embodiment, each task includes multiple execution steps, and the execution steps within a task are executed serially.
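A minimal sketch of this execution model, assuming each execution step is a plain callable: tasks run either in parallel or in series, while the steps inside a task always run one after another. The names `Task` and `run_pipeline` and the use of a thread pool are assumptions made for illustration, not the execution engine of the application.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable

Step = Callable[[], str]  # an execution step returns a short result string


@dataclass
class Task:
    name: str
    steps: list[Step] = field(default_factory=list)

    def run(self) -> list[str]:
        # Steps inside one task are executed strictly in order.
        return [step() for step in self.steps]


def run_pipeline(tasks: list[Task], parallel: bool) -> dict[str, list[str]]:
    """Run all tasks in series or in parallel and collect results per task."""
    if not parallel:
        return {task.name: task.run() for task in tasks}
    with ThreadPoolExecutor(max_workers=max(1, len(tasks))) as pool:
        futures = {task.name: pool.submit(task.run) for task in tasks}
        return {name: future.result() for name, future in futures.items()}


tasks = [
    Task("cordon-node", [lambda: "cordoned", lambda: "drained"]),
    Task("collect-logs", [lambda: "logs collected"]),
]

serial_result = run_pipeline(tasks, parallel=False)
parallel_result = run_pipeline(tasks, parallel=True)
```

The result dictionary is keyed by task name, so the outcome is the same in both modes; only the scheduling of the tasks differs.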
In one embodiment, the operation and maintenance processing method further includes: in the process of processing the fault problem, sending a local command to a Node agent, wherein the Node agent comprises Nodes; and executing the local command through the Node.
In one embodiment, the operation and maintenance processing method further includes: when the Node cannot be logged in to, or there is no permission to call the Node, calling a proxy interface to send the local command to the Node agent, so that the local command is executed through the Node.
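The fallback described above can be sketched as follows. Everything here is hypothetical: `NodeAccessError`, `run_local_command` and the two callables stand in for a direct call to the Node and for the proxy interface of the Node agent, neither of which is specified in the application.

```python
from typing import Callable

Executor = Callable[[str, str], str]  # (node name, command) -> command output


class NodeAccessError(Exception):
    """Raised when a Node cannot be logged in to or the caller lacks permission."""


def run_local_command(node: str, command: str,
                      direct_exec: Executor, agent_proxy: Executor) -> str:
    """Try the Node directly; on an access failure, go through the agent proxy."""
    try:
        return direct_exec(node, command)
    except NodeAccessError:
        # Login failed or permission was denied: hand the command to the Node agent.
        return agent_proxy(node, command)


def failing_direct(node: str, command: str) -> str:
    raise NodeAccessError(f"cannot log in to {node}")


def fake_agent(node: str, command: str) -> str:
    return f"agent on {node} ran: {command}"


output = run_local_command("node-1", "systemctl restart kubelet",
                           failing_direct, fake_agent)
```

Only access failures trigger the fallback in this sketch; other errors raised by the direct call propagate to the caller unchanged.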
In one embodiment, each plug-in has a corresponding container image, script, and configuration file.
The present application further provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps: acquiring monitoring data and alarm data of a container cloud platform; transmitting the monitoring data and the alarm data to a fault processing engine; analyzing the monitoring data and the alarm data through the fault processing engine to determine the rule plug-in that handles them; and calling an execution engine corresponding to the rule plug-in so that the execution engine processes, as a Pipeline, the fault problem corresponding to the monitoring data and the alarm data, wherein the Pipeline comprises a plurality of tasks which are executed in parallel or in series.
In one embodiment, obtaining the monitoring data and the alarm data of the container cloud platform comprises: acquiring monitoring data collected by Prometheus; and acquiring alarm data generated through alarm rules defined by the AlertManager.
In one embodiment, analyzing the monitoring data and the alarm data through the fault processing engine to determine the rule plug-in that handles them comprises: analyzing the monitoring data and the alarm data through the fault processing engine to determine event characteristics of the fault problem corresponding to the monitoring data and the alarm data; and traversing all plug-ins in a chain-of-responsibility manner according to the event characteristics to determine the rule plug-in for processing the fault problem, wherein each plug-in corresponds to one fault problem and a corresponding solution.
In one embodiment, the operation and maintenance processing method further includes: acquiring and storing an execution result of the Pipeline, and recording processing information for the fault problem.
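One possible way to store the execution result of a Pipeline together with the processing information is shown below. The application does not name a storage backend; SQLite, the table layout, and the functions `open_store` and `record_result` are assumptions chosen only to keep the sketch self-contained.

```python
import json
import sqlite3
import time


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the record store; defaults to an in-memory database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fault_records ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "fault TEXT NOT NULL, "
        "plugin TEXT NOT NULL, "
        "result TEXT NOT NULL, "
        "recorded_at REAL NOT NULL)"
    )
    return conn


def record_result(conn: sqlite3.Connection, fault: str,
                  plugin: str, result: dict) -> int:
    """Persist one Pipeline result as JSON and return the new record id."""
    cursor = conn.execute(
        "INSERT INTO fault_records (fault, plugin, result, recorded_at) "
        "VALUES (?, ?, ?, ?)",
        (fault, plugin, json.dumps(result), time.time()),
    )
    conn.commit()
    return cursor.lastrowid


conn = open_store()
record_id = record_result(conn, "NodeNotReady", "node-not-ready",
                          {"restart-kubelet": ["stopped", "started"]})
```

Keeping the fault name and plug-in name in their own columns makes it possible to look up how a given fault problem was handled in the past.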
In one embodiment, each task includes multiple execution steps, and the execution steps within a task are executed serially.
In one embodiment, the operation and maintenance processing method further includes: in the process of processing the fault problem, sending a local command to a Node agent, wherein the Node agent comprises Nodes; and executing the local command through the Node.
In one embodiment, the operation and maintenance processing method further includes: when the Node cannot be logged in to, or there is no permission to call the Node, calling a proxy interface to send the local command to the Node agent, so that the local command is executed through the Node.
In one embodiment, each plug-in has a corresponding container image, script, and configuration file.
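To make the three parts of a plug-in concrete, the sketch below models a plug-in as a record with a container image, a script and a configuration file, and checks that none of them is missing. The class `PluginBundle`, its field names and the example values are hypothetical; the application does not define a descriptor format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PluginBundle:
    """The artifacts that belong to one plug-in (hypothetical descriptor)."""
    name: str
    image: str        # container image reference
    script: str       # path of the repair script inside the image
    config_file: str  # path of the plug-in's configuration file

    def missing_parts(self) -> list[str]:
        """Return the names of required parts that are empty."""
        required = ("image", "script", "config_file")
        return [part for part in required if not getattr(self, part).strip()]


complete = PluginBundle(
    name="node-not-ready",
    image="registry.example.com/ops/node-not-ready:1.0",  # made-up reference
    script="/plugin/fix.sh",
    config_file="/plugin/config.yaml",
)
incomplete = PluginBundle(name="disk-pressure", image="", script="/plugin/fix.sh",
                          config_file="")
```

A check like `missing_parts()` could be run when a plug-in is registered, so that an incomplete plug-in is rejected before it is ever selected for a fault.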
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media, which include both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (12)

1. An operation and maintenance processing method for a container cloud platform is characterized by comprising the following steps:
acquiring monitoring data and alarm data of a container cloud platform;
transmitting the monitoring data and the alarm data to a fault processing engine;
analyzing the monitoring data and the alarm data through the fault processing engine to determine the rule plug-in that handles them;
and calling an execution engine corresponding to the rule plug-in so that the execution engine processes, as a Pipeline, the fault problem corresponding to the monitoring data and the alarm data, wherein the Pipeline comprises a plurality of tasks which are executed in parallel or in series.
2. The operation and maintenance processing method for the container cloud platform according to claim 1, wherein the acquiring of the monitoring data and the alarm data of the container cloud platform includes:
acquiring monitoring data collected by Prometheus;
and acquiring alarm data generated through alarm rules defined by the AlertManager.
3. The operation and maintenance processing method for the container cloud platform according to claim 1, wherein the analyzing the monitoring data and the alarm data through the fault processing engine to determine the rule plug-in that handles them comprises:
analyzing the monitoring data and the alarm data through the fault processing engine to determine event characteristics of the fault problem corresponding to the monitoring data and the alarm data;
and traversing all plug-ins in a chain-of-responsibility manner according to the event characteristics to determine the rule plug-in for processing the fault problem, wherein each plug-in corresponds to one fault problem and a corresponding solution.
4. The operation and maintenance processing method for the container cloud platform according to claim 1, further comprising:
and acquiring and storing the execution result of the Pipeline, and recording the processing information aiming at the fault problem.
5. The operation and maintenance processing method for the container cloud platform according to claim 1, wherein each task comprises a plurality of execution steps, and the execution steps are executed in a serial manner.
6. The operation and maintenance processing method for the container cloud platform according to claim 1, further comprising:
in the process of processing the fault problem, sending a local command to a Node agent, wherein the Node agent comprises Nodes;
and executing the local command through the Node.
7. The operation and maintenance processing method for the container cloud platform according to claim 6, further comprising:
and when the Node cannot be logged in to, or there is no permission to call the Node, calling a proxy interface to send the local command to the Node agent, so that the local command is executed through the Node.
8. The operation and maintenance processing method for the container cloud platform according to any one of claims 1 to 7, wherein each plug-in has a corresponding container image, script and configuration file.
9. A processor configured to perform the operation and maintenance processing method for the container cloud platform according to any one of claims 1 to 8.
10. An operation and maintenance processing device for a container cloud platform, characterized by comprising the processor according to claim 9.
11. A machine-readable storage medium having instructions stored thereon, which when executed by a processor, cause the processor to be configured to perform the operation and maintenance processing method for a container cloud platform according to any one of claims 1 to 8.
12. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the operation and maintenance processing method for a container cloud platform according to any one of claims 1 to 8.
CN202111537711.9A 2021-12-15 2021-12-15 Operation and maintenance processing method and device for container cloud platform and processor Pending CN114398221A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111537711.9A CN114398221A (en) 2021-12-15 2021-12-15 Operation and maintenance processing method and device for container cloud platform and processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111537711.9A CN114398221A (en) 2021-12-15 2021-12-15 Operation and maintenance processing method and device for container cloud platform and processor

Publications (1)

Publication Number Publication Date
CN114398221A true CN114398221A (en) 2022-04-26

Family

ID=81226988

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111537711.9A Pending CN114398221A (en) 2021-12-15 2021-12-15 Operation and maintenance processing method and device for container cloud platform and processor

Country Status (1)

Country Link
CN (1) CN114398221A (en)

Similar Documents

Publication Publication Date Title
CN110661659B (en) Alarm method, device and system and electronic equipment
CN108768728B (en) Operation and maintenance task processing method and device, computer equipment and storage medium
CN107016480B (en) Task scheduling method, device and system
US10489232B1 (en) Data center diagnostic information
US20190036798A1 (en) Method and apparatus for node processing in distributed system
CN111090502B (en) Stream data task scheduling method and device
US11546233B2 (en) Virtual network function bus-based auto-registration
CN111897705B (en) Service state processing and model training method, device, equipment and storage medium
CN112015618A (en) Abnormity warning method and device
US10970109B1 (en) System, method, and computer program for managing a plurality of heterogeneous software robots to automate business processes
CN111314137A (en) Information communication network automation operation and maintenance method, device, storage medium and processor
CN110858166A (en) Application exception processing method and device, storage medium and processor
CN113760634A (en) Data processing method and device
CN114153646A (en) Operation and maintenance fault handling method and device, storage medium and processor
US20170068706A1 (en) Event-stream searching using compiled rule patterns
CN116938720A (en) Service control method and device based on service cluster, electronic equipment and medium
CN114500249B (en) Root cause positioning method and device
CN114398221A (en) Operation and maintenance processing method and device for container cloud platform and processor
CN113824590B (en) Method for predicting problem in micro service network, computer device, and storage medium
CN115756888A (en) Data processing method, processor, device and storage medium
CN115373886A (en) Service group container shutdown method, device, computer equipment and storage medium
CN115033412A (en) Task log merging method and device
CN114048113A (en) Data center monitoring alarm fault self-healing method and device and computer equipment
CN114691395A (en) Fault processing method and device, electronic equipment and storage medium
CN109067611B (en) Method, device, storage medium and processor for detecting communication state between systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination