CN113553238A

CN113553238A - Cloud platform resource exception automatic processing system and method

Info

Publication number: CN113553238A
Application number: CN202110834688.3A
Authority: CN
Inventors: 宋洪圆; 蔡卫卫; 谢涛涛; 宋伟
Original assignee: Inspur Cloud Information Technology Co Ltd
Current assignee: Inspur Cloud Information Technology Co Ltd
Priority date: 2021-07-23
Filing date: 2021-07-23
Publication date: 2021-10-26
Anticipated expiration: 2041-07-23
Also published as: CN113553238B

Abstract

The invention discloses a system and a method for automatically handling cloud platform resource exceptions, belonging to the field of abnormal-resource query and automatic processing on cloud platforms. It addresses the technical problem of how to locate abnormal resources quickly and accurately, extract problem logs for analysis, and repair simple exceptions. In the technical scheme, the system uses a rule template customized through a rule module to query and filter abnormal log information in a log module, or resource information, so as to screen out abnormal resources in the cloud platform. An exception handling module exports functions adapted to the function computing component of each vendor's cloud platform. According to the defined rule template, operation and maintenance personnel use a serverless architecture to trigger the function processing logic defined in function computing, pass in the corresponding parameters, and execute repair and handling of the abnormal resources; alternatively, the error log, the related abnormal resources and related information are fed back to the operation and maintenance personnel as a message for manual handling. In addition, the records in the database module are queried by calling the API module, so that historical exceptions can be traced back.

Description

Cloud platform resource exception automatic processing system and method

Technical Field

The invention relates to the field of abnormal resource query and automatic processing of a cloud platform, in particular to a system and a method for automatically processing abnormal resources of the cloud platform.

Background

Cloud computing is currently developing rapidly, and technical and industrial innovation keeps emerging. Moving enterprises to the cloud is becoming a trend, and with the sharp increase in basic cloud resources, how to manage, operate and maintain massive resources efficiently has become an important problem that cloud providers and platform operation and maintenance personnel must solve.

Generally, a cloud computing vendor needs to periodically perform security compliance checks, tag checks and calibration, configuration checks, security baseline checks and the like on the resources in a cloud platform. Meanwhile, platform exceptions and irregular customer operations leave some resources in an abnormal state. For operation and maintenance personnel, it becomes increasingly important to locate abnormal resources quickly and accurately, extract problem logs for analysis, and repair simple exceptions.

At present, for a cloud platform with few resources, operation and maintenance personnel can check the abnormal resources in each project by running commands manually. For an environment with hundreds of resources, manual checking becomes extremely difficult, and abnormal resources in the environment are generally checked and handled by running scripts. However, running scripts raises the learning cost for operation and maintenance personnel, and as the number of scripts grows, so does the cost of maintaining the code. Moreover, a large number of scripts run on the cloud platform as timed tasks, which silently wastes resources and prevents the computing resources of the physical equipment from being used to the fullest; the exception handling history of the scripts is also hard to record, which makes it hard to trace and locate the causes of problems. A managed service of the cloud platform is the recommended way to implement this function, and many cloud platforms provide compliance checking services for resources, such as the Congress service of OpenStack and the AWS Config service. Taking open source OpenStack as an example, however, the Congress syntax resembles a functional style of writing and is relatively complex, so it also raises the learning cost for operation and maintenance personnel; its usage scenario is narrow, and it is no longer maintained by the OpenStack community.

Disclosure of Invention

The invention provides a system and a method for automatically handling cloud platform resource exceptions, and aims to solve the problem of how to locate abnormal resources quickly and accurately, extract problem logs for analysis, and repair simple exceptions.

The technical task of the invention is achieved as follows. In the automatic processing system for cloud platform resource exceptions, a rule template customized through a rule module is used to query and filter abnormal log information in a log module, or resource information, so as to screen out abnormal resources in the cloud platform; an exception handling module exports functions adapted to the function computing components of the cloud platforms of various vendors; according to the defined rule template, operation and maintenance personnel use a serverless architecture to trigger the function processing logic defined in function computing, pass in the corresponding parameters, and execute repair and handling of the abnormal resources; alternatively, the error log, the related abnormal resources and related information are fed back to the operation and maintenance personnel as a message for manual handling; in addition, the records in the database module are queried by calling the API module, so that historical exceptions can be traced back.

Preferably, the system comprises:

the API module is used for inquiring the historical inquiry information and the abnormal processing information of the database or calling the rule module to inquire and process the abnormal resources;

the rule module is used for receiving the request, extracting target data and customizing a rule template;

the log module is used for calling the elasticsearch SDK to filter the platform's WARNING and ERROR logs collected by the cloud platform's Prometheus or Grafana;

the database module is used for recording query information, query results, trigger events and execution results to a database;

the exception handling module is used for recording the mapping between the customized rule template and the exception handling template, analyzing the corresponding abnormal resources and triggering the corresponding event of the function computing component; the exception handling template comprises a script library consisting of a plurality of exception scripts, the exception scripts correspond one-to-one with the actions in the rule template, and triggering an actions event triggers the corresponding exception script to handle the corresponding abnormal data;

and the cloud platform function computing component is used for executing the corresponding function code according to the triggered event, so as to repair the exception or push the abnormal resources and log information to the operation and maintenance personnel for subsequent handling.
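By way of illustration only, the one-to-one correspondence between actions and exception scripts held by the exception handling module can be pictured as a small registry. The following is a minimal sketch under stated assumptions, not the implementation of the invention: the names `SCRIPT_LIBRARY`, `exception_script` and `trigger`, and the two sample actions, are hypothetical, and the scripts only return strings instead of operating on a cloud platform.

```python
from typing import Callable, Dict, List

# Hypothetical script library: one exception script per action name.
SCRIPT_LIBRARY: Dict[str, Callable[[dict], str]] = {}


def exception_script(action: str):
    """Register a function as the exception script for one action."""
    def decorator(func: Callable[[dict], str]) -> Callable[[dict], str]:
        if action in SCRIPT_LIBRARY:
            # Enforce the one-to-one mapping between actions and scripts.
            raise ValueError(f"action {action!r} already has a script")
        SCRIPT_LIBRARY[action] = func
        return func
    return decorator


@exception_script("reset-state")
def reset_state(resource: dict) -> str:
    # Placeholder body; a real script would repair the resource.
    return f"reset state of {resource['id']}"


@exception_script("notify")
def notify(resource: dict) -> str:
    # Placeholder body; a real script would send a message to an operator.
    return f"notified operator about {resource['id']}"


def trigger(actions: List[str], resources: List[dict]) -> List[str]:
    """Run the script mapped to each action on each abnormal resource."""
    results = []
    for resource in resources:
        for action in actions:
            script = SCRIPT_LIBRARY.get(action)
            if script is None:
                results.append(f"no script for action {action!r}")
                continue
            results.append(script(resource))
    return results
```

In this sketch an unknown action is reported rather than raised, so that one missing script does not stop the handling of the remaining resources.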

Preferably, the rule template is a declarative cloud resource configuration written in a simple YAML-based DSL; the rule template comprises resources, filters and actions;

the resources define resource types, and resource sources comprise cloud platform logs and resource information inquired through an API module;

filters define a method for filtering resources, wherein the method for filtering resources comprises common value filtering and regular matching;

actions define the operations performed on abnormal resources; for log resources at the ERROR level, the actions are executed manually after the abnormal log information has been queried and analyzed.
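As a hedged illustration of the three-part structure described above, a rule template, once parsed from YAML, could be held in a small typed record and checked for the required parts. The class `RuleTemplate`, the function `load_rule` and the key names are assumptions made for this sketch; the parsed YAML is represented as a plain dictionary so that the example needs only the standard library.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

# Parts that every rule template is assumed to carry in this sketch.
REQUIRED_KEYS = ("resource", "filters", "actions")


@dataclass
class RuleTemplate:
    description: str
    resource: str
    filters: List[Dict[str, Any]] = field(default_factory=list)
    actions: List[str] = field(default_factory=list)


def load_rule(raw: Dict[str, Any]) -> RuleTemplate:
    """Validate a parsed rule template and convert it to a RuleTemplate."""
    missing = [key for key in REQUIRED_KEYS if key not in raw]
    if missing:
        raise ValueError(f"rule template is missing: {', '.join(missing)}")
    return RuleTemplate(
        description=raw.get("description", ""),
        resource=raw["resource"],
        filters=list(raw["filters"]),
        actions=list(raw["actions"]),
    )
```

Rejecting an incomplete template at load time keeps later stages (filtering and triggering) free of missing-key checks.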

Preferably, the template files of the rule template comprise a log rule template file, and the log rule template file comprises the following fields:

description: a user-defined detailed description of the query;

resource: the resource type, identified in the form openstack.log < component name > < service name >; query filtering is also supported for the logs of cloud platform related services and of physical machines, namely rabbitmq.log, mysql.log and system.log;

filters: the filtering conditions, which screen by the node where the service is located and by log level;

actions: for log queries this is generally set to waiting, meaning that the log information is only filtered and not yet handled; after analysis, a new policy is created whose actions handle the abnormal resources according to the query result.
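The log rule fields above can be exercised with a short sketch. It is illustrative only: the rule content, the key names `node` and `level` under `filters`, and the shape of a log record are assumptions, and the records are supplied in memory rather than fetched through the elasticsearch SDK.

```python
from typing import Any, Dict, Iterable, List

# Hypothetical log rule; the filter key names are assumptions of this sketch.
LOG_RULE: Dict[str, Any] = {
    "description": "WARNING and ERROR logs on node-1",
    "resource": "openstack.log",
    "filters": {"node": "node-1", "level": ["WARNING", "ERROR"]},
    "actions": ["waiting"],
}


def filter_logs(rule: Dict[str, Any],
                records: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the records that match the node and log-level conditions."""
    wanted_node = rule["filters"].get("node")
    wanted_levels = set(rule["filters"].get("level", []))
    matched = []
    for record in records:
        if wanted_node and record.get("node") != wanted_node:
            continue
        if wanted_levels and record.get("level") not in wanted_levels:
            continue
        matched.append(record)
    return matched


def needs_manual_analysis(rule: Dict[str, Any]) -> bool:
    """A rule whose only action is 'waiting' filters logs without handling them."""
    return rule["actions"] == ["waiting"]
```

With the action set to waiting, the filtered records would simply be returned to the operator, who can then write a follow-up policy with concrete actions.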

Preferably, the template files of the rule template further comprise a resource rule template file, and the resource rule template file comprises the following fields:

description: a user-defined detailed description of the query;

resource: the service engine has built-in resource types obtainable through openstacksdk, and the corresponding resource can be matched by consulting the usage documentation;

filters: the conditions that screen resource information by value filtering and regular-expression matching;

actions: the service engine has built-in resource handling methods operable through openstacksdk; the method that meets the needs of the operation and maintenance personnel can be matched by consulting the usage documentation, and it triggers the corresponding event defined in function computing to handle the abnormal resources.
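The two filtering methods named for resource rules, value filtering and regular-expression matching, can be sketched as follows. This is a hypothetical illustration: the filter layout (`type`, `key`, `value`), the resource type name `server` and the sample action are assumptions, and the resources are plain dictionaries standing in for objects that would be obtained through openstacksdk.

```python
import re
from typing import Any, Dict, List

# Hypothetical resource rule; the filter layout is an assumption of this sketch.
RESOURCE_RULE: Dict[str, Any] = {
    "description": "test servers in ERROR state",
    "resource": "server",
    "filters": [
        {"type": "value", "key": "status", "value": "ERROR"},
        {"type": "regex", "key": "name", "value": r"^test-"},
    ],
    "actions": ["notify"],
}


def matches(resource: Dict[str, Any], flt: Dict[str, str]) -> bool:
    """Apply one filter to one resource."""
    value = resource.get(flt["key"])
    if flt["type"] == "value":
        return value == flt["value"]
    if flt["type"] == "regex":
        return value is not None and re.search(flt["value"], str(value)) is not None
    raise ValueError(f"unknown filter type: {flt['type']!r}")


def select_abnormal(rule: Dict[str, Any],
                    resources: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return the resources that satisfy every filter of the rule."""
    return [r for r in resources if all(matches(r, f) for f in rule["filters"])]
```

Filters are combined with a logical AND here; other combinations are possible but are outside what the text describes.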

An automatic processing method for cloud platform resource exceptions, in which operation and maintenance personnel customize query attributes through a rule template, and abnormal log information or resource information is queried and filtered to screen out abnormal resources in the cloud platform; a handling function is generated automatically according to the result, and the corresponding function is executed through the cloud platform function computing component to handle the abnormal resources; alternatively, the error log, the related abnormal resources and related information are fed back to the operation and maintenance personnel as a message for manual handling.

Preferably, the method is specifically as follows:

by writing the resources, filters and actions fields, the operation and maintenance personnel either call the log module, which uses the elasticsearch SDK to query and filter abnormal information in the logs of the target component, or use openstacksdk to query and filter cloud platform resource information, thereby obtaining the abnormal resources and triggering an exception handling event;

the back end of the exception handling module records each query, query result, triggered event and execution result in the database through the database module, and the API module is provided so that the operation and maintenance personnel can look up past queries and handling operations;

the exception handling module records a mapping relation between a self-defined rule template and an exception handling template, and a developer develops an exception handling function according to a handling engine structure;

and the operation and maintenance personnel call the API module to export a function template adapted to the function computing component and pass it into the cloud platform function computing component, and define the corresponding actions attribute in the template file according to the engine's usage documentation, whereupon the exception handling function can be called and triggered to handle the abnormal resources.
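The recording and look-up of history described in the steps above can be sketched with an in-memory SQLite database. The table layout and the function names `open_history`, `record` and `query_history` are assumptions made for illustration; the text does not specify a database engine or schema.

```python
import json
import sqlite3
from typing import Any, Dict, List, Optional


def open_history(path: str = ":memory:") -> sqlite3.Connection:
    """Open the history database and create the table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS history (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               rule TEXT NOT NULL,
               result TEXT NOT NULL,
               event TEXT NOT NULL,
               outcome TEXT NOT NULL)"""
    )
    return conn


def record(conn: sqlite3.Connection, rule: str, result: List[Dict[str, Any]],
           event: str, outcome: str) -> None:
    """Store one query with its result, triggered event and execution result."""
    conn.execute(
        "INSERT INTO history (rule, result, event, outcome) VALUES (?, ?, ?, ?)",
        (rule, json.dumps(result), event, outcome),
    )
    conn.commit()


def query_history(conn: sqlite3.Connection,
                  rule: Optional[str] = None) -> List[Dict[str, Any]]:
    """Return past records in insertion order, optionally for a single rule."""
    sql = "SELECT rule, result, event, outcome FROM history"
    params: tuple = ()
    if rule is not None:
        sql += " WHERE rule = ?"
        params = (rule,)
    sql += " ORDER BY id"
    rows = conn.execute(sql, params).fetchall()
    return [
        {"rule": r[0], "result": json.loads(r[1]), "event": r[2], "outcome": r[3]}
        for r in rows
    ]
```

A function such as `query_history` is what an API endpoint for historical look-up could call; the HTTP layer itself is omitted from the sketch.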

Preferably, the template files of the rule template comprise a log rule template file and a resource rule template file;

the log rule template file comprises the following fields:

description: a user-defined detailed description of the query;

actions: for log queries this is generally set to waiting, meaning that the log information is only filtered and not yet handled; after analysis, a new policy is created whose actions handle the abnormal resources according to the query result;

the resource rule template file comprises the following fields:

description: a user-defined detailed description of the query;

A computer-readable storage medium, in which computer-executable instructions are stored, wherein when a processor executes the computer-executable instructions, the method for automatically processing cloud platform resource exceptions described above is implemented.

The cloud platform resource abnormity automatic processing system and method provided by the invention have the following advantages:

(I) by defining rule entries, abnormal resource information is obtained and events are triggered, so that the exception is repaired or a message is pushed to the operation and maintenance personnel for manual handling; defining a structured rule template lowers the cost for operation and maintenance personnel of collecting abnormal resources and of learning, and optimizes the operation and maintenance workflow; introducing the function computing model allows exception handling scripts to be managed in a standardized way and faults to be archived and traced; the frequency of script execution is refined, changing the original timed-task form into a combination of manual inspection and timed tasks, which further optimizes cloud platform computing resources and improves resource utilization;

(II) the invention draws on the implementation idea of the open source project cloud-custodian, realizing resource management by defining simple rule policies; it also extends cloud-custodian and combines it with a function computing model to provide an automatic resource exception handling method for OpenStack cloud platforms based on a function computing architecture, in which operation and maintenance personnel can customize query attributes through a rule model and filter the abnormal logs collected by the log module or the queried resource information; a handling function is generated automatically according to the result and the corresponding function is executed through the function computing component of the cloud platform to handle the abnormal resources, while for complicated exceptions that require manual intervention a message is sent, and the abnormal resource information and the queried logs are fed back to the operation and maintenance personnel for further analysis and handling.

cloud-custodian is an automated compliance checking tool for public cloud scenarios. It is a declarative baseline checking tool for cloud resource configuration based on a simple YAML DSL: rules are defined in standard YAML, cloud resources that do not conform to the baseline configuration are retrieved and can be corrected automatically, thereby realizing cloud infrastructure management. However, cloud-custodian currently supports only AWS, Azure and GCP environments;

Serverless is a cloud-native development model that lets developers focus on building and running applications without having to manage servers. Compared with microservices it is a finer-grained service architecture: each API operation that a user performs on a resource, such as creation, reading, deletion and updating, is split out and abstracted into a function, and the service is published by exposing these functions directly. Serverless is therefore generally called a function computing service (FaaS); providers of function computing services currently include AWS, Alibaba Cloud, Tencent Cloud, Huawei Cloud and others. Serverless has risen as a new computing model following microservices. From the perspective of cloud computing, Serverless makes maximal use of computing resources and reduces idle and fragmented resources;

(III) the operation and maintenance personnel can query historical handling information by calling the API module; all information, such as historical call requests, abnormal data, handling events and handling results, is archived in the database, and a RESTful API is exposed so that the operation and maintenance personnel can track the handling history, analyze the causes of exceptions and trace problems back;

(IV) defining a structured rule module simplifies the handling logic for operation and maintenance personnel; the collected logs and the resource information obtained by calling the platform's openstacksdk are further classified and abstracted into information that can be queried through the rule module, which lowers the learning cost for operation and maintenance personnel, improves readability, greatly improves operation and maintenance efficiency, and allows abnormal resources and services to be located accurately;

(V) the back end obtains information such as abnormal logs, platform resource state, platform service state, physical machine system logs and virtual machine system logs by calling the monitoring log interface, the platform's openstacksdk, shell commands and the like, and abstracts the corresponding calls into a structured rule module for operation and maintenance personnel to use; the data sources for abnormal resources are rich, ranging from log information of the platform's components, physical equipment and related services to resource state information for virtual machines, images, volumes and the like obtained by calling openstacksdk, and thus cover the nodes, services and resources of the cloud platform;

(VI) functions usable by the cloud platform function computing component are exported by calling the API, exception handling scripts are managed in a standardized way, and the scripts are mapped to the actions in the rule engine to realize the exception handling logic; the cloud platform function computing component is introduced to run the abnormal-resource handling scripts in an event-triggered manner: operation and maintenance personnel can call the API through the exception handling module to export functions adapted to the function computing component of each cloud platform, trigger events in function computing using a serverless architecture according to the customized rule template, pass in the corresponding parameters, and execute repair and handling of the abnormal resources;

(VII) the exception handling module records the mapping between rule templates and exception handling templates; developers can develop exception handling functions according to the structure of the processing engine without caring about the underlying physical runtime environment, which simplifies development, and the handling scripts are easier to maintain when offered as a service; the operation and maintenance personnel only need to call the API to export a function template adapted to the function computing component, pass it into the cloud platform function computing component, and define the corresponding actions attribute in the template file according to the engine's usage documentation in order to call and trigger the exception handling function and handle the abnormal resources, without caring about the underlying implementation logic, which lowers the learning cost and improves operation and maintenance efficiency; through the cloud platform function computing component the operation history of the operation and maintenance personnel can be controlled and managed at a fine granularity, and exceptions are handled only when the function is called, which optimizes the utilization of cloud platform resources.
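The export of a function adapted to a given function computing component, mentioned among the advantages above, can be sketched as simple source generation. Everything vendor-specific here is hypothetical: the vendor keys, the entry-point signatures in `ENTRY_POINTS` and the event layout (`resource_id`) are placeholders for this sketch, and real platforms define their own handler conventions.

```python
from string import Template

# Hypothetical per-vendor entry-point signatures; real platforms may differ.
ENTRY_POINTS = {
    "vendor-a": "def handler(event, context):",
    "vendor-b": "def main_handler(event, context):",
}

# Source of the exported function. The body is a placeholder that only
# echoes the parameters it receives instead of repairing a resource.
FUNCTION_TEMPLATE = Template('''\
import json

$signature
    """Generated for action '$action'; body is a placeholder."""
    params = event if isinstance(event, dict) else json.loads(event)
    return {"action": "$action", "resource_id": params["resource_id"]}
''')


def export_function(action: str, vendor: str) -> str:
    """Return function source adapted to the entry point of one vendor."""
    if vendor not in ENTRY_POINTS:
        raise ValueError(f"unsupported vendor: {vendor!r}")
    return FUNCTION_TEMPLATE.substitute(
        signature=ENTRY_POINTS[vendor], action=action
    )
```

The generated text would then be uploaded to the function computing component; in this sketch only the entry-point line differs between vendors, which is the part the adaptation is assumed to cover.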

Drawings

The invention is further described below with reference to the accompanying drawings.

Fig. 1 is a structural block diagram of an automatic processing system for cloud platform resource exception.

Detailed Description

The automatic processing system and method for cloud platform resource exception according to the present invention will be described in detail below with reference to the drawings and specific embodiments of the specification.

In the description of the present invention, it is to be understood that the terms "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like indicate orientations or positional relationships based on those shown in the drawings. They are used only for convenience and simplicity of description, are not intended to indicate or imply that the referenced device or element must have a particular orientation or be constructed and operated in a particular orientation, and are therefore not to be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "coupled" are to be construed broadly: a connection may be fixed, removable, or integral; it may be mechanical or electrical; and it may be direct, indirect through an intervening medium, or an internal communication between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to the specific case.

Example 1:

In the cloud platform resource exception automatic processing system provided by the invention, a rule template customized through a rule module is used to query and filter abnormal log information in a log module, or resource information, so as to screen out abnormal resources in the cloud platform; an exception handling module exports functions adapted to the function computing components of the cloud platforms of various vendors; according to the defined rule template, operation and maintenance personnel use a serverless architecture to trigger the function processing logic defined in function computing, pass in the corresponding parameters, and execute repair and handling of the abnormal resources; alternatively, the error log, the related abnormal resources and related information are fed back to the operation and maintenance personnel as a message for manual handling; in addition, the records in the database module are queried by calling the API module, so that historical exceptions can be traced back. As shown in Fig. 1, the system includes the API module, the rule module, the log module, the database module, the exception handling module and the cloud platform function computing component described above.

The rule template in this embodiment is a declarative cloud resource configuration written in a simple YAML-based DSL; the rule template comprises resources, filters and actions;

The template files of the rule template in this embodiment include a log rule template file, and the log rule template file includes the following fields:

description: a user-defined detailed description of the query;

actions: for log queries this is generally set to waiting, meaning that the log information is only filtered and not yet handled; after analysis, a new policy is created whose actions handle the abnormal resources according to the query result. The log rule template file is exemplified as follows:

The template files of the rule template in this embodiment further include a resource rule template file, and the resource rule template file includes the following fields:

description: a user-defined detailed description of the query;

actions: the service engine has built-in resource handling methods operable through openstacksdk; the method that meets the needs of the operation and maintenance personnel can be matched by consulting the usage documentation, and it triggers the corresponding event defined in function computing to handle the abnormal resources. An example of a resource rule template file is as follows:

The working process of the system is as follows:

(1) the operation and maintenance personnel call the API module to query the historical query information and exception handling information in the database, or call the rule module to query and handle abnormal resources;

(2) the rule module receives the request and extracts the target data; depending on the resources parameter of the request, the data source is either the log module, which calls the elasticsearch SDK to filter the platform's WARNING and ERROR logs collected by the cloud platform's Prometheus or Grafana, or openstacksdk, which is called at the lower layer to query the state of cloud platform resources;

(3) the exception processing module analyzes the corresponding exception according to the query result and triggers the corresponding cloud platform function computing component event;

(4) the cloud platform function computing component executes the corresponding function code according to the triggered event, and either repairs the exception or pushes the abnormal resources and log information to the operation and maintenance personnel for subsequent handling.
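The working process of steps (2) to (4) can be sketched end to end as follows. This is a simplified illustration under stated assumptions: `run_rule`, the `fetch` callback (standing in for the log module or openstacksdk), the `handlers` mapping (standing in for functions in the function computing component) and the list used as history are all hypothetical, and filters are reduced to exact-value matches.

```python
from typing import Any, Callable, Dict, List


def run_rule(rule: Dict[str, Any],
             fetch: Callable[[str], List[Dict[str, Any]]],
             handlers: Dict[str, Callable[[Dict[str, Any]], str]],
             history: List[Dict[str, Any]]) -> List[str]:
    """Query, filter, trigger handlers and record the run."""
    # Step (2): extract target data for the requested resource type.
    candidates = fetch(rule["resource"])
    abnormal = [
        item for item in candidates
        if all(item.get(key) == value for key, value in rule["filters"].items())
    ]
    # Steps (3) and (4): trigger the function mapped to each action, or
    # push the resource to the operator when no function is available.
    outcomes = []
    for item in abnormal:
        for action in rule["actions"]:
            handler = handlers.get(action)
            if handler is None:
                outcomes.append(f"{item['id']}: pushed to operator")
            else:
                outcomes.append(f"{item['id']}: {handler(item)}")
    # Keep a record so that the run can be looked up later.
    history.append({
        "rule": rule["description"],
        "abnormal": len(abnormal),
        "outcomes": outcomes,
    })
    return outcomes
```

Passing the data source and the handlers in as arguments keeps the flow independent of any particular cloud SDK, which also makes it straightforward to exercise with stub data.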

Example 2:

In the cloud platform resource exception automatic processing method of the invention, operation and maintenance personnel customize query attributes through a rule template, and abnormal log information or resource information is queried and filtered to screen out abnormal resources in the cloud platform; a handling function is generated automatically according to the result, and the corresponding function is executed through the cloud platform function computing component to handle the abnormal resources; alternatively, the error log, the related abnormal resources and related information are fed back to the operation and maintenance personnel as a message for manual handling. The method comprises the following specific steps:

S1, the operation and maintenance personnel write the resources, filters and actions fields and either use the elasticsearch SDK to query and filter abnormal information in the logs of the target component, or use openstacksdk to query cloud platform resource information, thereby obtaining the abnormal resources and triggering an exception handling event;

S2, the back end of the exception handling module records each query, query result, triggered event and execution result in the database through the database module, and the API module is provided so that the operation and maintenance personnel can look up past queries and handling operations;

S3, the exception handling module records the mapping between the customized rule template and the exception handling template, and developers develop exception handling functions according to the structure of the processing engine;

S4, the operation and maintenance personnel call the API module to export a function template adapted to the function computing component and pass it into the cloud platform function computing component, and define the corresponding actions attribute in the template file according to the engine's usage documentation, whereupon the exception handling function can be called and triggered to handle the abnormal resources.

The rule template of step S3 in this embodiment is a declarative cloud resource configuration written in a simple YAML-based DSL; the rule template comprises resources, filters and actions;

In this embodiment, the template file of step S4 includes a log rule template file and a resource rule template file;

the log rule template file comprises the following fields:

description: a user-defined detailed description of the query;

the resource rule template file comprises the following fields:

description: a user-defined detailed description of the query;

Example 3:

the embodiment of the invention also provides a computer-readable storage medium, wherein a plurality of instructions are stored, and the instructions are loaded by the processor, so that the processor executes the automatic processing method for the cloud platform resource exception in any embodiment of the invention. Specifically, a system or an apparatus equipped with a storage medium on which software program codes that realize the functions of any of the above-described embodiments are stored may be provided, and a computer (or a CPU or MPU) of the system or the apparatus is caused to read out and execute the program codes stored in the storage medium.

In this case, the program code itself read from the storage medium can realize the functions of any of the above-described embodiments, and thus the program code and the storage medium storing the program code constitute a part of the present invention.

Examples of the storage medium for supplying the program code include a floppy disk, a hard disk, a magneto-optical disk, an optical disk (e.g., CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, DVD+RW), a magnetic tape, a nonvolatile memory card, and a ROM. Alternatively, the program code may be downloaded from a server computer via a communications network.

Further, it should be clear that the functions of any one of the above-described embodiments may be implemented not only by executing the program code read out by the computer, but also by causing an operating system or the like operating on the computer to perform a part or all of the actual operations based on instructions of the program code.

Further, it is to be understood that the program code read out from the storage medium is written to a memory provided in an expansion board inserted into the computer or to a memory provided in an expansion unit connected to the computer, and then causes a CPU or the like mounted on the expansion board or the expansion unit to perform part or all of the actual operations based on instructions of the program code, thereby realizing the functions of any of the above-described embodiments.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. An automatic processing system for cloud platform resource exceptions, characterized in that a rule template customized through a rule module is used to query and filter abnormal log information in a log module, or resource information, so as to screen out abnormal resources in the cloud platform; an exception handling module exports functions adapted to the function computing components of the cloud platforms of various vendors; according to the defined rule template, operation and maintenance personnel use a serverless architecture to trigger the function processing logic defined in function computing, pass in the corresponding parameters, and execute repair and handling of the abnormal resources; alternatively, the error log, the related abnormal resources and related information are fed back to the operation and maintenance personnel as a message for manual handling; in addition, the records in the database module are queried by calling the API module, so that historical exceptions can be traced back.

2. The cloud platform resource exception handling system of claim 1, wherein the system comprises,

the exception handling module is used for recording the mapping between the customized rule template and the exception handling template, analyzing the corresponding abnormal resources and triggering the corresponding event of the function computing component;

3. The cloud platform resource exception handling system according to claim 1 or 2, wherein the rule template is a declarative cloud resource configuration written in a simple YAML-based DSL; the rule template comprises resources, filters and actions;

4. The cloud platform resource exception handling system according to claim 3, wherein the template files of the rule template comprise a log rule template file, and the log rule template file comprises the following fields:

description: a user-defined detailed description of the query;

5. The cloud platform resource exception handling system according to claim 4, wherein the template files of the rule template further comprise a resource rule template file, and the resource rule template file comprises the following fields:

description: a user-defined detailed description of the query;

6. An automatic processing method for cloud platform resource exceptions, characterized in that operation and maintenance personnel customize query attributes through a rule template, and abnormal log information or resource information is queried and filtered to screen out abnormal resources in the cloud platform; a handling function is generated automatically according to the result, and the corresponding function is executed through the cloud platform function computing component to handle the abnormal resources; alternatively, the error log, the related abnormal resources and related information are fed back to the operation and maintenance personnel as a message for manual handling.

7. The method for automatically processing the cloud platform resource exception according to claim 6, wherein the method specifically comprises the following steps:

8. The cloud platform resource exception handling method according to claim 6 or 7, wherein the rule template is a declarative cloud resource configuration written in a simple YAML-based DSL; the rule template comprises resources, filters and actions;

9. The cloud platform resource exception automatic processing method according to claim 8, wherein the template files of the rule template include a log rule template file and a resource rule template file;

the log rule template file comprises the following fields:

description: a user-defined detailed description of the query;

the resource rule template file comprises the following fields:

description: a user-defined detailed description of the query;

10. A computer-readable storage medium, wherein the computer-readable storage medium stores computer-executable instructions, and when a processor executes the computer-executable instructions, the method for automatically processing cloud platform resource exceptions according to any one of claims 6 to 9 is implemented.