CN113553238A - Cloud platform resource exception automatic processing system and method - Google Patents

Publication number
CN113553238A
Authority
CN
China
Prior art keywords
resource
log
abnormal
resources
cloud platform
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110834688.3A
Other languages
Chinese (zh)
Other versions
CN113553238B (en)
Inventor
宋洪圆
蔡卫卫
谢涛涛
宋伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Cloud Information Technology Co Ltd
Original Assignee
Inspur Cloud Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Cloud Information Technology Co Ltd
Priority to CN202110834688.3A
Publication of CN113553238A
Application granted
Publication of CN113553238B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3065 Monitoring arrangements determined by the means or processing involved in reporting the monitored data
    • G06F11/3072 Monitoring arrangements determined by the means or processing involved in reporting the monitored data where the reporting involves data filtering, e.g. pattern matching, time or event triggered, adaptive or policy-based reporting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/2289 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing by configuration test
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3089 Monitoring arrangements determined by the means or processing involved in sensing the monitored data, e.g. interfaces, connectors, sensors, probes, agents
    • G06F11/3093 Configuration details thereof, e.g. installation, enabling, spatial arrangement of the probes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Hardware Design (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a system and a method for automatically processing cloud platform resource exceptions, belonging to the field of cloud platform abnormal-resource query and automatic processing. The technical problem to be solved is how to locate abnormal resources quickly and accurately, extract problem logs for analysis, and repair simple exceptions. The technical scheme is as follows: through a rule template customized in a rule module, the system queries and filters abnormal log information in a log module, or filters resource information to find abnormal resources in the cloud platform; an exception handling module exports functions adapted to the function computing component of each vendor's cloud platform; according to the defined rule template, operation and maintenance personnel use a serverless architecture to trigger the function processing logic defined in function computing, pass in the corresponding parameters, and repair and process the abnormal resources; alternatively, the error log, the related abnormal resources and the related information are fed back to the operation and maintenance personnel as a message for manual processing; meanwhile, the records in a database module are queried by calling an API module, so that historical exceptions can be traced back.

Description

Cloud platform resource exception automatic processing system and method
Technical Field
The invention relates to the field of abnormal resource query and automatic processing for cloud platforms, and in particular to a system and a method for automatically processing cloud platform resource exceptions.
Background
Cloud computing is currently developing rapidly, and innovations in technology and industry keep emerging. Moving enterprise workloads to the cloud is becoming a trend, and with the sharp increase in basic cloud resources, how to efficiently manage, operate and maintain massive numbers of resources has become an important problem for cloud providers and platform operation and maintenance personnel.
Generally, a cloud computing vendor needs to periodically perform security compliance checks, tag checks and calibration, configuration checks, security baseline checks and the like on the resources in a cloud platform; meanwhile, platform exceptions and non-standard customer operations leave some resources in an abnormal state. For operation and maintenance personnel, it is therefore increasingly important to locate abnormal resources quickly and accurately, extract problem logs for analysis, and repair simple exceptions.
At present, for a cloud platform with few resources, operation and maintenance personnel can check the abnormal resources in each project by executing commands manually; for an environment with hundreds of resources, manual checking becomes extremely difficult, and abnormal resources are generally checked and processed by executing scripts. However, scripts raise the learning cost for operation and maintenance personnel, and as scripts accumulate, the code maintenance cost also grows. Moreover, a large number of scripts running on the cloud platform as timed tasks silently waste resources, so the computing resources of the physical equipment cannot be used to the fullest; it is also difficult to record the exception handling history of the scripts and therefore to trace and locate the cause of a problem. Using a hosted cloud platform service for this function is the recommended approach, and many cloud platforms provide compliance check services for resources, such as the Congress service of OpenStack and the AWS Config service. Taking open-source OpenStack as an example, the Congress syntax resembles a functional style of writing and is relatively complex, which also raises the learning cost for operation and maintenance personnel; its usage scenarios are narrow, and it is no longer maintained by the OpenStack community.
Disclosure of Invention
The invention provides a system and a method for automatically processing cloud platform resource exceptions, in order to solve the problem of how to locate abnormal resources quickly and accurately, extract problem logs for analysis, and repair simple exceptions.
The technical task of the invention is achieved as follows: in a system for automatically processing cloud platform resource exceptions, a rule template customized through a rule module is used to query and filter abnormal log information in a log module or abnormal resources in the cloud platform; an exception handling module exports functions adapted to the function computing components of the cloud platforms of various vendors; according to the defined rule template, operation and maintenance personnel use a serverless architecture to trigger the function processing logic defined in function computing, pass in the corresponding parameters, and repair and process the abnormal resources; alternatively, the error log, the related abnormal resources and the related information are fed back to the operation and maintenance personnel as a message for manual processing; meanwhile, the records in the database module are queried by calling the API module, so that historical exceptions can be traced back.
Preferably, the system comprises:
the API module, used for querying the historical query information and exception handling information in the database, or for calling the rule module to query and process abnormal resources;
the rule module, used for receiving requests, extracting the target data, and customizing rule templates;
the log module, used for calling Elasticsearch and the SDK to filter the platform's WARNING and ERROR logs collected by the cloud platform's Prometheus or Grafana;
the database module, used for recording query information, query results, trigger events and execution results in a database;
the exception handling module, used for recording the mapping between the customized rule template and the exception handling template, analyzing the corresponding abnormal resources, and triggering the corresponding function computing component event; the exception handling template comprises a script library made up of several exception scripts, which correspond one-to-one to the actions in the rule template; triggering an actions event triggers the corresponding exception script to process the abnormal data;
and the cloud platform function computing component, used for executing the corresponding function code according to the triggered event, and either repairing the exception or pushing the abnormal resources and log information to operation and maintenance personnel for subsequent processing.
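The one-to-one correspondence between actions and exception scripts can be pictured as a registry keyed by action name. The following is a minimal Python sketch under stated assumptions: the action names (`restart_service`, `notify`), the handler signature and the registry itself are hypothetical illustrations and are not taken from the patent.

```python
from typing import Callable, Dict, List

# Script library: exactly one handler per action name (names are hypothetical).
SCRIPT_LIBRARY: Dict[str, Callable[[dict], str]] = {}


def exception_script(action: str):
    """Register a handler as the single script bound to `action`."""
    def decorator(func: Callable[[dict], str]):
        if action in SCRIPT_LIBRARY:
            raise ValueError(f"action {action!r} already has a script")
        SCRIPT_LIBRARY[action] = func
        return func
    return decorator


@exception_script("restart_service")
def restart_service(resource: dict) -> str:
    # A real script would call the platform here; this one only reports.
    return f"restart requested for {resource['name']}"


@exception_script("notify")
def notify(resource: dict) -> str:
    return f"operator notified about {resource['name']}"


def trigger(actions: List[str], resource: dict) -> List[str]:
    """Run the script mapped to each action listed in a rule template."""
    results = []
    for action in actions:
        if action not in SCRIPT_LIBRARY:
            raise KeyError(f"no exception script for action {action!r}")
        results.append(SCRIPT_LIBRARY[action](resource))
    return results
```

Rejecting a second registration for the same action keeps the mapping strictly one-to-one, which is the property the exception handling template relies on.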
Preferably, the rule template is a declarative cloud resource configuration written in a simple YAML-based DSL; the rule template comprises resources, filters and actions;
resources define the resource type; resource sources include cloud platform logs and resource information queried through the API module;
filters define how resources are filtered, including ordinary value filtering and regular-expression matching;
actions define the operations performed on abnormal resources; when log ERROR entries are selected as the resource, the actions are executed manually after the abnormal log information has been queried and analyzed.
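As an illustration of the three-part template, the sketch below models a rule after it has been parsed from YAML into a dictionary. It is only a sketch: the `Rule` class, the `parse_rule` helper and the choice of `name` and `resource` as required keys are assumptions made for the example, not part of the patent.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Rule:
    name: str
    resource: str                      # resource type: a log or an API resource
    filters: List[Dict[str, Any]] = field(default_factory=list)
    actions: List[str] = field(default_factory=list)
    description: str = ""


def parse_rule(doc: Dict[str, Any]) -> Rule:
    """Build a Rule from an already-parsed template document."""
    # Assumption: a rule is unusable without a name and a resource type.
    missing = [key for key in ("name", "resource") if key not in doc]
    if missing:
        raise ValueError(f"rule template is missing: {', '.join(missing)}")
    return Rule(
        name=doc["name"],
        resource=doc["resource"],
        filters=list(doc.get("filters", [])),
        actions=list(doc.get("actions", [])),
        description=doc.get("description", ""),
    )
```

Leaving `actions` empty by default matches the case in which a query only collects information and any processing is decided later.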
Preferably, the template files of the rule template comprise a log rule template file, and the log rule template file comprises the following fields:
name: a user-defined name for the query;
description: a user-defined detailed description of the query;
resource: the resource type is identified in the form openstack.log <component name> <service name>; query filtering of the logs of cloud-platform-related services and of the physical machine, namely rabbitmq.log, mysql.log and system.log, is also supported;
filters: define the filtering conditions, namely the node on which the service runs and the log level;
actions: for log queries this is generally set to waiting, meaning that the log information is only filtered and not processed for the time being; after analysis, new policies are created according to the query result, and their actions process the abnormal resources.
Preferably, the template files of the rule template further comprise a resource rule template file, and the resource rule template file comprises the following fields:
name: a user-defined name for the query;
description: a user-defined detailed description of the query;
resource: the service engine has built-in resource types that can be obtained through openstacksdk; the corresponding resource can be matched by consulting the usage documentation;
filters: screen the resource information through value filtering and regular-expression matching;
actions: the service engine has built-in resource processing methods that can be performed through openstacksdk; the method meeting the needs of the operation and maintenance personnel can be matched by consulting the usage documentation, and the corresponding event defined in function computing is triggered to process the abnormal resources.
A method for automatically processing cloud platform resource exceptions, wherein operation and maintenance personnel customize query attributes through a rule template; abnormal log information is queried and filtered, or resource information is filtered to find abnormal resources in the cloud platform; a processing function is generated automatically according to the result, and the corresponding function is executed by the cloud platform function computing component to process the abnormal resources; alternatively, the error log, the related abnormal resources and the related information are fed back to the operation and maintenance personnel as a message for manual processing.
Preferably, the method is specifically as follows:
by writing the resources, filters and actions entries, the operation and maintenance personnel call the log module, which uses Elasticsearch and the SDK to query and filter abnormal information in the logs of the target component, or use openstacksdk to query and filter the resource information of the cloud platform, thereby obtaining the abnormal resources and triggering an exception handling event;
in the background, the exception handling module records every query, query result, trigger event and execution result in a database through the database module, and the API module is provided so that operation and maintenance personnel can query historical queries and processing operations;
the exception handling module records the mapping between the customized rule template and the exception handling template, and developers develop exception handling functions according to the structure of the processing engine;
and the operation and maintenance personnel call the API module to export a function template adapted to the function computing component, upload it to the cloud platform function computing component, and define the corresponding actions attributes in the template file according to the engine's usage documentation, so that the exception handling function can be called and triggered to process the abnormal resources.
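The recording and history-query step can be sketched with an embedded database. The example below is a sketch only: the patent does not name a database engine or a schema, so SQLite, the `history` table and its column names are all assumptions chosen to keep the example self-contained.

```python
import json
import sqlite3


def open_history(path: str = ":memory:") -> sqlite3.Connection:
    """Open the history store and create the (hypothetical) table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS history (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               rule_name TEXT NOT NULL,
               query_result TEXT NOT NULL,
               trigger_event TEXT,
               execution_result TEXT
           )"""
    )
    return conn


def record(conn, rule_name, query_result, trigger_event, execution_result):
    """Store one query together with the event it triggered and the outcome."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO history (rule_name, query_result, trigger_event,"
            " execution_result) VALUES (?, ?, ?, ?)",
            (rule_name, json.dumps(query_result), trigger_event, execution_result),
        )


def history_for(conn, rule_name):
    """Return past runs of a rule, oldest first, as dictionaries."""
    rows = conn.execute(
        "SELECT query_result, trigger_event, execution_result FROM history"
        " WHERE rule_name = ? ORDER BY id",
        (rule_name,),
    ).fetchall()
    return [
        {"query_result": json.loads(q), "trigger_event": t, "execution_result": e}
        for q, t, e in rows
    ]
```

An API module in the sense of the method would expose something like `history_for` over HTTP; that layer is omitted here.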
Preferably, the rule template is a declarative cloud resource configuration written in a simple YAML-based DSL; the rule template comprises resources, filters and actions;
resources define the resource type; resource sources include cloud platform logs and resource information queried through the API module;
filters define how resources are filtered, including ordinary value filtering and regular-expression matching;
actions define the operations performed on abnormal resources; when log ERROR entries are selected as the resource, the actions are executed manually after the abnormal log information has been queried and analyzed.
Preferably, the template files of the rule template comprise a log rule template file and a resource rule template file;
the log rule template file comprises the following fields:
name: a user-defined name for the query;
description: a user-defined detailed description of the query;
resource: the resource type is identified in the form openstack.log <component name> <service name>; query filtering of the logs of cloud-platform-related services and of the physical machine, namely rabbitmq.log, mysql.log and system.log, is also supported;
filters: define the filtering conditions, namely the node on which the service runs and the log level;
actions: for log queries this is generally set to waiting, meaning that the log information is only filtered and not processed for the time being; after analysis, new policies are created according to the query result, and their actions process the abnormal resources;
the resource rule template file comprises the following fields:
name: a user-defined name for the query;
description: a user-defined detailed description of the query;
resource: the service engine has built-in resource types that can be obtained through openstacksdk; the corresponding resource can be matched by consulting the usage documentation;
filters: screen the resource information through value filtering and regular-expression matching;
actions: the service engine has built-in resource processing methods that can be performed through openstacksdk; the method meeting the needs of the operation and maintenance personnel can be matched by consulting the usage documentation, and the corresponding event defined in function computing is triggered to process the abnormal resources.
A computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the method for automatically processing cloud platform resource exceptions described above.
The system and method for automatically processing cloud platform resource exceptions provided by the invention have the following advantages:
(I) by defining rule entries, the invention obtains abnormal resource information, triggers events, and either repairs the exception or pushes a message to operation and maintenance personnel for manual processing; defining a structured rule template reduces the cost of collecting abnormal resources and the learning cost for operation and maintenance personnel and optimizes the operation and maintenance workflow; introducing the function computing model allows exception handling scripts to be managed in a standardized way and faults to be archived and traced; script execution frequency is refined, replacing pure timed tasks with a combination of manual inspection and timed tasks, which further optimizes cloud platform computing resources and improves resource utilization;
(II) the invention draws on the implementation idea of the open-source project Cloud Custodian, which manages resources by defining simple rule policies; at the same time it extends Cloud Custodian and combines it with the function computing model to provide an automatic processing method for OpenStack cloud platform resource exceptions based on a function computing architecture: operation and maintenance personnel customize query attributes through the rule template and filter the abnormal logs collected by the log module or the queried resource information; a processing function is generated automatically according to the result and executed by the function computing component of the cloud platform to process the abnormal resources; for complex exceptions requiring manual intervention, a message is sent that feeds the abnormal resource information and the queried logs back to the operation and maintenance administrator for further analysis and processing.
Cloud Custodian is an automated compliance checking tool for public cloud scenarios: a baseline checking tool for declarative cloud resource configuration based on a simple YAML DSL. Rules defined in standard YAML retrieve cloud resources that do not conform to the baseline configuration and can correct them automatically, thereby managing the cloud infrastructure. However, Cloud Custodian currently supports only AWS, Azure and GCP environments;
Serverless is a cloud-native development model that lets developers focus on building and running applications without managing servers. Compared with microservices, it is a finer-grained service architecture: each API operation performed by a user is split further, so that operations on a resource such as create, read, delete and update are each abstracted into a function, and a serverless service is published by exposing these functions directly. Serverless is therefore commonly referred to as Function as a Service (FaaS); providers currently offering function computing services include AWS, Alibaba Cloud, Tencent Cloud and Huawei Cloud. From the perspective of cloud computing, serverless uses computing resources to the fullest and reduces idle and fragmented resources;
(III) operation and maintenance personnel can query historical processing information by calling the API module: all historical call requests, abnormal data, processing events, processing results and similar information are archived in the database, and a RESTful API is exposed so that operation and maintenance personnel can follow the processing history, analyze the causes of exceptions, and trace problems back;
(IV) defining a structured rule module simplifies the processing logic for operation and maintenance personnel; the collected logs and the resource information obtained by calling the platform's openstacksdk are further classified and abstracted into information that the rule module can query, which lowers the learning cost, improves readability, greatly improves operation and maintenance efficiency, and locates abnormal resources and services accurately;
(V) the background obtains abnormal logs, platform resource state information, platform service states, physical machine system logs, virtual machine system logs and similar information by calling the monitoring log interface, the platform's openstacksdk, shell commands and the like, and abstracts the corresponding calls into the structured rule module for operation and maintenance personnel to use; the abnormal-resource data sources are rich, covering cloud platform nodes, services and resources, from the log information of platform components, physical equipment and related services to the state of resources such as virtual machines, images and volumes obtained by calling openstacksdk;
(VI) functions usable by the cloud platform function computing component are exported by calling an API; exception handling scripts are managed in a standardized way and mapped to the actions in the rule engine to implement the exception handling logic; the cloud platform function computing component is introduced to run abnormal-resource processing scripts in an event-triggered manner: operation and maintenance personnel call the API of the exception handling module to export a function adapted to each cloud platform's function computing component, trigger events in function computing through the serverless architecture according to the customized rule template, pass in the corresponding parameters, and repair and process the abnormal resources;
(VII) the exception handling module records the mapping between the rule template and the exception handling template, so developers can develop exception handling functions according to the structure of the processing engine without attending to the underlying physical runtime environment; this simplifies development, and delivering the processing scripts as a service makes them easier to maintain; operation and maintenance personnel only need to call the API to export a function template adapted to the function computing component, upload it to the cloud platform function computing component, and define the corresponding actions attributes in the template file according to the engine's usage documentation in order to call and trigger the exception handling function and process the abnormal resources, without concern for the underlying implementation logic; this lowers the learning cost and improves operation and maintenance efficiency; the cloud platform function computing component also allows fine-grained control and management of the operation history of operation and maintenance personnel, and because exceptions are handled only when a function is called, the utilization of cloud platform resources is optimized.
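Exporting a function template for a function computing component can be sketched as rendering handler source code from a fixed skeleton. This is a sketch under stated assumptions: the `handler(event, context)` entry point, the `resource_ids` event key and the returned dictionary are hypothetical, since each function computing service defines its own entry-point signature and the patent does not specify one.

```python
from string import Template

# Hypothetical handler layout; real function computing services each
# define their own entry-point signature.
HANDLER_TEMPLATE = Template('''\
def handler(event, context):
    """Generated entry point for action '$action'."""
    resource_ids = event.get("resource_ids", [])
    return {"action": "$action", "handled": len(resource_ids)}
''')


def export_function(action: str) -> str:
    """Render handler source code for one action in a rule template."""
    # Only plain identifiers are accepted so that the action name cannot
    # inject arbitrary code into the generated source.
    if not action.isidentifier():
        raise ValueError(f"unsafe action name: {action!r}")
    return HANDLER_TEMPLATE.substitute(action=action)
```

The returned string is what an operator would upload to the function computing component; the body of the generated handler is a placeholder for the real exception handling logic.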
Drawings
The invention is further described below with reference to the accompanying drawings.
Fig. 1 is a structural block diagram of an automatic processing system for cloud platform resource exception.
Detailed Description
The automatic processing system and method for cloud platform resource exception according to the present invention will be described in detail below with reference to the drawings and specific embodiments of the specification.
In the description of the present invention, it is to be understood that the terms "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like indicate orientations or positional relationships based on those shown in the drawings, and are only for convenience of description and simplicity of description. And are not intended to indicate or imply that the referenced device or element must have a particular orientation, be constructed and operated in a particular orientation, and are therefore not to be considered limiting of the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "connected" are to be construed broadly, e.g., as meaning either a fixed connection, a removable connection, or an integral connection; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
Example 1:
In the system for automatically processing cloud platform resource exceptions provided by the invention, a rule template customized through a rule module is used to query and filter abnormal log information in a log module, or to filter resource information to find abnormal resources in the cloud platform; an exception handling module exports functions adapted to the function computing components of the cloud platforms of various vendors; according to the defined rule template, operation and maintenance personnel use a serverless architecture to trigger the function processing logic defined in function computing, pass in the corresponding parameters, and repair and process the abnormal resources; alternatively, the error log, the related abnormal resources and the related information are fed back to the operation and maintenance personnel as a message for manual processing; meanwhile, the records in the database module are queried by calling the API module, so that historical exceptions can be traced back. As shown in Fig. 1, the system comprises:
the API module, used for querying the historical query information and exception handling information in the database, or for calling the rule module to query and process abnormal resources;
the rule module, used for receiving requests, extracting the target data, and customizing rule templates;
the log module, used for calling Elasticsearch and the SDK to filter the platform's WARNING and ERROR logs collected by the cloud platform's Prometheus or Grafana;
the database module, used for recording query information, query results, trigger events and execution results in a database;
the exception handling module, used for recording the mapping between the customized rule template and the exception handling template, analyzing the corresponding abnormal resources, and triggering the corresponding function computing component event; the exception handling template comprises a script library made up of several exception scripts, which correspond one-to-one to the actions in the rule template; triggering an actions event triggers the corresponding exception script to process the abnormal data;
and the cloud platform function computing component, used for executing the corresponding function code according to the triggered event, and either repairing the exception or pushing the abnormal resources and log information to operation and maintenance personnel for subsequent processing.
The rule template in this embodiment is a declarative cloud resource configuration written in a simple YAML-based DSL; the rule template comprises resources, filters and actions;
resources define the resource type; resource sources include cloud platform logs and resource information queried through the API module;
filters define how resources are filtered, including ordinary value filtering and regular-expression matching;
actions define the operations performed on abnormal resources; when log ERROR entries are selected as the resource, the actions are executed manually after the abnormal log information has been queried and analyzed.
The template files of the rule template in this embodiment comprise a log rule template file, and the log rule template file comprises the following fields:
name: a user-defined name for the query;
description: a user-defined detailed description of the query;
resource: the resource type is identified in the form openstack.log <component name> <service name>; query filtering of the logs of cloud-platform-related services and of the physical machine, namely rabbitmq.log, mysql.log and system.log, is also supported;
filters: define the filtering conditions, namely the node on which the service runs and the log level;
actions: for log queries this is generally set to waiting, meaning that the log information is only filtered and not processed for the time being; after analysis, new policies are created according to the query result, and their actions process the abnormal resources. An example of the log rule template file is as follows:
[Figures BDA0003176681060000091 and BDA0003176681060000101: example log rule template file; images not reproduced in the text]
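The example file itself survives only as an image reference, so the following Python sketch uses a hypothetical log rule to show how the node and log-level filters could be evaluated. The rule contents, the dotted resource string, the key names and the log entry layout are all assumptions; the rule is written as a dictionary, as it would look after the YAML file has been parsed.

```python
# Hypothetical log rule; the resource string and key names are illustrative.
LOG_RULE = {
    "name": "nova-compute-errors",
    "description": "WARNING and ERROR lines from nova-compute on one node",
    "resource": "openstack.log.nova.compute",
    "filters": [{"node": "compute-01"}, {"level": ["WARNING", "ERROR"]}],
    "actions": ["waiting"],
}


def matches(entry: dict, filters: list) -> bool:
    """An entry passes when it satisfies every filter in the list."""
    for flt in filters:
        for key, expected in flt.items():
            # A filter value may be a single value or a list of allowed values.
            allowed = expected if isinstance(expected, list) else [expected]
            if entry.get(key) not in allowed:
                return False
    return True


def run_log_rule(rule: dict, entries: list) -> list:
    """Return matching entries; 'waiting' means no action is triggered."""
    return [entry for entry in entries if matches(entry, rule["filters"])]
```

With actions set to waiting, the result is only returned for analysis, which mirrors the description of the actions field for log queries.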
The template files of the rule template in this embodiment further comprise a resource rule template file, and the resource rule template file comprises the following fields:
name: a user-defined name for the query;
description: a user-defined detailed description of the query;
resource: the service engine has built-in resource types that can be obtained through openstacksdk; the corresponding resource can be matched by consulting the usage documentation;
filters: screen the resource information through value filtering and regular-expression matching;
actions: the service engine has built-in resource processing methods that can be performed through openstacksdk; the method meeting the needs of the operation and maintenance personnel can be matched by consulting the usage documentation, and the corresponding event defined in function computing is triggered to process the abnormal resources. An example of the resource rule template file is as follows:
[Figure: example resource rule template file (image not reproduced in this text)]
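As the example figure is not reproduced in this text, the following non-limiting sketch illustrates how the two filter kinds named above, value filtering and regular-expression matching, might be applied to resource records. The rule contents, the `type` key on each filter, the sample server records and the helper names `matches` and `select_abnormal` are assumptions made for illustration. In a real deployment the records would presumably come from openstacksdk rather than from a hard-coded list.

```python
import re

# Hypothetical resource rule; field names follow the text, values are assumptions.
RESOURCE_RULE = {
    "name": "error-test-servers",
    "description": "Servers in ERROR state whose name starts with 'test-'",
    "resource": "server",
    "filters": [
        {"type": "value", "key": "status", "value": "ERROR"},
        {"type": "regex", "key": "name", "value": r"^test-"},
    ],
    "actions": ["reboot"],
}

# Stand-in for records that would be fetched from the cloud platform.
SAMPLE_SERVERS = [
    {"id": "a1", "name": "test-web", "status": "ERROR"},
    {"id": "b2", "name": "prod-db", "status": "ERROR"},
    {"id": "c3", "name": "test-db", "status": "ACTIVE"},
]


def matches(resource, flt):
    """Return True when one resource record satisfies one filter."""
    actual = resource.get(flt["key"])
    if actual is None:
        return False
    if flt["type"] == "value":
        return actual == flt["value"]
    if flt["type"] == "regex":
        return re.search(flt["value"], str(actual)) is not None
    raise ValueError(f"unknown filter type: {flt['type']!r}")


def select_abnormal(resources, rule):
    """Keep the resources that satisfy every filter in the rule."""
    return [r for r in resources if all(matches(r, f) for f in rule["filters"])]


if __name__ == "__main__":
    for server in select_abnormal(SAMPLE_SERVERS, RESOURCE_RULE):
        print(server["id"], server["name"], "->", RESOURCE_RULE["actions"])
```

Combining the filters with a logical AND is a design choice of this sketch; the text does not state how multiple filters are combined.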
The working process of the system is as follows:
(1) the operation and maintenance personnel call the API module to query the historical query information and exception handling information in the database, or call the rule module to query and process abnormal resources;
(2) the rule module receives the request and extracts the target data; depending on the resources parameter of the request, the data source is either the log module, which calls the elasticsearch sdk to filter the platform's WARNING and ERROR logs collected by the cloud platform's Prometheus or Grafana, or the underlying openstacksdk, which is called to query cloud platform resource status information;
(3) the exception handling module analyzes the corresponding exception according to the query result and triggers the corresponding cloud platform function computing component event;
(4) the cloud platform function computing component executes the corresponding function code according to the triggered event, and either repairs the exception or pushes the abnormal resources and log information to the operation and maintenance personnel for subsequent handling.
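Steps (2) to (4) can be pictured with the following non-limiting sketch, in which each action named in a rule is treated as an event that either runs a registered repair function or, when no function is registered, produces a message for the operation and maintenance personnel. The class `ExceptionHandler`, the functions `run_rule`, `fake_fetch` and `reboot`, and the `abnormal_status` key are hypothetical names introduced here; a real system would hand the event to the cloud platform's function computing component instead of calling a local Python function.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ExceptionHandler:
    """Maps action names to handler functions and collects manual-handling messages."""
    handlers: Dict[str, Callable[[dict], str]] = field(default_factory=dict)
    notifications: List[str] = field(default_factory=list)

    def register(self, action: str, func: Callable[[dict], str]) -> None:
        self.handlers[action] = func

    def dispatch(self, action: str, resource: dict) -> str:
        func = self.handlers.get(action)
        if func is None:
            # No automated repair is registered: push a message for manual handling.
            message = f"manual handling needed: {resource['id']} ({action})"
            self.notifications.append(message)
            return message
        return func(resource)


def run_rule(rule: dict, fetch: Callable[[str], List[dict]],
             handler: ExceptionHandler) -> List[str]:
    """Fetch target data, keep abnormal records, trigger one event per action."""
    records = fetch(rule["resource"])
    abnormal = [r for r in records if r.get("status") == rule["abnormal_status"]]
    return [handler.dispatch(a, r) for r in abnormal for a in rule["actions"]]


def fake_fetch(resource_type: str) -> List[dict]:
    """Stand-in data source; a real one would query logs or platform resources."""
    return [{"id": "vm-1", "status": "ACTIVE"}, {"id": "vm-2", "status": "ERROR"}]


def reboot(resource: dict) -> str:
    """Stand-in repair function."""
    return f"rebooted {resource['id']}"


if __name__ == "__main__":
    handler = ExceptionHandler()
    handler.register("reboot", reboot)
    rule = {"resource": "server", "abnormal_status": "ERROR",
            "actions": ["reboot", "notify"]}
    for outcome in run_rule(rule, fake_fetch, handler):
        print(outcome)
```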
Example 2:
The invention relates to an automatic processing method for cloud platform resource exceptions, in which operation and maintenance personnel define query attributes through a rule template, query and filter abnormal log information or resource information to identify abnormal resources in the cloud platform, automatically generate a processing function according to the result, and execute the corresponding function through a cloud platform function computing component to process the abnormal resources; or the error log, the related abnormal resources and the related information are fed back to the operation and maintenance personnel as a message for manual handling. The specific steps are as follows:
S1, by writing the resources, filters and actions entries, the operation and maintenance personnel use the elasticsearch sdk to query and filter abnormal information in the logs of the target component, or use openstacksdk to query cloud platform resource information, thereby obtaining the abnormal resources and triggering an exception handling event;
S2, the exception handling module records, in the background and through the database module, each query, query result, trigger event and execution result to the database, and an API module is provided so that operation and maintenance personnel can look up historical queries and handling operations;
S3, the exception handling module records the mapping relation between the user-defined rule template and the exception handling template, and developers develop exception handling functions according to the structure of the processing engine;
S4, the operation and maintenance personnel call the API module to export a function template adapted to the function computing component and upload it to the cloud platform function computing component, and define the corresponding actions attributes in the template file according to the engine's usage document, so that the exception handling function can be called and triggered to process the abnormal resources.
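The record-keeping of step S2 can be illustrated with the following non-limiting sketch, which stores each query result, trigger event and execution result in a table and reads them back by rule name. SQLite is used only to keep the example self-contained; the text does not name a database product. The table layout and the function names `open_store`, `record` and `history` are assumptions.

```python
import json
import sqlite3

# Hypothetical table layout for the items that step S2 says are recorded.
SCHEMA = """
CREATE TABLE IF NOT EXISTS history (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    rule_name TEXT NOT NULL,
    query_result TEXT NOT NULL,
    trigger_event TEXT,
    execution_result TEXT,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
)
"""


def open_store(path=":memory:"):
    """Open the database and make sure the history table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record(conn, rule_name, query_result, trigger_event, execution_result):
    """Insert one history row and return its id; the query result is stored as JSON."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO history (rule_name, query_result, trigger_event, execution_result) "
            "VALUES (?, ?, ?, ?)",
            (rule_name, json.dumps(query_result), trigger_event, execution_result),
        )
    return cur.lastrowid


def history(conn, rule_name):
    """Return the recorded runs of one rule, oldest first."""
    rows = conn.execute(
        "SELECT query_result, trigger_event, execution_result "
        "FROM history WHERE rule_name = ? ORDER BY id",
        (rule_name,),
    ).fetchall()
    return [
        {"query_result": json.loads(q), "trigger_event": t, "execution_result": e}
        for q, t, e in rows
    ]


if __name__ == "__main__":
    store = open_store()
    record(store, "error-test-servers", [{"id": "a1"}], "reboot", "success")
    print(history(store, "error-test-servers"))
```

An API module as described in S2 could expose `history` to operation and maintenance personnel, for example behind an HTTP endpoint; that layer is omitted here.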
The rule template of step S3 in this embodiment is a declarative cloud resource configuration written in a simple YAML-based DSL; the rule template comprises resources, filters and actions;
resources defines the resource type; resource sources include cloud platform logs and resource information queried through the API module;
filters defines how resources are filtered, including plain value filtering and regular-expression matching;
actions defines the operations performed on abnormal resources; for resources selected from ERROR logs, the actions are executed manually after the abnormal log information has been queried and analyzed.
In this embodiment, the template files of step S4 include a log rule template file and a resource rule template file;
the log rule template file contains the following fields:
name: a user-defined name for the query;
description: a user-defined detailed description of the query;
resource: the resource type, identified as openstack.log <component name> <service name>; query filtering is also supported for cloud platform related services and for the physical machine logs rabbitmq.log, mysql.log and system.log;
filters: the filter conditions, which screen by the node where the service is located and by log level;
actions: for log queries this is generally set to waiting, meaning the log information is only filtered and not processed; after analysis, a new policy is created whose actions process the abnormal resources according to the query result;
the resource rule template file contains the following fields:
name: a user-defined name for the query;
description: a user-defined detailed description of the query;
resource: the service engine has built-in resource types that can be obtained through openstacksdk; the matching resource can be found by referring to the usage document;
filters: resource information is screened by value filtering and regular-expression matching;
actions: the service engine has built-in resource processing methods that can be performed through openstacksdk; the method that meets the needs of the operation and maintenance personnel can be found by referring to the usage document, and it triggers the corresponding event defined in function computing to process the abnormal resources.
Example 3:
the embodiment of the invention also provides a computer-readable storage medium, wherein a plurality of instructions are stored, and the instructions are loaded by the processor, so that the processor executes the automatic processing method for the cloud platform resource exception in any embodiment of the invention. Specifically, a system or an apparatus equipped with a storage medium on which software program codes that realize the functions of any of the above-described embodiments are stored may be provided, and a computer (or a CPU or MPU) of the system or the apparatus is caused to read out and execute the program codes stored in the storage medium.
In this case, the program code itself read from the storage medium can realize the functions of any of the above-described embodiments, and thus the program code and the storage medium storing the program code constitute a part of the present invention.
Examples of the storage medium for supplying the program code include a floppy disk, a hard disk, a magneto-optical disk, an optical disk (e.g., CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, DVD+RW), a magnetic tape, a nonvolatile memory card, and a ROM. Alternatively, the program code may be downloaded from a server computer via a communications network.
Further, it should be clear that the functions of any one of the above-described embodiments may be implemented not only by executing the program code read out by the computer, but also by causing an operating system or the like operating on the computer to perform a part or all of the actual operations based on instructions of the program code.
Further, it is to be understood that the program code read out from the storage medium is written to a memory provided in an expansion board inserted into the computer or to a memory provided in an expansion unit connected to the computer, and then causes a CPU or the like mounted on the expansion board or the expansion unit to perform part or all of the actual operations based on instructions of the program code, thereby realizing the functions of any of the above-described embodiments.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. An automatic processing system for cloud platform resource exceptions, characterized in that a rule module customizes a rule template to query and filter abnormal log information in a log module, or to filter abnormal resources in the cloud platform by resource information; an exception handling module exports functions adapted to the cloud platform function computing components of various vendors; operation and maintenance personnel, according to the defined rule template, use a serverless architecture to trigger the function processing logic defined in function computing, pass in the corresponding parameters, and execute repair and handling of the abnormal resources; or the error log, the related abnormal resources and the related information are fed back to the operation and maintenance personnel as a message for manual handling; meanwhile, the recorded information in the database module is queried by calling the API module, so that historical exceptions can be traced back.
2. The cloud platform resource exception handling system of claim 1, wherein the system comprises:
the API module, used for querying the historical query information and exception handling information in the database, or for calling the rule module to query and process abnormal resources;
the rule module, used for receiving the request, extracting the target data and customizing the rule template;
the log module, used for calling the elasticsearch sdk to filter the platform's WARNING and ERROR logs collected by the cloud platform's Prometheus or Grafana;
the database module, used for recording query information, query results, trigger events and execution results to the database;
the exception handling module, used for recording the mapping relation between the user-defined rule template and the exception handling template, analyzing the corresponding abnormal resources and triggering the corresponding function computing component event;
and the cloud platform function computing component, used for executing the corresponding function code according to the triggered event, and either repairing the exception or pushing the abnormal resources and log information to the operation and maintenance personnel for subsequent handling.
3. The cloud platform resource exception handling system according to claim 1 or 2, wherein the rule template is a declarative cloud resource configuration written in a simple YAML-based DSL; the rule template comprises resources, filters and actions;
resources defines the resource type; resource sources include cloud platform logs and resource information queried through the API module;
filters defines how resources are filtered, including plain value filtering and regular-expression matching;
actions defines the operations performed on abnormal resources; for resources selected from ERROR logs, the actions are executed manually after the abnormal log information has been queried and analyzed.
4. The cloud platform resource exception handling system according to claim 3, wherein the template files of the rule template comprise a log rule template file, and the log rule template file contains the following fields:
name: a user-defined name for the query;
description: a user-defined detailed description of the query;
resource: the resource type, identified as openstack.log <component name> <service name>; query filtering is also supported for cloud platform related services and for the physical machine logs rabbitmq.log, mysql.log and system.log;
filters: the filter conditions, which screen by the node where the service is located and by log level;
actions: for log queries this is generally set to waiting, meaning the log information is only filtered and not processed; after analysis, a new policy is created whose actions process the abnormal resources according to the query result.
5. The cloud platform resource exception handling system according to claim 4, wherein the template files of the rule template further comprise a resource rule template file, and the resource rule template file contains the following fields:
name: a user-defined name for the query;
description: a user-defined detailed description of the query;
resource: the service engine has built-in resource types that can be obtained through openstacksdk; the matching resource can be found by referring to the usage document;
filters: resource information is screened by value filtering and regular-expression matching;
actions: the service engine has built-in resource processing methods that can be performed through openstacksdk; the method that meets the needs of the operation and maintenance personnel can be found by referring to the usage document, and it triggers the corresponding event defined in function computing to process the abnormal resources.
6. An automatic processing method for cloud platform resource exceptions, characterized in that operation and maintenance personnel define query attributes through a rule template, query and filter abnormal log information or resource information to identify abnormal resources in the cloud platform, automatically generate a processing function according to the result, and execute the corresponding function through a cloud platform function computing component to process the abnormal resources; or the error log, the related abnormal resources and the related information are fed back to the operation and maintenance personnel as a message for manual handling.
7. The method for automatically processing cloud platform resource exceptions according to claim 6, wherein the method specifically comprises the following steps:
by writing the resources, filters and actions entries, the operation and maintenance personnel call the log module to query and filter abnormal information in the target component logs using the elasticsearch sdk, or use openstacksdk to query and filter cloud platform resource information, thereby acquiring the abnormal resources and triggering an exception handling event;
the exception handling module records, in the background and through the database module, each query, query result, trigger event and execution result to the database, and an API module is provided so that operation and maintenance personnel can look up historical queries and handling operations;
the exception handling module records the mapping relation between the user-defined rule template and the exception handling template, and developers develop exception handling functions according to the structure of the processing engine;
and the operation and maintenance personnel call the API module to export a function template adapted to the function computing component and upload it to the cloud platform function computing component, and define the corresponding actions attributes in the template file according to the engine's usage document, so that the exception handling function can be called and triggered to process the abnormal resources.
8. The cloud platform resource exception handling method according to claim 6 or 7, wherein the rule template is a declarative cloud resource configuration written in a simple YAML-based DSL; the rule template comprises resources, filters and actions;
resources defines the resource type; resource sources include cloud platform logs and resource information queried through the API module;
filters defines how resources are filtered, including plain value filtering and regular-expression matching;
actions defines the operations performed on abnormal resources; for resources selected from ERROR logs, the actions are executed manually after the abnormal log information has been queried and analyzed.
9. The cloud platform resource exception automatic processing method according to claim 8, wherein the template files of the rule template include a log rule template file and a resource rule template file;
the log rule template file contains the following fields:
name: a user-defined name for the query;
description: a user-defined detailed description of the query;
resource: the resource type, identified as openstack.log <component name> <service name>; query filtering is also supported for cloud platform related services and for the physical machine logs rabbitmq.log, mysql.log and system.log;
filters: the filter conditions, which screen by the node where the service is located and by log level;
actions: for log queries this is generally set to waiting, meaning the log information is only filtered and not processed; after analysis, a new policy is created whose actions process the abnormal resources according to the query result;
the resource rule template file contains the following fields:
name: a user-defined name for the query;
description: a user-defined detailed description of the query;
resource: the service engine has built-in resource types that can be obtained through openstacksdk; the matching resource can be found by referring to the usage document;
filters: resource information is screened by value filtering and regular-expression matching;
actions: the service engine has built-in resource processing methods that can be performed through openstacksdk; the method that meets the needs of the operation and maintenance personnel can be found by referring to the usage document, and it triggers the corresponding event defined in function computing to process the abnormal resources.
10. A computer-readable storage medium, wherein the computer-readable storage medium stores computer-executable instructions, and when a processor executes the computer-executable instructions, the method for automatically processing cloud platform resource exceptions according to any one of claims 6 to 9 is implemented.
CN202110834688.3A 2021-07-23 2021-07-23 Cloud platform resource abnormity automatic processing system and method Active CN113553238B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110834688.3A CN113553238B (en) 2021-07-23 2021-07-23 Cloud platform resource abnormity automatic processing system and method

Publications (2)

Publication Number Publication Date
CN113553238A true CN113553238A (en) 2021-10-26
CN113553238B CN113553238B (en) 2024-09-20

Family

ID=78104148

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110834688.3A Active CN113553238B (en) 2021-07-23 2021-07-23 Cloud platform resource abnormity automatic processing system and method

Country Status (1)

Country Link
CN (1) CN113553238B (en)

Also Published As

Publication number Publication date
CN113553238B (en) 2024-09-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant