CN113553238A - Cloud platform resource exception automatic processing system and method - Google Patents
Cloud platform resource exception automatic processing system and method
- Publication number
- CN113553238A (application CN202110834688.3A)
- Authority
- CN
- China
- Prior art keywords
- resource
- log
- abnormal
- resources
- cloud platform
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3065—Monitoring arrangements determined by the means or processing involved in reporting the monitored data
- G06F11/3072—Monitoring arrangements determined by the means or processing involved in reporting the monitored data where the reporting involves data filtering, e.g. pattern matching, time or event triggered, adaptive or policy-based reporting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/22—Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
- G06F11/2289—Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing by configuration test
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3089—Monitoring arrangements determined by the means or processing involved in sensing the monitored data, e.g. interfaces, connectors, sensors, probes, agents
- G06F11/3093—Configuration details thereof, e.g. installation, enabling, spatial arrangement of the probes
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Quality & Reliability (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Computer Hardware Design (AREA)
- Debugging And Monitoring (AREA)
Abstract
The invention discloses a system and a method for automatically handling cloud platform resource exceptions, belonging to the field of abnormal-resource query and automatic handling on cloud platforms. The technical problem to be solved is how to locate abnormal resources quickly and accurately, extract problem logs for analysis, and repair simple exceptions. The technical scheme is as follows: the system uses a rule template customized through a rule module to query and filter abnormal log information in a log module, or resource information, so as to screen out abnormal resources in the cloud platform; an exception handling module exports functions adapted to the function computing component of each vendor's cloud platform; according to the defined rule template, operation and maintenance personnel use a serverless architecture to trigger the function processing logic defined in function computing, pass in the corresponding parameters, and execute repair and handling of the abnormal resources; alternatively, the error log, the related abnormal resources, and related information are fed back to the operation and maintenance personnel as a message for manual handling. In addition, the records in a database module can be queried by calling an API module, so that historical exceptions can be traced back.
Description
Technical Field
The invention relates to the field of abnormal resource query and automatic processing of a cloud platform, in particular to a system and a method for automatically processing abnormal resources of the cloud platform.
Background
Cloud computing is currently developing rapidly, and technological and industrial innovation continues to emerge. Enterprise migration to the cloud is gradually becoming a trend, and with the sharp increase in basic cloud resources, how to manage, operate, and maintain massive resources efficiently has become an important problem for cloud vendors and platform operation and maintenance personnel.
Generally, a cloud computing vendor needs to perform security compliance checks, tag checks and calibration, configuration checks, security baseline checks, and the like on resources in a cloud platform at regular intervals. Meanwhile, platform exceptions and irregular customer operations leave some resources in an abnormal state. For operation and maintenance personnel, it is therefore increasingly important to locate abnormal resources quickly and accurately, extract problem logs for analysis, and repair simple exceptions.
At present, on a cloud platform with few resources, operation and maintenance personnel can check for abnormal resources in each project by executing commands manually. In an environment with hundreds of resources, manual checking becomes extremely difficult, and abnormal resources are generally checked and handled by executing scripts. However, scripts raise the learning cost for operation and maintenance personnel, and the code maintenance cost grows as the number of scripts grows. In addition, a large number of scripts running on the cloud platform as timed tasks wastes resources, so that the computing resources of the physical equipment cannot be fully utilized; the exception handling history of such scripts is also difficult to record, which makes it hard to trace and locate the cause of a problem. A commonly recommended alternative is to use a managed service of the cloud platform; many cloud platforms provide compliance check services for resources, such as the Congress service of OpenStack and the AWS Config service. Taking open-source OpenStack as an example, the Congress syntax resembles a functional style of writing and is relatively complex, which also raises the learning cost for operation and maintenance personnel; its usage scenarios are limited, and it is no longer maintained by the OpenStack community.
Disclosure of Invention
The invention provides a system and a method for automatically processing cloud platform resource abnormity, and aims to solve the problems of rapidly and accurately positioning abnormal resources, extracting problem logs for problem analysis and repairing simple abnormity.
The technical task of the invention is achieved as follows. In the system for automatically handling cloud platform resource exceptions, a rule template customized through a rule module is used to query and filter abnormal log information in a log module or abnormal resources in the cloud platform; an exception handling module exports functions adapted to the function computing component of each vendor's cloud platform; according to the defined rule template, operation and maintenance personnel use a serverless architecture to trigger the function processing logic defined in function computing, pass in the corresponding parameters, and execute repair and handling of the abnormal resources; alternatively, the error log, the related abnormal resources, and related information are fed back to the operation and maintenance personnel as a message for manual handling. In addition, the records in a database module can be queried by calling an API module, so that historical exceptions can be traced back.
Preferably, the system comprises:
an API module, used for querying historical query information and exception handling information in the database, or for calling the rule module to query and handle abnormal resources;
a rule module, used for receiving requests, extracting target data, and customizing rule templates;
a log module, used for calling Elasticsearch and its SDK to filter the platform's WARNING and ERROR logs collected by the cloud platform's Prometheus or Grafana;
a database module, used for recording query information, query results, trigger events, and execution results to a database;
an exception handling module, used for recording the mapping between customized rule templates and exception handling templates, analyzing the corresponding abnormal resources, and triggering the corresponding function computing component events; the exception handling template comprises a script library made up of a number of exception scripts, the exception scripts correspond one-to-one with the actions in the rule template, and triggering an actions event triggers the exception script that handles the corresponding abnormal data;
and a cloud platform function computing component, used for executing the corresponding function code according to the triggered event, either repairing the exception or pushing the abnormal resources and log information to operation and maintenance personnel for subsequent handling.
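The one-to-one correspondence between actions and exception scripts can be illustrated with a small registry. This is only a sketch under assumptions: the names `HANDLERS`, `register`, `trigger`, and the `notify` and `stop` actions are hypothetical and do not come from the patent, and the handlers return strings instead of calling a real platform.

```python
from typing import Callable, Dict, List

# Hypothetical registry: exactly one handler ("exception script") per action name.
HANDLERS: Dict[str, Callable[[dict], str]] = {}


def register(action: str):
    """Bind one handler to an action name and reject duplicates (one-to-one)."""
    def decorator(func: Callable[[dict], str]) -> Callable[[dict], str]:
        if action in HANDLERS:
            raise ValueError(f"action {action!r} already has a handler")
        HANDLERS[action] = func
        return func
    return decorator


@register("notify")
def notify(resource: dict) -> str:
    # Stand-in for pushing a message to operation and maintenance personnel.
    return f"notified ops about {resource['id']}"


@register("stop")
def stop(resource: dict) -> str:
    # Stand-in for a repair script; a real one would call the platform SDK.
    return f"stopped {resource['id']}"


def trigger(actions: List[str], resources: List[dict]) -> List[str]:
    """Run the handler mapped to each action against every abnormal resource."""
    results: List[str] = []
    for action in actions:
        handler = HANDLERS.get(action)
        if handler is None:
            raise KeyError(f"no handler mapped to action {action!r}")
        results.extend(handler(res) for res in resources)
    return results
```

In a deployment along the lines described, each handler would presumably be packaged as a function for the function computing component rather than called in-process as it is here.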
Preferably, the rule template is a declarative cloud resource configuration written in a simple YAML-based DSL; the rule template comprises resources, filters, and actions;
resources defines the resource type; resource sources include cloud platform logs and resource information queried through the API module;
filters defines how resources are filtered; the filtering methods include ordinary value filtering and regular-expression matching;
actions defines the operations performed on abnormal resources; for resources selected from ERROR logs, actions are executed manually after the abnormal log information has been queried and analyzed.
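The two filtering methods named above, value filtering and regular-expression matching, can be sketched as follows. The filter layout (`type`, `key`, `value`, `pattern`) and the function names are assumptions made for illustration; the patent does not specify the filter schema.

```python
import re
from typing import Any, Dict, List


def matches(resource: Dict[str, Any], flt: Dict[str, Any]) -> bool:
    """Evaluate one filter (assumed schema) against one resource record."""
    value = resource.get(flt["key"])
    if flt["type"] == "value":
        # Ordinary value filtering: exact equality on one attribute.
        return value == flt["value"]
    if flt["type"] == "regex":
        # Regular-expression matching on the string form of the attribute.
        return value is not None and re.search(flt["pattern"], str(value)) is not None
    raise ValueError(f"unknown filter type: {flt['type']!r}")


def apply_filters(resources: List[Dict[str, Any]],
                  filters: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only the resources that satisfy every filter (logical AND)."""
    return [r for r in resources if all(matches(r, f) for f in filters)]


servers = [
    {"name": "web-01", "status": "ERROR"},
    {"name": "web-02", "status": "ACTIVE"},
    {"name": "db-01", "status": "ERROR"},
]
filters = [
    {"type": "value", "key": "status", "value": "ERROR"},
    {"type": "regex", "key": "name", "pattern": r"^web-"},
]
abnormal = apply_filters(servers, filters)
```

Combining filters with a logical AND is a design choice of this sketch; an engine could equally support OR or nested conditions.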
Preferably, the template files of the rule template comprise a log rule template file, and the log rule template file comprises the following fields:
name: a user-defined name for the query;
description: a user-defined detailed description of the query;
resource: the resource type is identified using openstack.log <component name> <service name>; query filtering is also supported for logs of cloud-platform-related services and physical machines, namely rabbitmq.log, mysql.log, and system.log;
filters: defines the filtering conditions, which screen by the node where the service is located and by log level;
actions: for log queries this is generally set to waiting, meaning that log information is only filtered and not handled; after analysis, a new policy is created whose actions handle the abnormal resources according to the query result.
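As an illustration of the fields listed above, the following sketch writes a log rule as the dictionary a YAML loader would produce and checks it. Everything concrete here is hypothetical: the rule values, the helper names, and in particular the dot-separated identifier form `openstack.log.<component>.<service>`, which is an assumption about how the component and service names are joined.

```python
from typing import Any, Dict

REQUIRED_FIELDS = ("name", "description", "resource", "filters", "actions")

# Hypothetical log rule, shown as the dict a YAML loader would return.
log_rule: Dict[str, Any] = {
    "name": "nova-compute-errors",
    "description": "ERROR lines from nova-compute on one node",
    "resource": "openstack.log.nova.nova-compute",  # assumed dot-separated form
    "filters": [
        {"type": "value", "key": "node", "value": "compute-01"},
        {"type": "value", "key": "level", "value": "ERROR"},
    ],
    "actions": ["waiting"],  # filter only; nothing is handled yet
}


def validate_rule(rule: Dict[str, Any]) -> None:
    """Raise if any of the five template fields is absent."""
    missing = [field for field in REQUIRED_FIELDS if field not in rule]
    if missing:
        raise ValueError(f"rule is missing fields: {', '.join(missing)}")


def parse_log_resource(resource: str) -> Dict[str, str]:
    """Split an assumed 'openstack.log.<component>.<service>' identifier."""
    parts = resource.split(".")
    if len(parts) != 4 or parts[:2] != ["openstack", "log"]:
        raise ValueError(f"not an OpenStack log resource identifier: {resource!r}")
    return {"component": parts[2], "service": parts[3]}


def is_query_only(rule: Dict[str, Any]) -> bool:
    """True when the rule only filters logs and triggers no handling."""
    return rule["actions"] == ["waiting"]


validate_rule(log_rule)
target = parse_log_resource(log_rule["resource"])
```

Identifiers such as rabbitmq.log or mysql.log would need their own parsing branch, which this sketch leaves out.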
Preferably, the template files of the rule template further comprise a resource rule template file, and the resource rule template file comprises the following fields:
name: a user-defined name for the query;
description: a user-defined detailed description of the query;
resource: the service engine has built-in resource types that can be obtained through openstacksdk; the corresponding resource can be matched by consulting the usage documentation;
filters: resource information is screened through value filtering and regular-expression matching;
actions: the service engine has built-in resource handling methods that can be performed through openstacksdk; the method that meets the needs of operation and maintenance personnel can be matched by consulting the usage documentation, and the corresponding event defined in function computing is triggered to handle the abnormal resources.
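One way to picture how a resource rule leads to function computing events is to build one event payload per action and abnormal resource. The rule contents, the `openstack.server` type name, the payload keys, and `build_events` are all assumptions for illustration; in practice the resource records would come from openstacksdk rather than a literal list.

```python
import json
from typing import Any, Dict, List

# Hypothetical resource rule, shown as the dict a YAML loader would return.
resource_rule: Dict[str, Any] = {
    "name": "error-servers",
    "description": "Servers stuck in ERROR state",
    "resource": "openstack.server",  # assumed type name
    "filters": [{"type": "value", "key": "status", "value": "ERROR"}],
    "actions": ["notify"],
}


def build_events(rule: Dict[str, Any], abnormal: List[Dict[str, Any]]) -> List[str]:
    """Create one JSON event per (action, abnormal resource) pair."""
    events: List[str] = []
    for action in rule["actions"]:
        for res in abnormal:
            payload = {
                "rule": rule["name"],
                "action": action,
                "resource_id": res["id"],
            }
            events.append(json.dumps(payload, sort_keys=True))
    return events


# Records that a query through the SDK might have returned after filtering.
events = build_events(resource_rule, [{"id": "vm-1", "status": "ERROR"}])
```

Serializing the payload as JSON keeps it independent of any one vendor's function computing component, since such services commonly accept JSON events.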
A method for automatically handling cloud platform resource exceptions, in which operation and maintenance personnel customize query attributes through a rule template, and abnormal log information or resource information is queried and filtered to screen out abnormal resources in the cloud platform; a processing function is generated automatically according to the result, and the corresponding function is executed through the cloud platform function computing component to handle the abnormal resources; alternatively, the error log, the related abnormal resources, and related information are fed back to the operation and maintenance personnel as a message for manual handling.
Preferably, the method is specifically as follows:
by writing the resources, filters, and actions entries, operation and maintenance personnel either call the log module, which uses Elasticsearch and its SDK to query and filter abnormal information in the logs of the target component, or use openstacksdk to query and filter resource information of the cloud platform, thereby obtaining the abnormal resources and triggering an exception handling event;
in the background, the exception handling module records each piece of query information, query result, trigger event, and execution result to a database through the database module, and the API module is provided so that operation and maintenance personnel can query historical queries and handling operations;
the exception handling module records the mapping between customized rule templates and exception handling templates, and developers develop exception handling functions according to the structure of the processing engine;
and operation and maintenance personnel call the API module to export a function template adapted to the function computing component, upload it to the cloud platform function computing component, and define the corresponding actions attribute in the template file according to the engine's usage documentation, so that the exception handling function can be called and triggered to handle the abnormal resources.
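The steps above can be sketched end to end with an in-memory SQLite database standing in for the database module. The table layout, the simplified `{field: expected value}` filter form, and the names `run_rule`, `history`, and `fetch` are assumptions for illustration; the handler is a local callable rather than a real function computing invocation.

```python
import json
import sqlite3
from typing import Callable, Dict, List, Tuple


def init_db(conn: sqlite3.Connection) -> None:
    """Create the history table (hypothetical layout)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS history ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "query TEXT, result TEXT, event TEXT, outcome TEXT)"
    )


def run_rule(conn: sqlite3.Connection, rule: dict,
             fetch: Callable[[str], List[dict]],
             handlers: Dict[str, Callable[[dict], str]]) -> List[str]:
    """Query resources, filter them, trigger handlers, and record each handling."""
    wanted = rule["filters"]  # simplified here to {field: expected_value} pairs
    abnormal = [r for r in fetch(rule["resource"])
                if all(r.get(k) == v for k, v in wanted.items())]
    outcomes: List[str] = []
    for action in rule["actions"]:
        for res in abnormal:
            outcome = handlers[action](res)
            conn.execute(
                "INSERT INTO history (query, result, event, outcome) "
                "VALUES (?, ?, ?, ?)",
                (json.dumps(rule, sort_keys=True),
                 json.dumps(res, sort_keys=True), action, outcome),
            )
            outcomes.append(outcome)
    conn.commit()
    return outcomes


def history(conn: sqlite3.Connection) -> List[Tuple[str, str]]:
    """Return (event, outcome) rows, as an API module might for backtracking."""
    return conn.execute("SELECT event, outcome FROM history ORDER BY id").fetchall()


def fetch(kind: str) -> List[dict]:
    # Stand-in for a platform query; a real one would go through the SDK.
    return [{"id": "vol-1", "status": "error"},
            {"id": "vol-2", "status": "available"}]


conn = sqlite3.connect(":memory:")
init_db(conn)
rule = {"name": "error-volumes", "resource": "volume",
        "filters": {"status": "error"}, "actions": ["notify"]}
handlers = {"notify": lambda res: f"notified:{res['id']}"}
outcomes = run_rule(conn, rule, fetch, handlers)
```

This sketch records a row only when a handler runs; recording queries that match nothing, as the text suggests for every query, would need an extra insert.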
Preferably, the rule template is a declarative cloud resource configuration written in a simple YAML-based DSL; the rule template comprises resources, filters, and actions;
resources defines the resource type; resource sources include cloud platform logs and resource information queried through the API module;
filters defines how resources are filtered; the filtering methods include ordinary value filtering and regular-expression matching;
actions defines the operations performed on abnormal resources; for resources selected from ERROR logs, actions are executed manually after the abnormal log information has been queried and analyzed.
Preferably, the template files of the rule template comprise a log rule template file and a resource rule template file;
the log rule template file comprises the following fields:
name: a user-defined name for the query;
description: a user-defined detailed description of the query;
resource: the resource type is identified using openstack.log <component name> <service name>; query filtering is also supported for logs of cloud-platform-related services and physical machines, namely rabbitmq.log, mysql.log, and system.log;
filters: defines the filtering conditions, which screen by the node where the service is located and by log level;
actions: for log queries this is generally set to waiting, meaning that log information is only filtered and not handled; after analysis, a new policy is created whose actions handle the abnormal resources according to the query result;
the resource rule template file comprises the following fields:
name: a user-defined name for the query;
description: a user-defined detailed description of the query;
resource: the service engine has built-in resource types that can be obtained through openstacksdk; the corresponding resource can be matched by consulting the usage documentation;
filters: resource information is screened through value filtering and regular-expression matching;
actions: the service engine has built-in resource handling methods that can be performed through openstacksdk; the method that meets the needs of operation and maintenance personnel can be matched by consulting the usage documentation, and the corresponding event defined in function computing is triggered to handle the abnormal resources.
A computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the method for automatically handling cloud platform resource exceptions described above.
The cloud platform resource abnormity automatic processing system and method provided by the invention have the following advantages:
(I) By defining rule entries, abnormal resource information is acquired and an event is triggered that either repairs the exception or pushes a message to operation and maintenance personnel for manual handling. Defining a structured rule template reduces the cost to operation and maintenance personnel of collecting abnormal resources and of learning, and streamlines the operation and maintenance workflow. Introducing the function computing model allows exception handling scripts to be managed in a standardized way, so that faults can be archived and traced. The frequency of script execution is refined: the original form of timed tasks alone is replaced by a combination of manual inspection and timed tasks, which further optimizes cloud platform computing resources and improves resource utilization;
(II) The invention draws on the implementation idea of the open-source project Cloud Custodian, which manages resources by defining simple rule policies. It also extends Cloud Custodian and, in combination with the function computing model, provides a method for automatically handling OpenStack cloud platform resource exceptions based on a function computing architecture. Operation and maintenance personnel can customize query attributes through a rule template to filter the abnormal logs collected by the log module or the queried resource information; a processing function is generated automatically according to the result and executed through the function computing component of the cloud platform to handle abnormal resources; for complex exceptions requiring manual intervention, a message is sent that feeds the abnormal resource information and the queried logs back to the operation and maintenance administrator for further analysis and handling.
Cloud Custodian is an automated compliance checking tool for public cloud scenarios. It is a baseline checking tool for declarative cloud resource configuration written in a simple YAML-based DSL: rules are defined in standard YAML, cloud resources that do not conform to the baseline configuration can be retrieved and corrected automatically, and cloud infrastructure is managed in this way. However, Cloud Custodian currently supports only AWS, Azure, and GCP environments;
Serverless is a cloud-native development model that lets developers focus on building and running applications without managing servers. Compared with microservices, it is a finer-grained service architecture: each API operation performed by a user is split further, that is, operations on a resource such as create, read, delete, and update are each abstracted into a function, and the service is published by exposing these functions directly. Serverless is therefore commonly referred to as Function as a Service (FaaS); vendors currently providing function computing services include AWS, Alibaba Cloud, Tencent Cloud, Wagnen cloud, and the like. From the perspective of cloud computing, serverless, the computing model that has risen after microservices, makes maximal use of computing resources and reduces resource idleness and fragmentation;
(III) Operation and maintenance personnel can query and handle historical information by calling the API module. All information, such as historical call requests, abnormal data, handling events, and handling results, is archived in the database, and a RESTful API is exposed so that operation and maintenance personnel can track the handling history, analyze the causes of exceptions, and trace exceptions back;
(IV) Defining a structured rule module simplifies the handling logic for operation and maintenance personnel. The collected logs and the resource information obtained by calling the platform's openstacksdk are further classified and abstracted into information that can be queried through the rule module, which reduces the learning cost for operation and maintenance personnel, improves readability, greatly improves operation and maintenance efficiency, and allows abnormal resources and services to be located accurately;
(V) In the background, information such as abnormal logs, platform resource state, platform service state, physical machine system logs, and virtual machine system logs is obtained by calling the monitoring and log interfaces, the platform's openstacksdk, shell commands, and the like, and the corresponding calls are abstracted into a structured rule module for operation and maintenance personnel to use. The abnormal resource data sources are rich, ranging from log information of platform components, physical equipment, and related services to state information of cloud platform resources such as virtual machines, images, and volumes obtained by calling openstacksdk, and thus cover the nodes, services, and resources of the cloud platform;
(VI) Functions usable by the cloud platform function computing component are exported by calling the API, exception handling scripts are managed in a standardized way, and the scripts are mapped to the actions in the rule engine to implement the exception handling logic. The cloud platform function computing component is introduced to run abnormal-resource handling scripts in an event-triggered manner: operation and maintenance personnel can call the API of the exception handling module to export functions adapted to each cloud platform's function computing component, use a serverless architecture to trigger events in function computing according to the customized rule template, pass in the corresponding parameters, and execute repair and handling of the abnormal resources;
(VII) The exception handling module records the mapping between rule templates and exception handling templates. Developers can develop exception handling functions according to the structure of the processing engine without attending to the underlying physical runtime environment, which simplifies development, and handling scripts are easier to maintain when delivered as a service. Operation and maintenance personnel only need to call the API to export a function template adapted to the function computing component, upload it to the cloud platform function computing component, and define the corresponding actions attribute in the template file according to the engine's usage documentation; the exception handling function can then be called and triggered to handle abnormal resources without concern for the underlying implementation logic, which reduces learning cost and improves operation and maintenance efficiency. The operation history of operation and maintenance personnel can be controlled and managed at a fine granularity through the cloud platform function computing component, and exceptions are handled only when called, which optimizes the utilization of cloud platform resources.
Drawings
The invention is further described below with reference to the accompanying drawings.
Fig. 1 is a structural block diagram of an automatic processing system for cloud platform resource exception.
Detailed Description
The automatic processing system and method for cloud platform resource exception according to the present invention will be described in detail below with reference to the drawings and specific embodiments of the specification.
In the description of the present invention, it is to be understood that the terms "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like indicate orientations or positional relationships based on those shown in the drawings. They are used only for convenience and simplicity of description and are not intended to indicate or imply that the referenced device or element must have a particular orientation or be constructed and operated in a particular orientation; they are therefore not to be considered limiting of the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted" and "connected" are to be construed broadly, e.g., as meaning a fixed connection, a removable connection, or an integral connection; the connection may be mechanical or electrical; elements may be connected directly or indirectly through intervening media, or two elements may be in internal communication. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to the specific case.
Example 1:
In the cloud platform resource exception automatic handling system provided by the invention, a rule template customized through a rule module is used to query and filter abnormal log information in a log module, or resource information, so as to screen out abnormal resources in the cloud platform; an exception handling module exports functions adapted to the function computing component of each vendor's cloud platform; according to the defined rule template, operation and maintenance personnel use a serverless architecture to trigger the function processing logic defined in function computing, pass in the corresponding parameters, and execute repair and handling of the abnormal resources; alternatively, the error log, the related abnormal resources, and related information are fed back to the operation and maintenance personnel as a message for manual handling. In addition, the records in a database module can be queried by calling an API module, so that historical exceptions can be traced back. As shown in Fig. 1, the system comprises:
an API module, used for querying historical query information and exception handling information in the database, or for calling the rule module to query and handle abnormal resources;
a rule module, used for receiving requests, extracting target data, and customizing rule templates;
a log module, used for calling Elasticsearch and its SDK to filter the platform's WARNING and ERROR logs collected by the cloud platform's Prometheus or Grafana;
a database module, used for recording query information, query results, trigger events, and execution results to a database;
an exception handling module, used for recording the mapping between customized rule templates and exception handling templates, analyzing the corresponding abnormal resources, and triggering the corresponding function computing component events; the exception handling template comprises a script library made up of a number of exception scripts, the exception scripts correspond one-to-one with the actions in the rule template, and triggering an actions event triggers the exception script that handles the corresponding abnormal data;
and a cloud platform function computing component, used for executing the corresponding function code according to the triggered event, either repairing the exception or pushing the abnormal resources and log information to operation and maintenance personnel for subsequent handling.
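For the log module, a request for WARNING and ERROR logs could be expressed as an Elasticsearch query DSL body. The sketch below only builds that body as a plain dictionary; the field names `level`, `hostname`, and `@timestamp` are assumptions about how the logs are indexed, and no client call is made.

```python
from typing import Any, Dict, List, Optional


def build_log_query(levels: List[str], node: Optional[str] = None,
                    since: str = "now-1h", size: int = 100) -> Dict[str, Any]:
    """Return an Elasticsearch query DSL body selecting logs by level, time, and node."""
    conditions: List[Dict[str, Any]] = [
        {"terms": {"level": levels}},               # assumed field name
        {"range": {"@timestamp": {"gte": since}}},  # assumed timestamp field
    ]
    if node is not None:
        conditions.append({"term": {"hostname": node}})  # assumed field name
    return {
        "size": size,
        "sort": [{"@timestamp": {"order": "desc"}}],
        # Filter context: conditions are matched exactly, without relevance scoring.
        "query": {"bool": {"filter": conditions}},
    }


body = build_log_query(["WARNING", "ERROR"], node="compute-01")
```

The resulting dictionary would then be passed to whichever Elasticsearch client the platform uses as the search request body.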
The rule template in this embodiment is a declarative cloud resource configuration written in a simple YAML-based DSL; the rule template comprises resources, filters, and actions;
resources defines the resource type; resource sources include cloud platform logs and resource information queried through the API module;
filters defines how resources are filtered; the filtering methods include ordinary value filtering and regular-expression matching;
actions defines the operations performed on abnormal resources; for resources selected from ERROR logs, actions are executed manually after the abnormal log information has been queried and analyzed.
The template files of the rule template in this embodiment comprise a log rule template file, and the log rule template file comprises the following fields:
name: a user-defined name for the query;
description: a user-defined detailed description of the query;
resource: the resource type is identified using openstack.log <component name> <service name>; query filtering is also supported for logs of cloud-platform-related services and physical machines, namely rabbitmq.log, mysql.log, and system.log;
filters: defines the filtering conditions, which screen by the node where the service is located and by log level;
actions: for log queries this is generally set to waiting, meaning that log information is only filtered and not handled; after analysis, a new policy is created whose actions handle the abnormal resources according to the query result. An example log rule template file is as follows:
The template files of the rule template in this embodiment further include a resource rule template file, which includes the following fields:
name: the user-defined name of the query;
description: a user-defined detailed description of the query;
resource: the service engine has built-in resource types that can be obtained through openstacksdk; the corresponding resource can be matched by consulting the usage documentation;
filters: filters resource information through value filtering and regular-expression matching;
actions: the service engine has built-in resource processing methods that can be performed through openstacksdk; the method matching the needs of operation and maintenance personnel can be found by consulting the usage documentation, and it triggers the corresponding event defined in function computing to process the abnormal resources. An example of a resource rule template file is as follows:
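As a hedged illustration of the five fields described for both template kinds, the sketch below checks that a template carries every field and classifies it as a log rule or a resource rule. The templates are written as Python dictionaries that mirror what a YAML file would load to. Every value shown, including the dotted spelling `openstack.log.nova.compute`, the filter layout, and the `reboot` action name, is hypothetical and not taken from the filing.

```python
REQUIRED_FIELDS = ("name", "description", "resource", "filters", "actions")
# Physical machine logs named in the description.
PHYSICAL_LOGS = {"rabbitmq.log", "mysql.log", "system.log"}


def classify_template(template: dict) -> str:
    """Validate the five fields and return "log" or "resource"."""
    missing = [f for f in REQUIRED_FIELDS if f not in template]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    resource = template["resource"]
    if resource.startswith("openstack.log") or resource in PHYSICAL_LOGS:
        return "log"
    return "resource"


# Hypothetical log rule: only filter, leave handling to a person ("waiting").
log_rule = {
    "name": "nova-compute-errors",
    "description": "ERROR lines from nova-compute on one node",
    "resource": "openstack.log.nova.compute",  # assumed spelling of the identifier
    "filters": [{"node": "compute-01"}, {"level": "ERROR"}],
    "actions": ["waiting"],
}

# Hypothetical resource rule: select servers in ERROR state and act on them.
resource_rule = {
    "name": "servers-in-error",
    "description": "Servers whose status is ERROR",
    "resource": "server",
    "filters": [{"type": "value", "key": "status", "value": "ERROR"}],
    "actions": ["reboot"],
}

kinds = [classify_template(log_rule), classify_template(resource_rule)]
```

Rejecting incomplete templates before any query runs keeps malformed rules from reaching the log or resource back ends.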
The working process of the system is as follows:
(1) the operation and maintenance personnel call the API module to query the historical query information and exception handling information in the database, or call the rule module to query and process abnormal resources;
(2) the rule module receives the request and extracts the target data; depending on the resources parameter of the request, the data source is either the log module, which calls the elasticsearch sdk to filter the platform's WARNING and ERROR abnormal logs collected by the cloud platform's Prometheus or Grafana, or a call to the underlying openstacksdk to query the state of the cloud platform's resources;
(3) the exception handling module analyzes the corresponding exception according to the query result and triggers the corresponding cloud platform function computing component event;
(4) the cloud platform function computing component executes the corresponding function code according to the triggered event, and either repairs the abnormal problem or pushes the abnormal resources and log information to the operation and maintenance personnel for subsequent processing.
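Steps (2) to (4) can be sketched as a single function. This is an assumption-laden outline, not the filing's implementation: the log and resource back ends are passed in as plain callables so that no real Elasticsearch or openstacksdk call is implied, the record fields `id` and `level` are hypothetical, and the rule's filters are left out for brevity.

```python
from typing import Any, Callable, Iterable

Record = dict[str, Any]
LOG_LEVELS = ("WARNING", "ERROR")


def is_log_resource(resource: str) -> bool:
    """Assumed convention: log resources start with openstack.log or end in .log."""
    return resource.startswith("openstack.log") or resource.endswith(".log")


def process_request(
    rule: dict[str, Any],
    query_logs: Callable[[str], Iterable[Record]],
    query_resources: Callable[[str], Iterable[Record]],
    trigger_event: Callable[[str, Record], None],
    notify_operator: Callable[[Record], None],
) -> list[tuple[str, Any]]:
    # Step (2): pick the data source from the rule's resource parameter.
    if is_log_resource(rule["resource"]):
        records = [r for r in query_logs(rule["resource"]) if r.get("level") in LOG_LEVELS]
    else:
        records = list(query_resources(rule["resource"]))

    # Steps (3) and (4): either hand the record to a person or trigger a function event.
    handled = []
    for record in records:
        for action in rule["actions"]:
            if action == "waiting":
                notify_operator(record)
                handled.append(("notify", record["id"]))
            else:
                trigger_event(action, record)
                handled.append((action, record["id"]))
    return handled


# Usage with stand-in back ends.
sent: list[Record] = []
fired: list[tuple[str, Any]] = []
fake_logs = lambda resource: [{"id": 1, "level": "INFO"}, {"id": 2, "level": "ERROR"}]
result = process_request(
    {"resource": "mysql.log", "actions": ["waiting"]},
    fake_logs,
    lambda resource: [],
    lambda action, record: fired.append((action, record["id"])),
    sent.append,
)
```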
Example 2:
This embodiment provides an automatic processing method for cloud platform resource exceptions, in which operation and maintenance personnel define query attributes through a rule template, query and filter abnormal log information or resource information to screen out the abnormal resources in the cloud platform, automatically generate a processing function according to the result, and execute the corresponding function through the cloud platform function computing component to process the abnormal resources; or the error log, the related abnormal resources and related information are fed back to the operation and maintenance personnel as a message for manual processing. The specific steps are as follows:
S1, the operation and maintenance personnel write the resources, filters and actions entries, and either use the elasticsearch sdk to query and filter abnormal information in the target component's logs or use openstacksdk to query cloud platform resource information, thereby obtaining the abnormal resources and triggering an exception handling event;
S2, the back end of the exception handling module records each query, query result, trigger event and execution result to the database through the database module, and provides the API module so that operation and maintenance personnel can look up historical queries and handling operations;
S3, the exception handling module records the mapping between user-defined rule templates and exception handling templates, and developers develop exception handling functions according to the structure of the processing engine;
S4, the operation and maintenance personnel call the API module to export a function template adapted to the function computing component and upload it to the cloud platform function computing component; defining the corresponding actions attribute in the template file according to the engine's usage documentation is then enough to call and trigger the exception handling function that processes the abnormal resources.
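The mapping between actions and exception handling functions (S3) and the history record (S2) could be sketched as below. This is a minimal sketch under stated assumptions: the registry, the `reboot` handler, and the `history` table layout are all hypothetical, an in-memory SQLite database stands in for the database module, and the handler only returns a message rather than calling any cloud API.

```python
import json
import sqlite3
from typing import Callable

# Hypothetical registry: one handler per action name, mirroring the
# one-to-one correspondence between actions and exception scripts.
HANDLERS: dict[str, Callable[[dict], str]] = {}


def handler(action: str):
    """Decorator that registers a handling function under an action name."""
    def register(func: Callable[[dict], str]) -> Callable[[dict], str]:
        HANDLERS[action] = func
        return func
    return register


@handler("reboot")
def reboot_server(resource: dict) -> str:
    # Placeholder body: a real handler would call the platform here.
    return f"reboot requested for {resource['id']}"


def open_history(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS history "
        "(query TEXT, result TEXT, event TEXT, outcome TEXT)"
    )
    return conn


def handle(conn: sqlite3.Connection, query_name: str, resource: dict, action: str) -> str:
    """Run the handler mapped to the action and record what happened."""
    outcome = HANDLERS[action](resource)
    conn.execute(
        "INSERT INTO history VALUES (?, ?, ?, ?)",
        (query_name, json.dumps(resource), action, outcome),
    )
    conn.commit()
    return outcome


conn = open_history()
handle(conn, "servers-in-error", {"id": "srv-1", "status": "ERROR"}, "reboot")
rows = conn.execute("SELECT query, event, outcome FROM history").fetchall()
```

Keeping the record write inside `handle` means every triggered action leaves a row that a later history query can return.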
The rule template of step S3 in this embodiment is a simple YAML-based DSL for declarative cloud resource configuration; the rule template comprises resources, filters and actions;
resources defines the resource type; resource sources comprise cloud platform logs and resource information queried through the API module;
filters defines how resources are filtered; the filtering methods comprise ordinary value filtering and regular-expression matching;
actions defines the operations performed on abnormal resources; when the selected resource is an ERROR log, the actions are executed manually after the abnormal log information has been queried and analyzed.
In this embodiment, the template files of step S4 include a log rule template file and a resource rule template file;
the log rule template file includes the following fields:
name: the user-defined name of the query;
description: a user-defined detailed description of the query;
resource: identifies the resource type in the form openstack.log <component name> <service name>; query filtering is also supported for cloud platform related services and for the physical machine logs rabbitmq.log, mysql.log and system.log;
filters: defines the filter conditions, screening by the node where the service is located and by log level;
actions: for log queries this is generally set to waiting, which means the log information is only filtered and not processed; after analysis, a new policy is created that defines actions to process the abnormal resources according to the query result;
the resource rule template file includes the following fields:
name: the user-defined name of the query;
description: a user-defined detailed description of the query;
resource: the service engine has built-in resource types that can be obtained through openstacksdk; the corresponding resource can be matched by consulting the usage documentation;
filters: filters resource information through value filtering and regular-expression matching;
actions: the service engine has built-in resource processing methods that can be performed through openstacksdk; the method matching the needs of operation and maintenance personnel can be found by consulting the usage documentation, and it triggers the corresponding event defined in function computing to process the abnormal resources.
Example 3:
The embodiment of the invention also provides a computer-readable storage medium storing a plurality of instructions which, when loaded by a processor, cause the processor to execute the automatic processing method for cloud platform resource exceptions in any embodiment of the invention. Specifically, a system or an apparatus may be provided that is equipped with a storage medium storing software program code that realizes the functions of any of the above-described embodiments, and a computer (or a CPU or MPU) of the system or apparatus is caused to read out and execute the program code stored in the storage medium.
In this case, the program code itself read from the storage medium can realize the functions of any of the above-described embodiments, and thus the program code and the storage medium storing the program code constitute a part of the present invention.
Examples of the storage medium for supplying the program code include a floppy disk, a hard disk, a magneto-optical disk, an optical disk (e.g., CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, DVD+RW), a magnetic tape, a nonvolatile memory card, and a ROM. Alternatively, the program code may be downloaded from a server computer via a communications network.
Further, it should be clear that the functions of any one of the above-described embodiments may be implemented not only by executing the program code read out by the computer, but also by causing an operating system or the like operating on the computer to perform a part or all of the actual operations based on instructions of the program code.
Further, it is to be understood that the program code read out from the storage medium is written to a memory provided in an expansion board inserted into the computer or to a memory provided in an expansion unit connected to the computer, and then causes a CPU or the like mounted on the expansion board or the expansion unit to perform part or all of the actual operations based on instructions of the program code, thereby realizing the functions of any of the above-described embodiments.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. An automatic processing system for cloud platform resource exceptions, characterized in that the system uses rule templates customized through a rule module to query and filter abnormal log information in a log module, or queries resource information to filter abnormal resources in the cloud platform; an exception handling module exports functions adapted to the cloud platform function computing components of various vendors; according to the defined rule template, operation and maintenance personnel use a serverless architecture to trigger the function processing logic defined in function computing, pass in the corresponding parameters, and execute the repair and processing of abnormal resources; or the error log, the related abnormal resources and related information are fed back to the operation and maintenance personnel as a message for manual processing; meanwhile, the recorded information in a database module is queried by calling an API module, so that historical abnormal problems can be traced back.
2. The cloud platform resource exception handling system of claim 1, wherein the system comprises:
the API module, used for querying the historical query information and exception handling information in the database, or for calling the rule module to query and process abnormal resources;
the rule module, used for receiving the request, extracting the target data and customizing the rule template;
the log module, used for calling the elasticsearch sdk to filter the platform's WARNING and ERROR abnormal logs collected by the cloud platform's Prometheus or Grafana;
the database module, used for recording query information, query results, trigger events and execution results to a database;
the exception handling module, used for recording the mapping between user-defined rule templates and exception handling templates, analyzing the corresponding abnormal resources and triggering the corresponding function computing component events;
and the cloud platform function computing component, used for executing the corresponding function code according to the triggered event, either repairing the abnormal problem or pushing the abnormal resources and log information to operation and maintenance personnel for subsequent processing.
3. The cloud platform resource exception handling system according to claim 1 or 2, wherein the rule template is a simple YAML-based DSL for declarative cloud resource configuration; the rule template comprises resources, filters and actions;
resources defines the resource type; resource sources comprise cloud platform logs and resource information queried through the API module;
filters defines how resources are filtered; the filtering methods comprise ordinary value filtering and regular-expression matching;
actions defines the operations performed on abnormal resources; when the selected resource is an ERROR log, the actions are executed manually after the abnormal log information has been queried and analyzed.
4. The cloud platform resource exception handling system according to claim 3, wherein the template files of the rule template comprise a log rule template file, and the log rule template file comprises the following fields:
name: the user-defined name of the query;
description: a user-defined detailed description of the query;
resource: identifies the resource type in the form openstack.log <component name> <service name>; query filtering is also supported for cloud platform related services and for the physical machine logs rabbitmq.log, mysql.log and system.log;
filters: defines the filter conditions, screening by the node where the service is located and by log level;
actions: for log queries this is generally set to waiting, which means the log information is only filtered and not processed; after analysis, a new policy is created that defines actions to process the abnormal resources according to the query result.
5. The cloud platform resource exception handling system according to claim 4, wherein the template files of the rule template further comprise a resource rule template file, the resource rule template file comprising the following fields:
name: the user-defined name of the query;
description: a user-defined detailed description of the query;
resource: the service engine has built-in resource types that can be obtained through openstacksdk; the corresponding resource can be matched by consulting the usage documentation;
filters: filters resource information through value filtering and regular-expression matching;
actions: the service engine has built-in resource processing methods that can be performed through openstacksdk; the method matching the needs of operation and maintenance personnel can be found by consulting the usage documentation, and it triggers the corresponding event defined in function computing to process the abnormal resources.
6. An automatic processing method for cloud platform resource exceptions, characterized in that operation and maintenance personnel define query attributes through a rule template, query and filter abnormal log information or resource information to screen out the abnormal resources in a cloud platform, automatically generate a processing function according to the result, and execute the corresponding function through a cloud platform function computing component to process the abnormal resources; or the error log, the related abnormal resources and related information are fed back to the operation and maintenance personnel as a message for manual processing.
7. The method for automatically processing the cloud platform resource exception according to claim 6, wherein the method specifically comprises the following steps:
the operation and maintenance personnel write the resources, filters and actions entries, and either call the log module to query and filter abnormal information in the target component's logs using the elasticsearch sdk, or query and filter cloud platform resource information using openstacksdk, thereby obtaining the abnormal resources and triggering an exception handling event;
the back end of the exception handling module records each query, query result, trigger event and execution result to a database through a database module, and provides an API module so that operation and maintenance personnel can look up historical queries and handling operations;
the exception handling module records the mapping between user-defined rule templates and exception handling templates, and developers develop exception handling functions according to the structure of the processing engine;
and the operation and maintenance personnel call the API module to export a function template adapted to the function computing component and upload it to the cloud platform function computing component, and define the corresponding actions attribute in the template file according to the engine's usage documentation, whereupon the exception handling function can be called and triggered to process the abnormal resources.
8. The cloud platform resource exception handling method according to claim 6 or 7, wherein the rule template is a simple YAML-based DSL for declarative cloud resource configuration; the rule template comprises resources, filters and actions;
resources defines the resource type; resource sources comprise cloud platform logs and resource information queried through the API module;
filters defines how resources are filtered; the filtering methods comprise ordinary value filtering and regular-expression matching;
actions defines the operations performed on abnormal resources; when the selected resource is an ERROR log, the actions are executed manually after the abnormal log information has been queried and analyzed.
9. The cloud platform resource exception automatic processing method according to claim 8, wherein the template files of the rule template include a log rule template file and a resource rule template file;
the log rule template file comprises the following fields:
name: the user-defined name of the query;
description: a user-defined detailed description of the query;
resource: identifies the resource type in the form openstack.log <component name> <service name>; query filtering is also supported for cloud platform related services and for the physical machine logs rabbitmq.log, mysql.log and system.log;
filters: defines the filter conditions, screening by the node where the service is located and by log level;
actions: for log queries this is generally set to waiting, which means the log information is only filtered and not processed; after analysis, a new policy is created that defines actions to process the abnormal resources according to the query result;
the resource rule template file includes the following fields:
name: the user-defined name of the query;
description: a user-defined detailed description of the query;
resource: the service engine has built-in resource types that can be obtained through openstacksdk; the corresponding resource can be matched by consulting the usage documentation;
filters: filters resource information through value filtering and regular-expression matching;
actions: the service engine has built-in resource processing methods that can be performed through openstacksdk; the method matching the needs of operation and maintenance personnel can be found by consulting the usage documentation, and it triggers the corresponding event defined in function computing to process the abnormal resources.
10. A computer-readable storage medium, wherein the computer-readable storage medium stores computer-executable instructions, and when a processor executes the computer-executable instructions, the method for automatically processing the cloud platform resource exception according to any one of claims 6 to 9 is implemented.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110834688.3A CN113553238B (en) | 2021-07-23 | 2021-07-23 | Cloud platform resource abnormity automatic processing system and method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113553238A (en) | 2021-10-26 |
CN113553238B (en) | 2024-09-20 |
Family
ID=78104148
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110834688.3A Active CN113553238B (en) | 2021-07-23 | 2021-07-23 | Cloud platform resource abnormity automatic processing system and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113553238B (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2015018164A1 (en) * | 2013-08-08 | 2015-02-12 | 中国科学院计算机网络信息中心 | Method for actively obtaining data from heterogeneous enterprise information system |
EP3454214A1 (en) * | 2017-09-08 | 2019-03-13 | Accenture Global Solutions Limited | Infrastructure instantiation, collaboration, and validation architecture for serverless execution frameworks |
CN109471846A (en) * | 2018-11-02 | 2019-03-15 | 中国电子科技网络信息安全有限公司 | User behavior auditing system and method on a kind of cloud based on cloud log analysis |
CN112769605A (en) * | 2020-12-30 | 2021-05-07 | 杭州东方通信软件技术有限公司 | Heterogeneous multi-cloud operation and maintenance management method and hybrid cloud platform |
Non-Patent Citations (3)
Title |
---|
VETHANAYAGAM 等: "Threat Identification from Access Logs Using Elastic Stack", COMPUTER SCIENCE MASTERS PAPERS, 30 November 2020 (2020-11-30) * |
郭鹏程;李迎春;付春燕;苏云霞;曹炳尧;: "基于ELK的视频会议设备日志管理分析系统", 工业控制计算机, no. 08, 25 August 2017 (2017-08-25) * |
黎德生;金连文;李磊;李小宁;: "基于运行信息机制的OpenStack云平台容错改进方案", 华中科技大学学报(自然科学版), no. 1, 15 December 2012 (2012-12-15) * |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114257495A (en) * | 2021-11-16 | 2022-03-29 | 国家电网有限公司客户服务中心 | Automatic processing system for abnormity of cloud platform computing node |
CN114979100A (en) * | 2022-04-15 | 2022-08-30 | 深信服科技股份有限公司 | Cloud resource checking method and related device |
CN114979100B (en) * | 2022-04-15 | 2024-02-23 | 深信服科技股份有限公司 | Cloud resource inspection method and related device |
CN115762090A (en) * | 2022-12-05 | 2023-03-07 | 中信银行股份有限公司 | Financial-level system intelligent monitoring and early warning method and system based on convolutional neural network |
CN116360797A (en) * | 2023-06-02 | 2023-06-30 | 北京长亭科技有限公司 | DSL-based security baseline creation method, system, device and medium |
CN116360797B (en) * | 2023-06-02 | 2023-10-27 | 北京长亭科技有限公司 | DSL-based security baseline creation method, system, device and medium |
CN116643950A (en) * | 2023-07-19 | 2023-08-25 | 浩鲸云计算科技股份有限公司 | FaaS-based cloud native application automatic operation and maintenance method |
CN116643950B (en) * | 2023-07-19 | 2023-10-20 | 浩鲸云计算科技股份有限公司 | FaaS-based cloud native application automatic operation and maintenance method |
CN116755992A (en) * | 2023-08-17 | 2023-09-15 | 青岛民航凯亚系统集成有限公司 | Log analysis method and system based on OpenStack cloud computing |
CN116755992B (en) * | 2023-08-17 | 2023-12-01 | 青岛民航凯亚系统集成有限公司 | Log analysis method and system based on OpenStack cloud computing |
CN118227424A (en) * | 2024-05-24 | 2024-06-21 | 山东浪潮数字商业科技有限公司 | Log record template access method and system applied to distributed component |
CN118227424B (en) * | 2024-05-24 | 2024-07-26 | 山东浪潮数字商业科技有限公司 | Log record template access method and system applied to distributed component |
Also Published As
Publication number | Publication date |
---|---|
CN113553238B (en) | 2024-09-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113553238A (en) | Cloud platform resource exception automatic processing system and method | |
US11151083B2 (en) | Generating target application packages for groups of computing devices | |
US9753826B2 (en) | Providing fault injection to cloud-provisioned machines | |
JP2023500228A (en) | ML-based event handling | |
US8713526B2 (en) | Assigning runtime artifacts to software components | |
CN106227654B (en) | A kind of test platform | |
WO2004021207A1 (en) | Systems and methods for improving service delivery | |
WO2014120192A1 (en) | Error developer association | |
CN113656245B (en) | Data inspection method and device, storage medium and processor | |
CN110489310B (en) | Method and device for recording user operation, storage medium and computer equipment | |
US8024706B1 (en) | Techniques for embedding testing or debugging features within a service | |
CN112068981B (en) | Knowledge base-based fault scanning recovery method and system in Linux operating system | |
CN115658529A (en) | Automatic testing method for user page and related equipment | |
CN115033419A (en) | Method and system for realizing hardware fault self-healing | |
CN103026337A (en) | Distillation and reconstruction of provisioning components | |
CN116610567A (en) | Early warning method and device for abnormal application program, processor and electronic equipment | |
KR20230062761A (en) | System hindrance integration management method | |
CN114297961A (en) | Chip test case processing method and related device | |
CN112631763A (en) | Program changing method and device of host program | |
CN110489256B (en) | Downtime positioning and repairing method and system | |
KR100496958B1 (en) | System hindrance integration management method | |
CN116880872A (en) | Cluster firmware combination upgrading method, system, terminal and storage medium | |
CN115544518A (en) | Vulnerability scanning engine implementation method and device, vulnerability scanning method and electronic equipment | |
CN115576816A (en) | Linux operating system-based android application function automatic testing method and device | |
CN114064510A (en) | Function testing method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||