WO2020001642A1 - Operation and maintenance system and method - Google Patents

Operation and maintenance system and method

Info

Publication number
WO2020001642A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
maintenance
module
evaluation
model
Prior art date
Application number
PCT/CN2019/093812
Other languages
English (en)
French (fr)
Inventor
刘丽霞
吉锋
文韬
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司
Priority to KR1020217001839A (patent KR102483025B1)
Priority to EP19826453.3A (patent EP3798846B1)
Priority to US17/256,618 (patent US11947438B2)
Publication of WO2020001642A1

Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06F ELECTRIC DIGITAL DATA PROCESSING
          • G06F11/00 Error detection; Error correction; Monitoring
            • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
              • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
                • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
            • G06F11/30 Monitoring
              • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
                • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
              • G06F11/3065 Monitoring arrangements determined by the means or processing involved in reporting the monitored data
              • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
                • G06F11/3409 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
                  • G06F11/3428 Benchmarking
                • G06F11/3447 Performance evaluation by modeling
                • G06F11/3452 Performance evaluation by statistical analysis
                • G06F11/3466 Performance evaluation by tracing or monitoring
                  • G06F11/3476 Data logging
                  • G06F11/3495 Performance evaluation by tracing or monitoring for systems
        • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N20/00 Machine learning
          • G06N5/00 Computing arrangements using knowledge-based models
            • G06N5/02 Knowledge representation; Symbolic representation
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
      • Y04 INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
        • Y04S SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
          • Y04S10/00 Systems supporting electrical power generation, transmission or distribution
            • Y04S10/50 Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Definitions

  • the embodiments of the present application relate to, but are not limited to, an operation and maintenance system and method.
  • Methods in the operation and maintenance field fall into the following types:
  • In the first, the operation and maintenance engineer retrieves and views the logs and uses the log level (such as troubleshooting, warning, error, information, and fatal error (fatal)) or the error code (such as 400, ORA-01500, and other specific codes), combined with rich operation and maintenance experience, to quickly locate the fault.
  • This method is relatively effective in mature, stable small enterprises. In today's large, complex software clusters, to which new software is constantly being added, it falls short because of the huge volume of log data, the variety of log types, and the need for efficient and effective operation and maintenance.
  • The other is the log analysis tool. This kind of tool is mainly based on analyzing the user's operation logs.
  • An embodiment of the present application provides an operation and maintenance system, including: an interconnected data acquisition module, a data storage module, an abnormality and fault labeling module, a model automatic training and evaluation module, an operation and maintenance management and task execution module, and a result review module;
  • the data collection module is configured to collect a variety of log source data required by the operation and maintenance system and store the multiple log source data in a data storage module;
  • the data storage module is configured to store the log source data, operation and maintenance results, labeling results, models, and knowledge bases;
  • the abnormality and fault labeling module is configured to continuously label part of the source data in the data storage module for abnormalities and faults, and store the labeling results in the data storage module;
  • the model automatic training and evaluation module is configured to continuously generate and update multiple operation and maintenance models and knowledge bases, and store the multiple operation and maintenance models and knowledge bases in the data storage module;
  • the operation and maintenance management and task execution module is configured to set up and execute operation and maintenance tasks, call the operation and maintenance models and knowledge bases, and store and output the operation and maintenance results.
  • An embodiment of the present application further provides an operation and maintenance method, which includes: a data collection module collects multiple log source data required by the operation and maintenance system, and stores the multiple log source data in a data storage module; the data storage module stores the log source data, operation and maintenance results, labeling results, models, and knowledge bases; the abnormality and fault labeling module continuously performs abnormality and fault labeling on part of the source data in the data storage module, and stores the labeling results in the data storage module; the model automatic training and evaluation module continuously generates and updates multiple operation and maintenance models and knowledge bases, and stores them in the data storage module; the operation and maintenance management and task execution module sets up and executes operation and maintenance tasks, calls the operation and maintenance models and knowledge bases, and stores and outputs the operation and maintenance results; the result audit module audits the abnormalities and faults output by the operation and maintenance management and task execution module, and outputs the abnormalities and faults confirmed by the audit to the abnormality and fault labeling module.
  • a data collection module collects multiple log source data required by the operation and maintenance system, and stores the multiple log source data in a data storage module.
  • FIG. 1 is a schematic structural diagram of an operation and maintenance system provided by Embodiment 1 of the present invention
  • FIG. 2 is a schematic structural diagram of an operation and maintenance system in related technologies
  • FIG. 3 is a schematic structural diagram of an operation and maintenance system provided by Embodiment 2 of the present invention.
  • FIG. 4 is a schematic flowchart of an operation and maintenance method provided by Embodiment 3 of the present invention.
  • FIG. 5 is a schematic flowchart of an operation and maintenance method provided in Embodiment 4 of the present invention.
  • embodiments of the present invention provide a new operation and maintenance system and method that ensure the entire operation and maintenance system can update adaptively, improve itself, and evolve, gradually improving the efficiency of operation and maintenance.
  • FIG. 1 is a schematic structural diagram of an operation and maintenance system provided by Embodiment 1 of the present invention.
  • the operation and maintenance system includes: an interconnected data acquisition module, a data storage module, an abnormality and fault labeling module, a model automatic training and evaluation module, an operation and maintenance management and task execution module, and a result review module.
  • the data collection module is configured to collect multiple log source data required by the operation and maintenance system, and store the multiple log source data in a data storage module.
  • the data storage module is configured to store the log source data, operation and maintenance results, labeling results, models, and knowledge base.
  • the abnormality and fault labeling module is configured to continuously perform abnormality and fault labeling on part of the source data in the data storage module, and store the labeling results in the data storage module.
  • the model automatic training and evaluation module is configured to continuously generate and update multiple operation and maintenance models and knowledge bases, and store the multiple operation and maintenance models and knowledge bases to a data storage module.
  • the operation and maintenance management and task execution module is configured to set and perform the operation and maintenance tasks, call the operation and maintenance model and knowledge base, and store and output the operation and maintenance results.
  • the result audit module is set to audit the abnormalities and faults output by the operation and maintenance management and task execution module, and output the abnormalities and faults after the audit confirmation to the abnormality and fault labeling module.
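The feedback loop formed by these six modules can be sketched in a few lines. This is only an illustration under stated assumptions: `DataStore`, `run_cycle`, the toy "model" (a set of confirmed fault messages), and the `"ERROR"` substring rule are all hypothetical and do not come from the application.

```python
from dataclasses import dataclass, field


@dataclass
class DataStore:
    """Hypothetical stand-in for the data storage module."""
    logs: list = field(default_factory=list)
    labels: list = field(default_factory=list)
    models: dict = field(default_factory=dict)
    results: list = field(default_factory=list)


def collect(store, new_logs):
    # Data collection module: store newly collected log lines.
    store.logs.extend(new_logs)


def train(store):
    # Model training and evaluation module (toy version): the "model" is
    # simply the set of messages that reviewers have confirmed as faults.
    store.models["known_faults"] = set(store.labels)


def execute(store):
    # Task execution module: flag known faults plus any line containing "ERROR".
    known = store.models.get("known_faults", set())
    store.results = [m for m in store.logs if m in known or "ERROR" in m]
    return store.results


def review(flagged, reviewer):
    # Result audit module: keep only what the reviewer confirms.
    return [m for m in flagged if reviewer(m)]


def run_cycle(store, new_logs, reviewer):
    """One pass of collect -> train -> execute -> review -> label."""
    collect(store, new_logs)
    train(store)
    confirmed = review(execute(store), reviewer)
    store.labels.extend(confirmed)  # labeling module receives confirmed faults
    return confirmed
```

Each call to `run_cycle` feeds reviewer-confirmed faults back into the labels, so the next call trains on a larger labeled set.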
  • the multiple log source data collected by the data acquisition module includes application system logs, operating system resource status logs, abnormal log data, streaming log data, detailed operation and maintenance records, and third-party labeled data;
  • for the application system logs, operating system resource status logs, detailed operation and maintenance records, and third-party labeled data, the data acquisition module adopts the data acquisition mode of periodic scanning and batch transmission; for the abnormal log data and streaming log data, the data acquisition module adopts the data acquisition mode of real-time acquisition and real-time transmission.
  • the source data labeled by the abnormality and fault labeling module includes the detailed operation and maintenance records and third-party labeled data stored in the data storage module, the abnormalities and faults output by the result review module after review and confirmation, and the data that the model automatic training and evaluation module uses for training, testing, and verification.
  • the method for performing the abnormality and fault labeling by the abnormality and fault labeling module includes: manual, semi-manual, semi-supervised learning, and transfer learning.
  • in the manual mode, detailed operation and maintenance records extracted from the data storage module are labeled by fault occurrence module, fault type, and fault cause.
  • in the semi-manual mode, the abnormalities and faults output by the result audit module after audit and confirmation are labeled by fault occurrence module, fault type, and fault cause.
  • in the semi-supervised learning mode, a semi-supervised learning algorithm uses some already labeled samples to label the data that the model automatic training and evaluation module uses for training, testing, and verification.
  • in the transfer learning mode, transfer learning technology is used to learn from similar third-party labeled data and generate the labeled data required by the operation and maintenance system.
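The application does not name a particular semi-supervised algorithm. As one possible illustration, the sketch below uses nearest-neighbor self-training: the unlabeled sample closest to any labeled sample takes that sample's label, and the process repeats until no remaining sample is within `max_dist`. The function name, the feature vectors, and the distance cutoff are all assumptions made for this example.

```python
import math


def self_train(labeled, unlabeled, max_dist=1.0):
    """Nearest-neighbor self-training (illustrative, not the patented method).

    labeled:   list of (vector, label) pairs
    unlabeled: list of vectors
    Returns (labeled_after, still_unlabeled).
    """
    labeled = list(labeled)
    pending = list(unlabeled)
    while pending:
        best = None  # (distance, index into pending, label to assign)
        for i, x in enumerate(pending):
            for v, y in labeled:
                d = math.dist(x, v)
                if best is None or d < best[0]:
                    best = (d, i, y)
        # Stop when there is no labeled seed or the closest pair is too far.
        if best is None or best[0] > max_dist:
            break
        _, i, y = best
        labeled.append((pending.pop(i), y))
    return labeled, pending
```

Samples that stay unlabeled (too far from every labeled point) are natural candidates for the manual or semi-manual modes.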
  • the manner in which the model automatic training and evaluation module generates and updates multiple operation and maintenance models and knowledge bases includes: real-time data processing modeling and evaluation, batch data processing modeling and evaluation.
  • the real-time data processing modeling and evaluation refers to extracting real-time log data from the real-time database in the data storage module, processing the data according to the real-time task requirements (sorting the data in chronological order and slicing it into specific time windows), and then using simple relationship determination and statistical analysis to find and extract abnormal patterns.
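The real-time steps (chronological sort, time-window slicing, simple statistical check) might look like the following sketch. The event format `(timestamp_seconds, level)`, the 60-second window, and the static threshold of 3 errors per window are assumptions chosen for illustration.

```python
from collections import Counter


def window_error_counts(events, window_seconds):
    """Count ERROR events per time window.

    events: iterable of (timestamp_seconds, level) tuples, in any order.
    Returns {window_start_seconds: error_count}.
    """
    ordered = sorted(events, key=lambda e: e[0])  # chronological order
    counts = Counter()
    for ts, level in ordered:
        start = int(ts // window_seconds) * window_seconds  # window slicing
        counts[start] += 1 if level == "ERROR" else 0
    return dict(counts)


def abnormal_windows(events, window_seconds=60, static_threshold=3):
    """Return the start times of windows whose error count exceeds the threshold."""
    counts = window_error_counts(events, window_seconds)
    return sorted(start for start, n in counts.items() if n > static_threshold)
```

The threshold comparison stands in for the "simple relationship determination" step; any per-window statistic could be substituted.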
  • the batch data processing modeling and evaluation means that, depending on the operation and maintenance task and on how well labeled data has been prepared, single model training and evaluation, integrated model training and evaluation, or incremental model training and evaluation is selected and applied to the abnormal patterns found and extracted, generating multiple operation and maintenance models and knowledge bases.
  • the single model training and evaluation includes: selecting a suitable algorithm from the three types of supervised model, unsupervised model, and semi-supervised model according to the task type, the abnormality and fault labeling data, and the training, test, and verification data, and then training and evaluating it to generate a single model.
  • the integrated model training and evaluation includes: when a single model training and evaluation result is unstable, using multiple single models to adopt a suitable integration mode to obtain a stable optimal result.
  • the incremental model training and evaluation includes: after new log data arrives, performing model parameter updates, model retraining, and evaluation updates on the existing operation and maintenance models.
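The text leaves the "suitable integration mode" open. One common choice is majority voting over several single models, sketched below. The three rule-based "models", their thresholds, and the sample fields are hypothetical.

```python
from collections import Counter


def cpu_rule(sample):
    return "fault" if sample["cpu"] > 0.9 else "normal"


def error_rule(sample):
    return "fault" if sample["errors"] > 10 else "normal"


def latency_rule(sample):
    return "fault" if sample["latency_ms"] > 500 else "normal"


def majority_vote(models, sample):
    """Integrated prediction: the label returned by most single models."""
    votes = Counter(model(sample) for model in models)
    label, _count = votes.most_common(1)[0]
    return label
```

Using an odd number of models avoids ties in a two-class setting; with ties, `most_common` returns the label that was seen first.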
  • multiple operation and maintenance models and knowledge bases include: real-time models, general models, knowledge bases, and incremental models.
  • the real-time model includes simple abnormality rules and static threshold parameters found in the log data in real-time computing scenarios.
  • the general model includes the algorithms and corresponding parameters formed after single model training and evaluation in batch computing scenarios, dynamic thresholds that change with time and data, and the integrated framework and corresponding algorithms and parameters formed after integrated model training and evaluation.
  • the knowledge base includes: complex rules, association relationships, link propagation maps, knowledge maps, and fault trees found during the comprehensive training and evaluation of the model.
  • the incremental model includes model parameter adjustments and model type adjustments made by existing operation and maintenance models to adapt to new data, including both single-model increments and integrated model increments.
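To make the contrast between a static threshold (real-time model) and a dynamic threshold that changes with time and data (general model) concrete, here is a minimal sketch. The rule "mean plus k standard deviations over a sliding window" is an assumption; the application does not specify how dynamic thresholds are computed.

```python
from collections import deque
from statistics import mean, pstdev


class DynamicThreshold:
    """Flags a value that exceeds mean + k * std of the recent history."""

    def __init__(self, window=5, k=3.0):
        self.history = deque(maxlen=window)  # sliding window of recent values
        self.k = k

    def is_abnormal(self, value):
        # With fewer than two past values there is no spread to compare against.
        if len(self.history) < 2:
            abnormal = False
        else:
            limit = mean(self.history) + self.k * pstdev(self.history)
            abnormal = value > limit
        self.history.append(value)
        return abnormal
```

Because the limit is recomputed from the window on every call, it follows gradual shifts in the metric, which a fixed static threshold cannot do.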
  • FIG. 2 is a schematic structural diagram of an operation and maintenance system in the related art
  • FIG. 3 is a schematic structural diagram of an operation and maintenance system provided in Embodiment 2 of the present application.
  • the operation and maintenance system includes: a data acquisition module, a data storage module, a model automatic training and evaluation module, and an operation and maintenance management and task execution module; it also includes an abnormality and fault labeling module and a result review module.
  • the data collection module is configured to collect various log source data required by the intelligent operation and maintenance system, and store the multiple log source data in the data storage module.
  • the data collection module mainly implements the collection of log data of various types and forms: in addition to the common batch collection of application system logs, collection of operating system resource status logs, and real-time collection of abnormal logs, it adds the collection of streaming log data, of detailed operation and maintenance record tables, and of third-party labeled data.
  • the data collection module runs a data collection interface configuration wizard separately, and the wizard starts a corresponding collection scheme according to the rate and data type of the data to be collected.
  • the wizard presets three different collection schemes: 1) for abnormal log data and streaming log data, the data collection mode is real-time collection and real-time transmission, and the collected data is transferred directly to the real-time in-memory database in the data storage module; 2) for detailed operation and maintenance records and third-party labeled data, the data collection mode is periodic scanning and batch transmission, and the collected data is transferred directly to the distributed database in the data storage module; 3) for application system logs and operating system resource status logs, the data collection mode is likewise periodic scanning and batch transmission, but the collected data is stored directly in the distributed file system in the data storage module.
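The three preset schemes amount to a lookup from data type to a (collection mode, target store) pair. The sketch below encodes that lookup; the identifier strings and the `select_scheme` helper are hypothetical, and the sketch keys only on data type, whereas the text says the wizard also considers the data rate.

```python
from enum import Enum


class Mode(Enum):
    REALTIME = "real-time collection, real-time transmission"
    BATCH = "periodic scanning, batch transmission"


# data type -> (collection mode, target store), following the three schemes
SCHEMES = {
    "abnormal_log": (Mode.REALTIME, "in_memory_database"),
    "streaming_log": (Mode.REALTIME, "in_memory_database"),
    "om_detail_record": (Mode.BATCH, "distributed_database"),
    "third_party_labeled": (Mode.BATCH, "distributed_database"),
    "application_log": (Mode.BATCH, "distributed_file_system"),
    "os_resource_status_log": (Mode.BATCH, "distributed_file_system"),
}


def select_scheme(data_type):
    """Return (mode, target store) for a data type, or raise ValueError."""
    try:
        return SCHEMES[data_type]
    except KeyError:
        raise ValueError(f"no collection scheme for data type: {data_type!r}") from None
```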
  • the data storage module is configured to store data necessary for the system such as log source data, operation and maintenance results, labeling results, models, and knowledge bases.
  • the data storage module mainly implements storage of log data, abnormal and fault labeling results, multiple models and knowledge bases, and so on.
  • the data storage module mainly stores log source data, log analysis results, abnormal and fault labeling results, multiple models and knowledge bases. Result data in the other modules can also be stored as needed.
  • this module adds the storage of the abnormal and fault labeling results required by the intelligent operation and maintenance system, the storage of the model and the knowledge base.
  • the different kinds of data can be stored in a distributed, classified manner according to data type, data form, and data collection rate: unstructured and semi-structured source data such as application system logs and operating system resource status logs can be stored in the distributed file system; detailed operation and maintenance records and third-party labeled data can optionally be stored in the distributed database; and abnormal and streaming data collected in real time can be stored in the in-memory database first and then, as needed, transferred to the distributed file system or the distributed database.
  • the abnormality and fault labeling module is configured to continuously perform abnormality and fault labeling on part of the source data in the data storage module, and store the labeling result in the data storage module.
  • the part of the source data used by the abnormality and fault labeling module includes detailed operation and maintenance records, confirmed abnormalities and faults output by the result review module, third-party labeled data, and the data that the model automatic training and evaluation module uses for training, testing, and verification.
  • the abnormality and fault labeling module supports four labeling methods: manual, semi-manual, semi-supervised learning, and transfer learning. Manual means extracting detailed operation and maintenance records from the data storage module on demand and labeling the data by fault occurrence module, fault type, and fault cause. Semi-manual means taking the manually confirmed abnormalities and faults output by the result review module and labeling them by fault occurrence module, fault type, and fault cause. Semi-supervised learning means using a semi-supervised learning algorithm and some labeled samples to label other unlabeled data (the data that the model automatic training and evaluation module uses for training, testing, and verification). Transfer learning means using transfer learning technology to learn from similar third-party labeled data and generate the labeled data required by the operation and maintenance system (likewise used by the model automatic training and evaluation module for training, testing, and verification).
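A labeling result carries three fields named in the text (fault occurrence module, fault type, fault cause) plus, usefully, which of the four methods produced it. The record layout below is a hypothetical sketch; the class and field names are not taken from the application.

```python
from dataclasses import dataclass
from enum import Enum


class LabelSource(Enum):
    MANUAL = "manual"
    SEMI_MANUAL = "semi-manual"
    SEMI_SUPERVISED = "semi-supervised learning"
    TRANSFER = "transfer learning"


@dataclass(frozen=True)
class FaultLabel:
    record_id: str       # hypothetical key of the labeled log or record
    fault_module: str    # module in which the fault occurred
    fault_type: str
    fault_cause: str
    source: LabelSource  # which labeling method produced this label


def label_from_review(record_id, fault_module, fault_type, fault_cause):
    """Semi-manual path: a reviewer-confirmed fault becomes a stored label."""
    return FaultLabel(record_id, fault_module, fault_type, fault_cause,
                      LabelSource.SEMI_MANUAL)
```

Keeping the source on each label lets later training steps weight or filter machine-generated labels separately from human-confirmed ones.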
  • the model automatic training and evaluation module is configured to continuously generate and update multiple operation and maintenance models and knowledge bases, and store the multiple operation and maintenance models and knowledge bases to a data storage module.
  • the model automatic training and evaluation module continuously generates and updates a variety of operation and maintenance models and knowledge bases.
  • the generation and update methods include: real-time data processing modeling and evaluation, batch data processing modeling and evaluation.
  • Real-time data processing modeling and evaluation extracts real-time log data from the real-time database in the data storage module and processes the data according to the real-time task requirements, for example by sorting the data in chronological order and slicing it into specific time windows, and then uses simple relationship determination, statistical analysis, and the like to find and extract abnormal patterns.
  • Batch data processing modeling and evaluation depends on the operation and maintenance task and on how well labeled data has been prepared. The available training and evaluation methods are single model training and evaluation, integrated model training and evaluation, and incremental model training and evaluation.
  • the main goal of the model automatic training and evaluation module is to generate and update the real-time models, general models, knowledge bases, and incremental models that the operation and maintenance management and task execution modules need to call when performing automatic abnormal discovery, rapid fault location, and early warning of faults.
  • the model automatic training and evaluation module is further divided into four sub-modules: data processing, single model training and evaluation, integrated model training and evaluation, and incremental model training and evaluation. Each sub-module has a different role and function. Depending on the operation and maintenance task and the data quality, different methods in each sub-module are selected for data preprocessing, model training, and model evaluation.
  • the single model training and evaluation includes: selecting an appropriate algorithm from the three types of supervised model, unsupervised model, and semi-supervised model according to the task type, the abnormality and fault labeling data, and the training, test, and verification data. For example, in abnormal pattern discovery the abnormal patterns are varied but occur infrequently, so samples are few and unsupervised models are generally the main choice; fault location and fault early warning generally rely on supervised models, supplemented by semi-supervised models.
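The selection guidance in this paragraph can be written down as a small lookup. The task-type strings and the helper name are hypothetical; the mapping itself only restates the preferences given in the text.

```python
# Preferred model families per task type, in order of preference.
PREFERRED_FAMILIES = {
    "abnormal_pattern_discovery": ["unsupervised"],
    "fault_location": ["supervised", "semi_supervised"],
    "fault_early_warning": ["supervised", "semi_supervised"],
}


def candidate_families(task_type):
    """Return a copy of the preferred model families for a task type."""
    try:
        return list(PREFERRED_FAMILIES[task_type])
    except KeyError:
        raise ValueError(f"unknown task type: {task_type!r}") from None
```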
  • the integrated model training and evaluation includes: when a single model training and evaluation result is unstable, a plurality of single models are adopted with an appropriate integration mode to obtain a stable and optimal result.
  • the incremental model training and evaluation includes update operations on existing operation and maintenance models after new log data arrives, such as updating model parameters, retraining, and re-evaluating.
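As a small example of updating model parameters when new data arrives without recomputing over the full history, the sketch below maintains a running mean and standard deviation with Welford's online algorithm. Treating these two statistics as the "model parameters" is an assumption made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Mean and population standard deviation, updated one value at a time."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        # Welford's update: no past values need to be stored or revisited.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        return math.sqrt(self.m2 / self.n) if self.n else 0.0
```

A full retraining would instead recompute the statistics from all stored values; the incremental form gives the same result at constant cost per new value.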
  • the various operation and maintenance models and knowledge bases include: real-time models, general models, knowledge bases, and incremental models.
  • the real-time model includes: simple anomaly rules, static threshold parameters, and the like used to find log data in real-time computing scenarios.
  • the general model includes: the algorithms and corresponding parameters formed after single model training and evaluation in batch computing scenarios, dynamic thresholds that change with time and data, and the like, as well as the integrated framework and corresponding algorithms and parameters formed after integrated model training and evaluation.
  • the knowledge base includes: complex rules, association relationships, link propagation diagrams, knowledge maps, fault trees, and the like found in the various stages of comprehensive training and evaluation of the operation and maintenance models. This part of the model can be applied directly to real-time log data for real-time anomaly detection, and can also be applied to batch log data to predict failures in advance.
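Of the knowledge-base items listed, a fault tree is the easiest to show in code. The sketch below evaluates a tree of AND/OR gates over a set of observed basic events. The tuple encoding and the example tree are hypothetical and only illustrate the general idea of a fault tree, not the structure used in the application.

```python
def evaluate(node, observed):
    """Evaluate a fault-tree node against a set of observed basic events.

    node is ("event", name), ("and", [children]) or ("or", [children]).
    """
    kind, payload = node
    if kind == "event":
        return payload in observed
    results = [evaluate(child, observed) for child in payload]
    if kind == "and":
        return all(results)
    if kind == "or":
        return any(results)
    raise ValueError(f"unknown node kind: {kind!r}")


# Hypothetical tree: the service is down if the process crashed, or if the
# disk is full and log writes are failing at the same time.
SERVICE_DOWN = ("or", [
    ("event", "process_crashed"),
    ("and", [("event", "disk_full"), ("event", "log_write_failed")]),
])
```

Applied to batch data, the same tree can flag a developing fault when only some inputs of an AND gate have been observed so far.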
  • the incremental model includes the model parameter adjustments, model type adjustments, and the like by which existing operation and maintenance models adapt to new data, covering both single-model increments and integrated-model increments.
  • the model automatic training and evaluation module is started by the operation and maintenance management and task execution module as needed and runs in one of two classes: a) real-time data processing, modeling, and evaluation: real-time log data is extracted from the real-time database in the data storage module and processed according to real-time task requirements, for example by sorting the data in chronological order and slicing it into specific time windows, and simple relationship determination and statistical analysis are then used to discover and extract abnormal patterns; b) batch data processing, modeling, and evaluation: depending on the operation and maintenance task and on how well labeled data has been prepared, the available training and evaluation methods are single model training and evaluation, integrated model training and evaluation, and incremental model training and evaluation.
  • single model training and evaluation mainly selects appropriate algorithms from the three types of supervised model, unsupervised model, and semi-supervised model based on the task type and the abnormality and fault labeling data. For abnormal pattern discovery, the occurrence frequency is not high and samples are few, so unsupervised models are generally the main choice; fault location and fault early warning generally rely on supervised models, supplemented by semi-supervised models. Integrated model training and evaluation compensates for a single model whose results are unstable when task types vary: multiple models are integrated to obtain stable and optimal results. Incremental model training and evaluation handles continuously arriving new log data, so that the existing operation and maintenance models are kept up to date.
  • the results of automatic training and evaluation of the model are stored in the data storage module in the form of models and knowledge bases.
  • the models and knowledge bases are divided into the following four types according to their application scenarios when stored: a) real-time models, which are mainly the simple abnormality rules, static threshold parameters, and the like found in log data in real-time computing scenarios; b) general models, which are mainly the algorithms and corresponding parameters formed after single model training and evaluation in batch computing scenarios, dynamic thresholds that change with time and data, and the like, together with the integrated framework and corresponding algorithms and parameters formed after integrated model training and evaluation; c) the knowledge base category, which is mainly the complex rules, association relationships, link propagation diagrams, knowledge maps, fault trees, and the like found in the various stages of comprehensive training and evaluation of operation and maintenance models; this part of the model can be applied directly to real-time log data for real-time anomaly detection, and can also be applied to batch log data to predict failures in advance; d) incremental models: the models produced by single model training and evaluation and by integrated model training and evaluation are evaluated for incremental performance, and those with good incremental performance are kept separately as incremental models, so that the entire intelligent operation and maintenance system can adapt to new data. When an incremental model is called, it is selected according to whether it recalculates over all data or only over the incremental data.
  • the operation and maintenance management and task execution module provides unified management of the operation and maintenance system and its tasks: log query and key performance indicator (KPI) monitoring task execution and result display; manual fault location and result display; exception rule filtering execution and result display; static threshold setting and execution result display; calling of models for automatic abnormality discovery and result display; calling of models for fast fault location and result display; calling of models for fault early warning and result display; starting the model automatic training and evaluation module and managing its results; and classification management and updating of multiple models.
  • Compared with ordinary operation and maintenance systems, the calling of models for automatic abnormality discovery, fast fault location, and fault early warning (each with result display), the starting of the model automatic training and evaluation module and management of its results, and the classification management and updating of multiple models are all new functions of this module.
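  • In an embodiment, the operation and maintenance management and task execution module on the one hand provides functions such as log query, monitoring of multiple KPIs, anomaly discovery, and early fault warning, depending on the system configuration and on which models are available to call. On the other hand, it tracks the results of KPI monitoring, of exception rule filtering, of threshold violations, and of automatic abnormal pattern discovery, and, depending on the newly found abnormality and fault data and on how the existing abnormality and fault data has been labeled, calls one or more of the models or knowledge bases generated by the model automatic training and evaluation module to locate faults quickly and give the corresponding results.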
  • the operation and maintenance management and task execution module is responsible for the output of the operation and maintenance results.
  • The operation and maintenance management and task execution module keeps monitoring the collection of new log data and starts, in turn, the abnormality and fault labeling module and the model automatic training and evaluation module, which generate new models and knowledge bases or update the existing ones. Operation and maintenance tasks and result reviews are then executed iteratively, so that the operation and maintenance capability of the system updates itself, iterates, and evolves.
  • the result audit module is set to audit the abnormalities and faults output by the operation and maintenance management and task execution modules, and output the abnormalities and faults after the audit confirmation to the abnormality and fault labeling module.
  • The result review module is mainly responsible for manually reviewing and confirming the operation and maintenance results generated by the operation and maintenance management module.
  • The abnormalities and faults confirmed as valid are passed to the abnormality and fault labeling module, where they serve as one way of labeling data and continuously expand and accumulate the labeled data.
  • The technical solution provided in the second embodiment of the present application can efficiently perform automatic anomaly discovery, rapid fault location, and early fault warning when log data comes in many types and forms and operation and maintenance requirements are complex, and the entire intelligent operation and maintenance system can update adaptively, iterate on itself, and gradually evolve.
  • FIG. 4 is a schematic flowchart of an operation and maintenance method provided in Embodiment 3 of the present application. As shown in FIG. 4, the operation and maintenance method includes steps 401 to 406.
  • In step 401, the data collection module collects the multiple kinds of log source data required by the operation and maintenance system, and stores them in the data storage module.
  • In step 402, the data storage module stores the log source data, operation and maintenance results, labeling results, models, and knowledge bases.
  • In step 403, the abnormality and fault labeling module continuously labels abnormalities and faults in part of the source data in the data storage module, and stores the labeling results in the data storage module.
  • In step 404, the model automatic training and evaluation module continuously generates and updates multiple operation and maintenance models and knowledge bases, and stores them in the data storage module.
  • In step 405, the operation and maintenance management and task execution module sets up and executes operation and maintenance tasks, calls the operation and maintenance models and knowledge bases, and stores and outputs the operation and maintenance results.
  • In step 406, the result review module reviews the abnormalities and faults output by the operation and maintenance management and task execution module, and outputs the abnormalities and faults confirmed by the review to the abnormality and fault labeling module.
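Steps 401 to 406 form a loop in which reviewed findings flow back into the labeled data. A minimal sketch of one pass is given below; the function `run_cycle`, the dict-based store, and the callables passed in are all hypothetical stand-ins for the modules, not interfaces defined by the application.

```python
def run_cycle(store, collect, label, train, execute, review):
    """Run one pass through steps 401-406 over a shared dict-based store."""
    # Steps 401/402: collect log source data and persist it.
    store.setdefault("logs", []).extend(collect())
    # Step 403: label part of the stored source data.
    store.setdefault("labels", []).extend(label(store["logs"]))
    # Step 404: (re)generate models and knowledge bases from data and labels.
    store["models"] = train(store["logs"], store["labels"])
    # Step 405: execute operation and maintenance tasks with the models.
    results = execute(store["models"], store["logs"])
    store.setdefault("results", []).extend(results)
    # Step 406: reviewed and confirmed findings become new labeled data.
    store["labels"].extend(review(results))
    return store
```

Calling `run_cycle` repeatedly with the same store mimics the iterative behaviour described above: each confirmed abnormality enlarges the label set that the next training step sees.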
  • In an embodiment, the multiple kinds of log source data collected by the data collection module include application system logs, operating system resource status logs, abnormal log data, streaming log data, detailed operation and maintenance records, and third-party labeled data.
  • For application system logs and operating system resource status logs, the data collection module uses periodic scanning and batch transmission; for abnormal log data and streaming log data, it uses real-time collection and real-time transmission.
  • For detailed operation and maintenance records and third-party labeled data, the data collection module uses periodic scanning and batch transmission.
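The pairing of log source types with collection modes can be written down as a small lookup table. The sketch below is illustrative only; the key strings and the function name `collection_mode` are hypothetical identifiers chosen for this example.

```python
# Collection mode per log source type, as listed in the description.
# The identifier strings are hypothetical names for this sketch.
COLLECTION_MODE = {
    "application_system_log": "periodic_scan_batch",
    "os_resource_status_log": "periodic_scan_batch",
    "abnormal_log": "realtime",
    "streaming_log": "realtime",
    "om_detail_record": "periodic_scan_batch",
    "third_party_labeled_data": "periodic_scan_batch",
}


def collection_mode(source_type: str) -> str:
    """Return the collection mode for a source type, or raise for unknown types."""
    try:
        return COLLECTION_MODE[source_type]
    except KeyError:
        raise ValueError(f"unknown log source type: {source_type!r}") from None
```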
  • The source data that the abnormality and fault labeling module labels includes: the detailed operation and maintenance records and third-party labeled data stored in the data storage module, the abnormalities and faults output by the result review module after review and confirmation, and the data that the model automatic training and evaluation module uses for training, testing, and validation.
  • the method for performing the abnormality and fault labeling by the abnormality and fault labeling module includes: manual, semi-manual, semi-supervised learning, and transfer learning.
  • In the manual mode, detailed operation and maintenance records extracted on demand from the data storage module are labeled by the module in which the fault occurred, the fault type, and the fault cause.
  • In the semi-manual mode, the abnormalities and faults output by the result review module after review and confirmation are labeled by the module in which the fault occurred, the fault type, and the fault cause.
  • In the semi-supervised learning mode, a semi-supervised learning algorithm and a set of already labeled samples are used to label the data that the model automatic training and evaluation module uses for training, testing, and validation.
  • In the transfer learning mode, transfer learning is applied to similar third-party labeled data to generate the labeled data required by the operation and maintenance system.
  • the manner in which the model automatic training and evaluation module generates and updates multiple operation and maintenance models and knowledge bases includes: real-time data processing modeling and evaluation, batch data processing modeling and evaluation.
  • Real-time data processing modeling and evaluation means extracting real-time log data from the real-time database in the data storage module, processing the data according to the requirements of the real-time task (sorting the data in chronological order and segmenting it by a specific time window), and then using simple relationship determination and statistical analysis to find and extract abnormal patterns.
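The real-time path (sort by time, cut into windows, apply a simple rule) can be illustrated with a short sketch. It assumes events are `(timestamp_seconds, level)` tuples and uses a static per-window error-count threshold as the "simple" rule; both the event shape and the function names are assumptions made for this example.

```python
from collections import defaultdict


def window_counts(events, window_seconds):
    """Count error events per fixed time window.

    events: iterable of (timestamp_seconds, level) tuples (assumed shape).
    Returns {window_start: error_count}, with keys in chronological order.
    """
    counts = defaultdict(int)
    for ts, level in sorted(events):  # chronological order
        start = int(ts // window_seconds) * window_seconds
        counts[start] += level == "error"  # bool adds as 0 or 1
    return dict(counts)


def abnormal_windows(events, window_seconds, max_errors):
    """Return the start times of windows whose error count exceeds a static threshold."""
    return [
        start
        for start, n in window_counts(events, window_seconds).items()
        if n > max_errors
    ]
```

With 60-second windows and `max_errors=2`, a window containing three error records would be reported as abnormal, while windows with fewer errors pass silently.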
  • Batch data processing modeling and evaluation means that, depending on the operation and maintenance task and on how much labeled data has been prepared, single-model training and evaluation, integrated-model training and evaluation, or incremental-model training and evaluation is selected for the abnormal patterns found and extracted, generating multiple operation and maintenance models and knowledge bases.
  • Single-model training and evaluation includes: selecting a suitable algorithm from among supervised, unsupervised, and semi-supervised models according to the task type, the abnormality and fault labeling data, and the training, test, and validation data, then training and evaluating it to generate a single model.
  • Integrated-model training and evaluation includes: when the result of single-model training and evaluation is unstable, combining multiple single models in a suitable integration mode to obtain a stable, optimal result.
  • Incremental-model training and evaluation includes: after new log data arrives, performing update operations on the existing operation and maintenance models, such as updating model parameters, retraining the models, and re-evaluating them.
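The description leaves the "suitable integration mode" open. One common choice, used here purely as an illustration and not named in the application, is majority voting over several single detectors; the function name and the record fields in the usage note are hypothetical.

```python
from collections import Counter


def majority_vote(detectors, record):
    """Flag a record as abnormal only if most single detectors agree.

    detectors: callables that take a record and return a truthy value when
    they consider it abnormal. Ties count as "not abnormal".
    """
    votes = Counter(bool(detect(record)) for detect in detectors)
    return votes[True] > votes[False]
```

For instance, with three threshold detectors on CPU load, error count, and latency, a record that trips two of the three is flagged, which smooths out the instability of any single detector.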
  • various operation and maintenance models and knowledge bases include: real-time models, general models, knowledge bases, and incremental models.
  • The real-time model includes the simple exception rules and static threshold parameters used to find anomalies in log data in real-time computing scenarios.
  • The general model includes the algorithms and corresponding parameters formed after single-model training and evaluation in batch computing scenarios, the dynamic thresholds that change with time and data, and the integrated frameworks with their corresponding algorithms and parameters formed after integrated-model training and evaluation.
  • the knowledge base includes: complex rules, association relationships, link propagation maps, knowledge maps, and fault trees found during the comprehensive training and evaluation of the model.
  • the incremental model includes model parameter adjustments and model type adjustments made by existing operation and maintenance models to adapt to new data, including both single-model increments and integrated model increments.
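To make the contrast between a static threshold (real-time model) and a dynamic threshold (general model) concrete, the sketch below derives a threshold from recent history as mean plus `k` standard deviations. This particular formula is an assumption chosen for illustration; the application does not specify how dynamic thresholds are computed.

```python
from statistics import mean, pstdev


def dynamic_threshold(history, k=3.0):
    """Threshold that moves with the data: mean + k * population std of recent values."""
    return mean(history) + k * pstdev(history)


def exceeds(history, value, k=3.0):
    """True if value is above the threshold derived from the given history."""
    return value > dynamic_threshold(history, k)
```

Recomputing the threshold over a sliding history lets it follow gradual changes in load, whereas a static threshold stays fixed until someone edits it.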
  • Embodiment 3 of the present application is described in detail through a specific embodiment below.
  • FIG. 5 is a schematic flowchart of an operation and maintenance method provided in Embodiment 4 of the present application. As shown in FIG. 5, the operation and maintenance method includes steps 501 to 506.
  • In step 501, data is collected.
  • The data collection includes: 1. real-time collection of abnormal logs, which collects in real time the abnormal logs of important applications/operations in the working cluster; 2. collection of operating system resource status, and batch collection of application system logs; 4. collection of third-party labeled data, which compensates for the operation and maintenance system's shortage of abnormality and fault labeling data, mainly by using transfer learning to migrate similar external labeled data; 5. collection of the detailed operation and maintenance record table, which is used directly as abnormality and fault labeling data for the operation and maintenance system; 6. collection of streaming log data, which collects in real time streaming data such as transaction-type and real-time transmission/operation-type data in a big data environment.
  • In step 502, the data is stored.
  • The data storage includes: 1. log source data storage, generally in a file system; 2. log analysis result storage, generally in a database or data warehouse; 3. labeling result storage, which saves the labeling results generated by the abnormality and fault labeling module; 4. model and knowledge base storage, which saves the various models and knowledge bases generated by the model automatic training and evaluation module.
  • In step 503, abnormalities and faults are labeled.
  • The abnormality and fault labeling includes: 1. abnormal event labeling, which confirms the abnormal data collected in the system and labels the events that are truly abnormal; 2. fault type labeling, which labels the fault data collected in the system together with its fault type; 3. labeled data migration, which uses transfer learning to turn third-party labeled data into abnormality and fault labeling data that the system can use.
  • In step 504, models are automatically trained and evaluated.
  • Automatic model training and evaluation includes: 1. data preprocessing, which is responsible for data preparation in the model automatic training and evaluation module, including but not limited to sample data extraction, data parsing and format unification, feature extraction and construction, and handling of data imbalance; 2. single-model training and evaluation, in which, according to the current state of the log source data storage and the labeling result storage in the data storage module and the task type (automatic anomaly discovery / rapid fault location / early fault warning), one or more algorithms are selected from unsupervised, semi-supervised, and supervised model training and evaluation for training, testing, and evaluation, and the resulting algorithms and parameters, association relationships, link propagation, complex rules, knowledge graphs, fault trees, etc. are saved as models or knowledge bases in the model and knowledge base storage of the data storage module; 3. integrated-model training and evaluation, which, on the basis of single-model training and evaluation, may be selected to optimize the models further depending on their stability and evaluation results; 4. incremental-model training and evaluation, through which the existing models and knowledge bases are updated as the collected data keeps growing.
  • In step 505, operation and maintenance management and task execution are performed.
  • Operation and maintenance management and task execution include: execution and result display of log query and KPI monitoring tasks, manual fault location and result display, execution of exception rule filtering and result display, setting of static thresholds and display of the execution results, calling of the models for automatic anomaly discovery and result display, calling of the models for rapid fault location and result display, calling of the models for early fault warning and result display, starting of the model automatic training and evaluation module and management of its results, classified management and updating of multiple models, and so on.
  • In step 506, the results are reviewed.
  • The result review includes: reviewing the abnormality and fault related results from the operation and maintenance management and task execution module; on the one hand, all abnormalities and faults are output after the review, and on the other hand, the confirmed abnormalities and faults are passed to the abnormality and fault labeling module.
  • the operation and maintenance methods provided in the third and fourth embodiments can be applied to the operation and maintenance systems provided in the first and second embodiments.
  • Embodiments 3 and 4 of this application can efficiently perform automatic abnormal discovery, rapid fault location, and early warning of faults in the case of various types and forms of log data and complex operation and maintenance requirements, and the entire intelligent operation and maintenance system can Achieve adaptive update, self-iteration, and progressive evolution.
  • The term computer storage medium includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology arranged to store information such as computer-readable instructions, data structures, program modules, or other data.
  • Computer storage media include, but are not limited to, random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be arranged to store the desired information and can be accessed by a computer.
  • A communication medium typically contains computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transmission mechanism, and may include any information delivery medium.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Computer Hardware Design (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Debugging And Monitoring (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Test And Diagnosis Of Digital Computers (AREA)

Abstract

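An operation and maintenance system and method. The system includes the following interconnected modules: a data collection module, a data storage module, an abnormality and fault labeling module, a model automatic training and evaluation module, an operation and maintenance management and task execution module, and a result review module.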

Description

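Operation and maintenance system and method
This application claims priority to Chinese patent application No. 201810689427.5, filed with the Chinese Patent Office on June 28, 2018, the entire contents of which are incorporated herein by reference.
Technical Field
Embodiments of the present application relate to, but are not limited to, an operation and maintenance system and method.
Background
As cloud computing and big data technologies have matured, every industry has accumulated massive and varied data in practical use. Besides the data that application systems themselves need, there is log data from the related underlying storage media, network transmission, operating systems, databases and file systems, management systems, and so on. This data records normal operations, abnormal operations, the changes in the system before and after a fault, and the chain reactions of associated components; it is the basis on which operation and maintenance staff discover anomalies, delimit faults, analyze root causes, and predict faults. However, operation and maintenance data grows every moment, comes from components that are intricately interlinked, and is recorded in many forms. Faced with such data, the older approaches that relied on step-by-step manual troubleshooting, script-assisted location, log retrieval, simple statistical analysis, threshold monitoring, and the like can no longer meet the basic timeliness and functionality requirements of current operation and maintenance.
Current methods in the operation and maintenance field fall into the following categories. The first is manual experience: operation and maintenance engineers search and inspect logs by log level (such as debug, warning, error, info, fatal) or error code (specific codes such as 400 or ORA-01500) and, drawing on rich operation and maintenance experience, quickly pin down where the fault lies. This approach is fairly effective in mature, stable small enterprises, but in today's large, complex clusters where new software is continuously layered on, it falls short because the log volume is huge, log types are highly varied, and operation and maintenance must be both efficient and effective. The second is log analysis tools. These tools originally focused on analyzing user operation logs, supporting system optimization, precision marketing, and so on based on an understanding of users' operating habits and preferences, and were later extended to operation and maintenance. Their main function, however, is to collect, parse, and store logs in a unified way and then provide log retrieval, simple statistical analysis, and visualization (such as unique visitors (UV) and page views (PV)). With the development of cloud computing and big data, their underlying architecture has been updated, so they can meet needs such as fast retrieval, simple statistical analysis, and real-time monitoring of complex, varied, and massive logs, but they cannot meet advanced operation and maintenance needs such as automatic anomaly discovery, rapid fault location, and early fault warning.
How to make operation and maintenance intelligent by using artificial intelligence on top of cloud computing and big data is something that medium and large enterprises are actively exploring now and will continue to explore for a long time to come.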
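Summary
The following is an overview of the subject matter described in detail herein. This overview is not intended to limit the scope of protection of the claims.
An embodiment of the present application provides an operation and maintenance system, including the following interconnected modules: a data collection module, a data storage module, an abnormality and fault labeling module, a model automatic training and evaluation module, an operation and maintenance management and task execution module, and a result review module; wherein the data collection module is configured to collect the multiple kinds of log source data required by the operation and maintenance system and store them in the data storage module; the data storage module is configured to store the log source data, operation and maintenance results, labeling results, models, and knowledge bases; the abnormality and fault labeling module is configured to continuously label abnormalities and faults in part of the source data in the data storage module and store the labeling results in the data storage module; the model automatic training and evaluation module is configured to continuously generate and update multiple operation and maintenance models and knowledge bases and store them in the data storage module; the operation and maintenance management and task execution module is configured to set up and execute operation and maintenance tasks, call the operation and maintenance models and knowledge bases, and store and output the operation and maintenance results; and the result review module is configured to review the abnormalities and faults output by the operation and maintenance management and task execution module and output the abnormalities and faults confirmed by the review to the abnormality and fault labeling module.
An embodiment of the present application further provides an operation and maintenance method, including: a data collection module collects the multiple kinds of log source data required by an operation and maintenance system and stores them in a data storage module; the data storage module stores the log source data, operation and maintenance results, labeling results, models, and knowledge bases; an abnormality and fault labeling module continuously labels abnormalities and faults in part of the source data in the data storage module and stores the labeling results in the data storage module; a model automatic training and evaluation module continuously generates and updates multiple operation and maintenance models and knowledge bases and stores them in the data storage module; an operation and maintenance management and task execution module sets up and executes operation and maintenance tasks, calls the operation and maintenance models and knowledge bases, and stores and outputs the operation and maintenance results; and a result review module reviews the abnormalities and faults output by the operation and maintenance management and task execution module and outputs the abnormalities and faults confirmed by the review to the abnormality and fault labeling module.
Other aspects will become apparent upon reading and understanding the drawings and the detailed description.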
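Brief Description of the Drawings
The drawings are provided for a further understanding of the technical solution of the present application and form part of the specification. Together with the embodiments of the present application, they serve to explain the technical solution of the present application and do not limit it.
FIG. 1 is a schematic structural diagram of the operation and maintenance system provided in Embodiment 1 of the present invention;
FIG. 2 is a schematic structural diagram of an operation and maintenance system in the related art;
FIG. 3 is a schematic structural diagram of the operation and maintenance system provided in Embodiment 2 of the present invention;
FIG. 4 is a schematic flowchart of the operation and maintenance method provided in Embodiment 3 of the present invention;
FIG. 5 is a schematic flowchart of the operation and maintenance method provided in Embodiment 4 of the present invention.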
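Detailed Description
Embodiments of the present invention are described in detail below with reference to the drawings. It should be noted that, where no conflict arises, the embodiments of the present application and the features in the embodiments may be combined with one another arbitrarily.
The steps shown in the flowcharts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions. Although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in an order different from the one given here.
To make operation and maintenance intelligent by using artificial intelligence on top of cloud computing and big data, embodiments of the present invention provide a new operation and maintenance system and method that ensure the whole operation and maintenance system can update adaptively, improve itself, and gradually evolve, significantly improving operation and maintenance efficiency.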
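Embodiment 1
FIG. 1 is a schematic structural diagram of the operation and maintenance system provided in Embodiment 1 of the present invention. As shown in FIG. 1, the operation and maintenance system includes the following interconnected modules: a data collection module, a data storage module, an abnormality and fault labeling module, a model automatic training and evaluation module, an operation and maintenance management and task execution module, and a result review module.
The data collection module is configured to collect the multiple kinds of log source data required by the operation and maintenance system and store them in the data storage module.
The data storage module is configured to store the log source data, operation and maintenance results, labeling results, models, and knowledge bases.
The abnormality and fault labeling module is configured to continuously label abnormalities and faults in part of the source data in the data storage module and store the labeling results in the data storage module.
The model automatic training and evaluation module is configured to continuously generate and update multiple operation and maintenance models and knowledge bases and store them in the data storage module.
The operation and maintenance management and task execution module is configured to set up and execute operation and maintenance tasks, call the operation and maintenance models and knowledge bases, and store and output the operation and maintenance results.
The result review module is configured to review the abnormalities and faults output by the operation and maintenance management and task execution module and output the abnormalities and faults confirmed by the review to the abnormality and fault labeling module.
In this way, while basic operation and maintenance requirements are met, abnormalities and faults can be discovered and output automatically, ensuring that the whole operation and maintenance system can update adaptively, improve itself, and gradually evolve, which significantly improves operation and maintenance efficiency.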
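In an embodiment, the multiple kinds of log source data collected by the data collection module include: application system logs, operating system resource status logs, abnormal log data, streaming log data, detailed operation and maintenance records, and third-party labeled data.
For application system logs and operating system resource status logs, the data collection module uses periodic scanning and batch transmission; for abnormal log data and streaming log data, it uses real-time collection and real-time transmission; for detailed operation and maintenance records and third-party labeled data, it uses periodic scanning and batch transmission.
The source data that the abnormality and fault labeling module labels includes: the detailed operation and maintenance records and third-party labeled data stored in the data storage module, the abnormalities and faults output by the result review module after review and confirmation, and the data that the model automatic training and evaluation module uses for training, testing, and validation.
In an embodiment, the abnormality and fault labeling module labels abnormalities and faults in four modes: manual, semi-manual, semi-supervised learning, and transfer learning.
In the manual mode, detailed operation and maintenance records extracted on demand from the data storage module are labeled by the module in which the fault occurred, the fault type, and the fault cause.
In the semi-manual mode, the abnormalities and faults output by the result review module after review and confirmation are labeled by the module in which the fault occurred, the fault type, and the fault cause.
In the semi-supervised learning mode, a semi-supervised learning algorithm and a set of already labeled samples are used to label the data that the model automatic training and evaluation module uses for training, testing, and validation.
In the transfer learning mode, transfer learning is applied to similar third-party labeled data to generate the labeled data required by the operation and maintenance system.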
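In an embodiment, the model automatic training and evaluation module generates and updates the multiple operation and maintenance models and knowledge bases in two ways: real-time data processing modeling and evaluation, and batch data processing modeling and evaluation.
Real-time data processing modeling and evaluation means extracting real-time log data from the real-time database in the data storage module, processing the data according to the requirements of the real-time task (sorting the data in chronological order and segmenting it by a specific time window), and then using simple relationship determination and statistical analysis to find and extract abnormal patterns.
Batch data processing modeling and evaluation means that, depending on the operation and maintenance task and on how much labeled data has been prepared, single-model training and evaluation, integrated-model training and evaluation, or incremental-model training and evaluation is selected for the abnormal patterns found and extracted, generating multiple operation and maintenance models and knowledge bases.
Single-model training and evaluation includes: selecting a suitable algorithm from among supervised, unsupervised, and semi-supervised models according to the task type, the abnormality and fault labeling data, and the training, test, and validation data, then training and evaluating it to generate a single model.
Integrated-model training and evaluation includes: when the result of single-model training and evaluation is unstable, combining multiple single models in a suitable integration mode to obtain a stable, optimal result.
Incremental-model training and evaluation includes: after new log data arrives, performing update operations on the existing operation and maintenance models, namely updating model parameters, retraining the models, and re-evaluating them.
In an embodiment, the multiple operation and maintenance models and knowledge bases include: real-time models, general models, knowledge bases, and incremental models.
The real-time model includes the simple exception rules and static threshold parameters used to find anomalies in log data in real-time computing scenarios.
The general model includes the algorithms and corresponding parameters formed after single-model training and evaluation in batch computing scenarios, the dynamic thresholds that change with time and data, and the integrated frameworks with their corresponding algorithms and parameters formed after integrated-model training and evaluation.
The knowledge base includes the complex rules, association relationships, link propagation graphs, knowledge graphs, and fault trees found at each stage of the comprehensive training and evaluation of the models.
The incremental model includes the model parameter adjustments and model type adjustments that existing operation and maintenance models make to adapt to new data, covering both single-model increments and integrated-model increments.
The technical solution provided in Embodiment 1 of the present application is described in detail below through a specific embodiment.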
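Embodiment 2
FIG. 2 is a schematic structural diagram of an operation and maintenance system in the related art, and FIG. 3 is a schematic structural diagram of the operation and maintenance system provided in Embodiment 2 of the present application. As shown in FIGS. 2 and 3, the operation and maintenance system includes a data collection module, a data storage module, a model automatic training and evaluation module, and an operation and maintenance management and task execution module; it further includes an abnormality and fault labeling module and a result review module.
The data collection module is configured to collect the multiple kinds of log source data required by the intelligent operation and maintenance system and store them in the data storage module.
The data collection module mainly collects log data of many types and forms: in addition to the usual batch collection of application system logs, collection of operating system resource status logs, and real-time collection of abnormal logs, it adds collection of streaming log data, of the detailed operation and maintenance record table, and of third-party labeled data.
In an embodiment, the data collection module runs a separate data collection interface configuration wizard, which starts the appropriate collection scheme according to the rate and type of the data to be collected. The wizard presets three collection schemes: 1) for abnormal log data and streaming log data, real-time collection and real-time transmission are used, and the collected data is passed directly to the real-time in-memory database in the data storage module; 2) for detailed operation and maintenance records and third-party labeled data, periodic scanning and batch transmission are used, and the collected data is passed directly to the distributed database in the data storage module; 3) for application system logs and operating system resource status logs, periodic scanning and batch transmission are also used, but the collected data is stored directly in the distributed file system in the data storage module.
The data storage module is configured to store the data the system needs, such as log source data, operation and maintenance results, labeling results, models, and knowledge bases.
The data storage module mainly stores log data, abnormality and fault labeling results, the various models and knowledge bases, and so on.
In an embodiment, the data storage module mainly stores log source data, log analysis results, abnormality and fault labeling results, and the various models and knowledge bases; intermediate result data from other modules may also be stored here as needed. Compared with an ordinary operation and maintenance system, this module adds the storage of abnormality and fault labeling results and of models and knowledge bases that an intelligent operation and maintenance system requires. The different kinds of data can be stored in a distributed, classified manner according to data type, data form, and collection rate: for example, unstructured and semi-structured source data from application system logs and operating system resource status logs can be stored in a distributed file system; detailed operation and maintenance records and third-party labeled data can be stored in a distributed database; and abnormal and streaming data collected in real time can first be stored in an in-memory database and then, as needed, transferred to the distributed file system or the distributed database.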
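The abnormality and fault labeling module is configured to continuously label abnormalities and faults in part of the source data in the data storage module and store the labeling results in the data storage module.
The part of the source data handled by the abnormality and fault labeling module includes: detailed operation and maintenance records, the confirmed abnormalities and faults output by the result review module, third-party labeled data, and the data that the model automatic training and evaluation module uses for training, testing, and validation.
The abnormality and fault labeling in this module uses four modes: manual, semi-manual, semi-supervised learning, and transfer learning. Specifically: in the manual mode, detailed operation and maintenance records are extracted on demand from the data storage and labeled by the module in which the fault occurred, the fault type, and the fault cause; in the semi-manual mode, the manually confirmed abnormalities and faults output by the result review module are labeled by the module in which the fault occurred, the fault type, the fault cause, and so on; in the semi-supervised learning mode, a semi-supervised learning algorithm and a set of already labeled samples are used to label other, unlabeled data (the data that the model automatic training and evaluation module uses for training, testing, and validation); in the transfer learning mode, transfer learning and similar third-party labeled data are used to generate the labeled data required by this operation and maintenance system (the data that the model automatic training and evaluation module uses for training, testing, and validation).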
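The model automatic training and evaluation module is configured to continuously generate and update multiple operation and maintenance models and knowledge bases and store them in the data storage module.
The model automatic training and evaluation module continuously generates and updates multiple operation and maintenance models and knowledge bases in two ways: real-time data processing modeling and evaluation, and batch data processing modeling and evaluation. Real-time data processing modeling and evaluation extracts real-time log data from the real-time database in the data storage module, processes the data according to the requirements of the real-time task, for example by sorting it in chronological order and segmenting it by a specific time window, and then uses simple relationship determination, statistical analysis, and the like to find and extract abnormal patterns. Batch data processing modeling and evaluation depends on the operation and maintenance task and on how much labeled data has been prepared; for example, the available modeling and evaluation methods are single-model training and evaluation, integrated-model training and evaluation, and incremental-model training and evaluation.
The main goal of the model automatic training and evaluation module is to generate and update the real-time models, general models, knowledge bases, and incremental models that the operation and maintenance management and task execution module needs to call for automatic anomaly discovery, rapid fault location, and early fault warning. The automatic training and evaluation module is further divided into four submodules: data processing, single-model training and evaluation, integrated-model training and evaluation, and incremental-model training and evaluation. Each submodule has a different role and function; in use, the methods in each submodule are selected in turn, according to the operation and maintenance task and the data quality, for data preprocessing, model training, and model evaluation.
Single-model training and evaluation includes: selecting a suitable algorithm from among supervised, unsupervised, and semi-supervised models according to the task type, the abnormality and fault labeling data, and the training, test, and validation data. For example, in abnormal pattern discovery, abnormal patterns vary widely but occur infrequently, so few samples are available and unsupervised models are generally the main choice; fault location and early fault warning generally rely mainly on supervised models, supplemented by semi-supervised models.
Integrated-model training and evaluation includes: when the result of training and evaluating a single model is unstable, combining multiple single models in a suitable integration mode to obtain a stable, optimal result.
Incremental-model training and evaluation includes: after new log data arrives, performing update operations on the existing operation and maintenance models, such as updating model parameters, retraining the models, and re-evaluating them.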
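As one possible illustration of updating model parameters from newly arrived data only, without recomputing over all stored logs, the sketch below maintains a running mean and variance with Welford's online algorithm. The choice of this algorithm and the class name `RunningStats` are assumptions for this example; the application does not prescribe a specific incremental update method.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Running mean/variance updated one observation at a time (Welford's algorithm)."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        """Fold one new observation into the statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        """Population variance of everything seen so far (0.0 if empty)."""
        return self.m2 / self.n if self.n else 0.0
```

Parameters maintained this way could feed a threshold-style model, so that each new batch of log metrics adjusts the model without a full retraining pass.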
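The multiple operation and maintenance models and knowledge bases include: real-time models, general models, knowledge bases, and incremental models. The real-time model includes the simple exception rules, static threshold parameters, etc. used to find anomalies in log data in real-time computing scenarios. The general model includes the algorithms and corresponding parameters formed after single-model training and evaluation in batch computing scenarios, the dynamic thresholds that change with time, data, etc., and the integrated frameworks with their corresponding algorithms and parameters formed after integrated-model training and evaluation. The knowledge base includes the complex rules, association relationships, link propagation graphs, knowledge graphs, fault trees, etc. found at each stage of the comprehensive training and evaluation of the operation and maintenance models; these patterns can be applied directly to real-time log data for real-time anomaly detection, and can also be applied to batch log data to predict faults in advance. The incremental model includes the model parameter adjustments, model type adjustments, etc. that existing operation and maintenance models make to adapt to new data, covering both single-model increments and integrated-model increments.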
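In an embodiment, the model automatic training and evaluation module is started on demand by the operation and maintenance management and task execution module and runs by category: a) Real-time data processing, modeling, and evaluation: real-time log data is extracted from the real-time database in the data storage module and processed according to the requirements of the real-time task, for example by sorting it in chronological order and segmenting it by a specific time window, and simple relationship determination, statistical analysis, and the like are then used to find and extract abnormal patterns. b) Batch data processing, modeling, and evaluation: depending on the operation and maintenance task and on how much labeled data has been prepared, the available modeling and evaluation methods are, for example, single-model training and evaluation, integrated-model training and evaluation, and incremental-model training and evaluation. Single-model training and evaluation mainly selects a suitable algorithm from among supervised, unsupervised, and semi-supervised models according to the task type and the abnormality and fault labeling data; for example, in abnormal pattern discovery, abnormal patterns vary widely but occur infrequently, so few samples are available and unsupervised models are generally the main choice, whereas fault location and early fault warning generally rely mainly on supervised models, supplemented by semi-supervised models. Integrated-model training and evaluation compensates for the unstable results of a single model when the task type is polymorphic, by combining multiple single models in a suitable integration mode to obtain a stable, optimal result. Incremental-model training and evaluation keeps the existing operation and maintenance models up to date as new log data keeps arriving. The results of automatic model training and evaluation are stored in the data storage module as models and knowledge bases, which are divided into the following four types according to their application scenarios: a) Real-time models, which are mainly the simple exception rules, static threshold parameters, etc. used to find anomalies in log data in real-time computing scenarios. b) General models, which are mainly the algorithms and corresponding parameters formed after single-model training and evaluation in batch computing scenarios, the dynamic thresholds that change with time, data, etc., and the integrated frameworks with their corresponding algorithms and parameters formed after integrated-model training and evaluation. c) Knowledge bases, which are mainly the complex rules, association relationships, link propagation graphs, knowledge graphs, fault trees, etc. found at each stage of the comprehensive training and evaluation of the operation and maintenance models; these patterns can be applied directly to real-time log data for real-time anomaly detection, and can also be applied to batch log data to predict faults in advance. d) Incremental models: the models obtained from single-model training and evaluation and from integrated-model training and evaluation undergo an incremental performance evaluation, and those with good incremental performance are kept separately as incremental models, so that the whole intelligent operation and maintenance system can adapt to newly added data. When an incremental model is called, whether it is started on a schedule or by a trigger is chosen according to whether it recomputes over all data or computes incrementally over only the newly added data.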
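The operation and maintenance management and task execution module provides unified management and task execution for the operation and maintenance system: execution and result display of log query and key performance indicator (KPI) monitoring tasks, manual fault location and result display, execution of exception rule filtering and result display, setting of static thresholds and display of the execution results, calling of the models for automatic anomaly discovery and result display, calling of the models for rapid fault location and result display, calling of the models for early fault warning and result display, starting of the model automatic training and evaluation module and management of its results, and classified management and updating of multiple models. Compared with an ordinary operation and maintenance system, the new functions of this module are: calling of the models for automatic anomaly discovery and result display, calling of the models for rapid fault location and result display, calling of the models for early fault warning and result display, starting of the model automatic training and evaluation module and management of its results, and classified management and updating of multiple models.
In an embodiment, the operation and maintenance management and task execution module on the one hand provides functions such as log query, monitoring of multiple KPIs, anomaly discovery, and early fault warning, depending on the system configuration and on which models are available to call. On the other hand, it tracks the results of KPI monitoring, of exception rule filtering, of threshold violations, and of automatic abnormal pattern discovery, and, depending on the newly found abnormality and fault data and on how the existing abnormality and fault data has been labeled, calls one or more of the models or knowledge bases generated by the model automatic training and evaluation module to locate faults quickly and give the corresponding results. The operation and maintenance management and task execution module is responsible for outputting the operation and maintenance results.
The operation and maintenance management and task execution module keeps monitoring the collection of new log data and starts, in turn, the abnormality and fault labeling module and the model automatic training and evaluation module, which generate new models and knowledge bases or update the existing ones. Operation and maintenance tasks and result reviews are then executed iteratively, so that the operation and maintenance capability of the system updates itself, iterates, and evolves.
The result review module is configured to review the abnormalities and faults output by the operation and maintenance management and task execution module and output the abnormalities and faults confirmed by the review to the abnormality and fault labeling module.
The result review module is mainly responsible for manually reviewing and confirming the operation and maintenance results generated by the operation and maintenance management module; the abnormalities and faults confirmed as valid are passed to the abnormality and fault labeling module, where they serve as one way of labeling data and continuously expand and accumulate the labeled data.
The technical solution provided in Embodiment 2 of the present application can efficiently perform automatic anomaly discovery, rapid fault location, early fault warning, and so on when log data comes in many types and forms and operation and maintenance requirements are complex, and the entire intelligent operation and maintenance system can update adaptively, iterate on itself, and gradually evolve.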
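Embodiment 3
FIG. 4 is a schematic flowchart of the operation and maintenance method provided in Embodiment 3 of the present application. As shown in FIG. 4, the operation and maintenance method includes steps 401 to 406.
In step 401, the data collection module collects the multiple kinds of log source data required by the operation and maintenance system and stores them in the data storage module.
In step 402, the data storage module stores the log source data, operation and maintenance results, labeling results, models, and knowledge bases.
In step 403, the abnormality and fault labeling module continuously labels abnormalities and faults in part of the source data in the data storage module and stores the labeling results in the data storage module.
In step 404, the model automatic training and evaluation module continuously generates and updates multiple operation and maintenance models and knowledge bases and stores them in the data storage module.
In step 405, the operation and maintenance management and task execution module sets up and executes operation and maintenance tasks, calls the operation and maintenance models and knowledge bases, and stores and outputs the operation and maintenance results.
In step 406, the result review module reviews the abnormalities and faults output by the operation and maintenance management and task execution module and outputs the abnormalities and faults confirmed by the review to the abnormality and fault labeling module.
In an embodiment, the multiple kinds of log source data collected by the data collection module include: application system logs, operating system resource status logs, abnormal log data, streaming log data, detailed operation and maintenance records, and third-party labeled data.
For application system logs and operating system resource status logs, the data collection module uses periodic scanning and batch transmission; for abnormal log data and streaming log data, it uses real-time collection and real-time transmission; for detailed operation and maintenance records and third-party labeled data, it uses periodic scanning and batch transmission.
The source data that the abnormality and fault labeling module labels includes: the detailed operation and maintenance records and third-party labeled data stored in the data storage module, the abnormalities and faults output by the result review module after review and confirmation, and the data that the model automatic training and evaluation module uses for training, testing, and validation.
In an embodiment, the abnormality and fault labeling module labels abnormalities and faults in four modes: manual, semi-manual, semi-supervised learning, and transfer learning.
In the manual mode, detailed operation and maintenance records extracted on demand from the data storage module are labeled by the module in which the fault occurred, the fault type, and the fault cause.
In the semi-manual mode, the abnormalities and faults output by the result review module after review and confirmation are labeled by the module in which the fault occurred, the fault type, and the fault cause.
In the semi-supervised learning mode, a semi-supervised learning algorithm and a set of already labeled samples are used to label the data that the model automatic training and evaluation module uses for training, testing, and validation.
In the transfer learning mode, transfer learning is applied to similar third-party labeled data to generate the labeled data required by the operation and maintenance system.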
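In an embodiment, the model automatic training and evaluation module generates and updates the multiple operation and maintenance models and knowledge bases in two ways: real-time data processing modeling and evaluation, and batch data processing modeling and evaluation.
Real-time data processing modeling and evaluation means extracting real-time log data from the real-time database in the data storage module, processing the data according to the requirements of the real-time task (sorting the data in chronological order and segmenting it by a specific time window), and then using simple relationship determination and statistical analysis to find and extract abnormal patterns.
Batch data processing modeling and evaluation means that, depending on the operation and maintenance task and on how much labeled data has been prepared, single-model training and evaluation, integrated-model training and evaluation, or incremental-model training and evaluation is selected for the abnormal patterns found and extracted, generating multiple operation and maintenance models and knowledge bases.
Single-model training and evaluation includes: selecting a suitable algorithm from among supervised, unsupervised, and semi-supervised models according to the task type, the abnormality and fault labeling data, and the training, test, and validation data, then training and evaluating it to generate a single model.
Integrated-model training and evaluation includes: when the result of single-model training and evaluation is unstable, combining multiple single models in a suitable integration mode to obtain a stable, optimal result.
Incremental-model training and evaluation includes: after new log data arrives, performing update operations on the existing operation and maintenance models, namely updating model parameters, retraining the models, and re-evaluating them.
In an embodiment, the multiple operation and maintenance models and knowledge bases include: real-time models, general models, knowledge bases, and incremental models.
The real-time model includes the simple exception rules and static threshold parameters used to find anomalies in log data in real-time computing scenarios.
The general model includes the algorithms and corresponding parameters formed after single-model training and evaluation in batch computing scenarios, the dynamic thresholds that change with time and data, and the integrated frameworks with their corresponding algorithms and parameters formed after integrated-model training and evaluation.
The knowledge base includes the complex rules, association relationships, link propagation graphs, knowledge graphs, and fault trees found at each stage of the comprehensive training and evaluation of the models.
The incremental model includes the model parameter adjustments and model type adjustments that existing operation and maintenance models make to adapt to new data, covering both single-model increments and integrated-model increments.
The technical solution provided in Embodiment 3 of the present application is described in detail below through a specific embodiment.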
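Embodiment 4
FIG. 5 is a schematic flowchart of the operation and maintenance method provided in Embodiment 4 of the present application. As shown in FIG. 5, the operation and maintenance method includes steps 501 to 506.
In step 501, data is collected.
The data collection includes: 1. real-time collection of abnormal logs, which collects in real time the abnormal logs of important applications/operations in the working cluster; 2. collection of operating system resource status, and batch collection of application system logs; 4. collection of third-party labeled data, which compensates for the operation and maintenance system's shortage of abnormality and fault labeling data, mainly by using transfer learning to migrate similar external labeled data; 5. collection of the detailed operation and maintenance record table, which is used directly as abnormality and fault labeling data for the operation and maintenance system; 6. collection of streaming log data, which collects in real time streaming data such as transaction-type and real-time transmission/operation-type data in a big data environment.
In step 502, the data is stored.
The data storage includes: 1. log source data storage, generally in a file system; 2. log analysis result storage, generally in a database or data warehouse; 3. labeling result storage, which saves the labeling results generated by the abnormality and fault labeling module; 4. model and knowledge base storage, which saves the various models and knowledge bases generated by the model automatic training and evaluation module.
In step 503, abnormalities and faults are labeled.
The abnormality and fault labeling includes: 1. abnormal event labeling, which confirms the abnormal data collected in the system and labels the events that are truly abnormal; 2. fault type labeling, which labels the fault data collected in the system together with its fault type; 3. labeled data migration, which uses transfer learning to turn third-party labeled data into abnormality and fault labeling data that the system can use.
In step 504, models are automatically trained and evaluated.
Automatic model training and evaluation includes: 1. data preprocessing, which is responsible for data preparation in the model automatic training and evaluation module, including but not limited to sample data extraction, data parsing and format unification, feature extraction and construction, and handling of data imbalance; 2. single-model training and evaluation, in which, according to the current state of the log source data storage and the labeling result storage in the data storage module and the task type (automatic anomaly discovery / rapid fault location / early fault warning), one or more algorithms are selected from unsupervised, semi-supervised, and supervised model training and evaluation for training, testing, and evaluation, and the resulting algorithms and parameters, association relationships, link propagation, complex rules, knowledge graphs, fault trees, etc. are saved as models or knowledge bases in the model and knowledge base storage of the data storage module; 3. integrated-model training and evaluation, which, on the basis of single-model training and evaluation, may be selected to optimize the models further depending on their stability and evaluation results; 4. incremental-model training and evaluation, through which the existing models and knowledge bases are updated as the collected data keeps growing.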
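In step 505, operation and maintenance management and task execution are performed.
Operation and maintenance management and task execution include: execution and result display of log query and KPI monitoring tasks, manual fault location and result display, execution of exception rule filtering and result display, setting of static thresholds and display of the execution results, calling of the models for automatic anomaly discovery and result display, calling of the models for rapid fault location and result display, calling of the models for early fault warning and result display, starting of the model automatic training and evaluation module and management of its results, classified management and updating of multiple models, and so on.
In step 506, the results are reviewed.
The result review includes: reviewing the abnormality and fault related results from the operation and maintenance management and task execution module; on the one hand, all abnormalities and faults are output after the review, and on the other hand, the confirmed abnormalities and faults are passed to the abnormality and fault labeling module.
The above steps may be repeated periodically, and there is no fixed order among them.
The operation and maintenance methods provided in Embodiments 3 and 4 can be applied to the operation and maintenance systems provided in Embodiments 1 and 2 above.
The technical solutions provided in Embodiments 3 and 4 of the present application can efficiently perform automatic anomaly discovery, rapid fault location, early fault warning, and so on when log data comes in many types and forms and operation and maintenance requirements are complex, and the entire intelligent operation and maintenance system can update adaptively, iterate on itself, and gradually evolve.
Those of ordinary skill in the art will understand that all or some of the steps of the methods disclosed above, and the functional modules/units of the systems and devices, may be implemented as software, firmware, hardware, or suitable combinations thereof. In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the components may be implemented as software executed by a processor, such as a digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage medium includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology arranged to store information (such as computer-readable instructions, data structures, program modules, or other data). Computer storage media include, but are not limited to, random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be arranged to store the desired information and can be accessed by a computer. In addition, as is well known to those of ordinary skill in the art, communication media typically contain computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transmission mechanism, and may include any information delivery medium.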

Claims (10)

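  1. An operation and maintenance system, comprising the following interconnected modules: a data collection module, a data storage module, an abnormality and fault labeling module, a model automatic training and evaluation module, an operation and maintenance management and task execution module, and a result review module; wherein
    the data collection module is configured to collect the multiple kinds of log source data required by the operation and maintenance system and store the multiple kinds of log source data in the data storage module;
    the data storage module is configured to store the log source data, operation and maintenance results, labeling results, models, and knowledge bases;
    the abnormality and fault labeling module is configured to continuously label abnormalities and faults in part of the source data in the data storage module and store the labeling results in the data storage module;
    the model automatic training and evaluation module is configured to continuously generate and update multiple operation and maintenance models and knowledge bases and store the multiple operation and maintenance models and knowledge bases in the data storage module;
    the operation and maintenance management and task execution module is configured to set up and execute operation and maintenance tasks, call the operation and maintenance models and knowledge bases, and store and output the operation and maintenance results;
    the result review module is configured to review the abnormalities and faults output by the operation and maintenance management and task execution module and output the abnormalities and faults confirmed by the review to the abnormality and fault labeling module.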
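  2. The operation and maintenance system according to claim 1, wherein
    the multiple kinds of log source data collected by the data collection module include: application system logs, operating system resource status logs, abnormal log data, streaming log data, detailed operation and maintenance records, and third-party labeled data;
    wherein, for application system logs and operating system resource status logs, the data collection module uses a collection mode of periodic scanning and batch transmission; for abnormal log data and streaming log data, the data collection module uses a collection mode of real-time collection and real-time transmission; for detailed operation and maintenance records and third-party labeled data, the data collection module uses a collection mode of periodic scanning and batch transmission;
    the source data that the abnormality and fault labeling module labels includes: the detailed operation and maintenance records and third-party labeled data stored in the data storage module, the abnormalities and faults output by the result review module after review and confirmation, and the data that the model automatic training and evaluation module uses for training, testing, and validation.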
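  3. The operation and maintenance system according to claim 1, wherein
    the modes in which the abnormality and fault labeling module labels abnormalities and faults include: a manual mode, a semi-manual mode, a semi-supervised learning mode, and a transfer learning mode;
    in the manual mode, detailed operation and maintenance records extracted on demand from the data storage module are labeled by the module in which the fault occurred, the fault type, and the fault cause;
    in the semi-manual mode, the abnormalities and faults output by the result review module after review and confirmation are labeled by the module in which the fault occurred, the fault type, and the fault cause;
    in the semi-supervised learning mode, a semi-supervised learning algorithm and a set of already labeled samples are used to label the data that the model automatic training and evaluation module uses for training, testing, and validation;
    in the transfer learning mode, transfer learning is applied to similar third-party labeled data to generate the labeled data required by the operation and maintenance system.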
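  4. The operation and maintenance system according to claim 1, wherein
    the ways in which the model automatic training and evaluation module generates and updates the multiple operation and maintenance models and knowledge bases include: real-time data processing modeling and evaluation, and batch data processing modeling and evaluation;
    the real-time data processing modeling and evaluation means extracting real-time log data from the real-time database in the data storage module, processing the data according to the requirements of the real-time task, sorting the data in chronological order, segmenting the data by a specific time window, and using relationship determination and statistical analysis to find and extract abnormal patterns;
    the batch data processing modeling and evaluation means that, depending on the operation and maintenance task and on how much labeled data has been prepared, single-model training and evaluation, integrated-model training and evaluation, or incremental-model training and evaluation is selected for the abnormal patterns found and extracted, generating multiple operation and maintenance models and knowledge bases;
    wherein the single-model training and evaluation includes: selecting a corresponding model from among supervised models, unsupervised models, and semi-supervised models according to the task type, the abnormality and fault labeling data, and the training, test, and validation data, then training and evaluating it to generate a single model;
    the integrated-model training and evaluation includes: when the result of single-model training and evaluation is unstable, applying a corresponding integration mode to multiple single models to obtain a stable result;
    the incremental-model training and evaluation includes: after new log data arrives, performing model parameter updating, model retraining, and evaluation update operations on the existing operation and maintenance models.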
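  5. The operation and maintenance system according to claim 4, wherein
    the multiple operation and maintenance models and knowledge bases include: real-time models, general models, knowledge bases, and incremental models;
    wherein the real-time model includes: the simple exception rules and static threshold parameters used to find anomalies in log data in real-time computing scenarios;
    the general model includes: the algorithms and corresponding parameters formed after single-model training and evaluation in batch computing scenarios, the dynamic thresholds that change with time and data, and the integrated frameworks with their corresponding algorithms and parameters formed after integrated-model training and evaluation;
    the knowledge base includes: the complex rules, association relationships, link propagation graphs, knowledge graphs, and fault trees found at each stage of the comprehensive training and evaluation of the models;
    the incremental model includes: the model parameter adjustments and model type adjustments that existing operation and maintenance models make to adapt to new data, the incremental model including single-model increments and integrated-model increments.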
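  6. An operation and maintenance method, comprising:
    a data collection module collecting the multiple kinds of log source data required by an operation and maintenance system and storing the multiple kinds of log source data in a data storage module;
    the data storage module storing the log source data, operation and maintenance results, labeling results, models, and knowledge bases;
    an abnormality and fault labeling module continuously labeling abnormalities and faults in part of the source data in the data storage module and storing the labeling results in the data storage module;
    a model automatic training and evaluation module continuously generating and updating multiple operation and maintenance models and knowledge bases and storing the multiple operation and maintenance models and knowledge bases in the data storage module;
    an operation and maintenance management and task execution module setting up and executing operation and maintenance tasks, calling the operation and maintenance models and knowledge bases, and storing and outputting the operation and maintenance results;
    a result review module reviewing the abnormalities and faults output by the operation and maintenance management and task execution module and outputting the abnormalities and faults confirmed by the review to the abnormality and fault labeling module.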
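  7. The operation and maintenance method according to claim 6, wherein
    the multiple kinds of log source data collected by the data collection module include: application system logs, operating system resource status logs, abnormal log data, streaming log data, detailed operation and maintenance records, and third-party labeled data;
    wherein, for application system logs and operating system resource status logs, the data collection module uses a collection mode of periodic scanning and batch transmission; for abnormal log data and streaming log data, the data collection module uses a collection mode of real-time collection and real-time transmission; for detailed operation and maintenance records and third-party labeled data, the data collection module uses a collection mode of periodic scanning and batch transmission;
    the source data that the abnormality and fault labeling module labels includes: the detailed operation and maintenance records and third-party labeled data stored in the data storage module, the abnormalities and faults output by the result review module after review and confirmation, and the data that the model automatic training and evaluation module uses for training, testing, and validation.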
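  8. The operation and maintenance method according to claim 6, wherein
    the manners in which the anomaly and fault labeling module performs anomaly and fault labeling comprise: a manual manner, a semi-manual manner, a semi-supervised learning manner, and a transfer learning manner;
    the manual manner means labeling, by the module in which the fault occurred, the fault type, and the fault cause, the detailed operation and maintenance records extracted on demand from the data storage module;
    the semi-manual manner means labeling, by the module in which the fault occurred, the fault type, and the fault cause, the reviewed and confirmed anomalies and faults output by the result review module;
    the semi-supervised learning manner means using a semi-supervised learning algorithm and a subset of already labeled samples to label the data used by the automatic model training and evaluation module for training, testing, and validation;
    the transfer learning manner means using transfer learning techniques to learn from similar third-party labeled data and generate the labeled data required by the operation and maintenance system.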
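The claim leaves the semi-supervised learning algorithm open. One simple member of that family is self-training with a nearest-centroid classifier, sketched below under that assumption: a few labeled samples seed the class centroids, and unlabeled samples are adopted as pseudo-labeled only when they are clearly closer to one centroid than to any other. The function names and the `margin` rule are hypothetical.

```python
def centroids(labeled):
    """labeled: list of (feature_vector, label). Returns {label: centroid}."""
    sums, counts = {}, {}
    for x, y in labeled:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def self_train(labeled, unlabeled, rounds=5, margin=2.0):
    """Iteratively pseudo-label samples that are unambiguous.

    A sample is adopted when the squared distance to the second-nearest
    centroid is at least `margin` times that to the nearest one.
    Returns (labeled_including_pseudo_labels, still_unlabeled).
    """
    labeled, pending = list(labeled), list(unlabeled)
    for _ in range(rounds):
        cents = centroids(labeled)
        still_pending = []
        for x in pending:
            ranked = sorted((sq_dist(x, c), y) for y, c in cents.items())
            best_d, best_y = ranked[0]
            second_d = ranked[1][0] if len(ranked) > 1 else float("inf")
            if second_d >= margin * best_d:
                labeled.append((x, best_y))
            else:
                still_pending.append(x)
        if len(still_pending) == len(pending):
            break  # no progress in this round
        pending = still_pending
    return labeled, pending
```

Samples that remain in the second return value are the ambiguous ones, which could then be routed to the manual or semi-manual manner.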
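  9. The operation and maintenance method according to claim 6, wherein
    the manners in which the automatic model training and evaluation module generates and updates the multiple operation and maintenance models and the knowledge base comprise: real-time data processing modeling and evaluation, and batch data processing modeling and evaluation;
    the real-time data processing modeling and evaluation means extracting real-time log data from a real-time database in the data storage module, processing the data according to real-time task requirements, sorting the data in chronological order, slicing the data by a specified time window, and discovering and extracting anomaly patterns by means of relationship determination and statistical analysis;
    the batch data processing modeling and evaluation means selecting, according to the operation and maintenance task and the readiness of the labeled data, among single-model training and evaluation, ensemble-model training and evaluation, and incremental-model training and evaluation for the discovered and extracted anomaly patterns, to generate the multiple operation and maintenance models and the knowledge base;
    wherein the single-model training and evaluation comprises: selecting a corresponding model from the three categories of supervised models, unsupervised models, and semi-supervised models according to the task type, the anomaly and fault labeled data, and the training, testing, and validation data, and training and evaluating it to generate a single model;
    the ensemble-model training and evaluation comprises: when the results of single-model training and evaluation are unstable, applying a corresponding ensemble mode to multiple single models to obtain stable results;
    the incremental-model training and evaluation comprises: after new log data arrives, performing model parameter updates, model retraining, and evaluation updates on the existing operation and maintenance models.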
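The claim does not name a specific ensemble mode. Majority voting over several single models is one common choice and is used here as an assumed example; `majority_vote` and the threshold-based toy models in the usage note are hypothetical.

```python
from collections import Counter


def majority_vote(models, sample):
    """Return the label predicted by most of the single models.

    models: iterable of callables, each mapping a sample to a label.
    On a tie, the label that was voted first wins (Counter keeps
    insertion order for equal counts).
    """
    votes = Counter(model(sample) for model in models)
    label, _count = votes.most_common(1)[0]
    return label
```

For example, three detectors that flag values above 50, 70, and 90 respectively would jointly label 75 as an anomaly (two votes to one) and 60 as normal, which smooths out the disagreement between individual models.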
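  10. The operation and maintenance method according to claim 9, wherein
    the multiple operation and maintenance models and the knowledge base comprise: a real-time model, a general model, a knowledge base, and an incremental model;
    wherein the real-time model comprises: simple anomaly rules and static threshold parameters discovered in log data in real-time computing scenarios;
    the general model comprises: the algorithms and corresponding parameters produced by single-model training and evaluation in batch computing scenarios, dynamic thresholds that vary with time and data, and the ensemble framework and corresponding algorithms and parameters produced by ensemble-model training and evaluation;
    the knowledge base comprises: complex rules, association relationships, link propagation graphs, knowledge graphs, and fault trees discovered at each stage of comprehensive model training and evaluation;
    the incremental model comprises: model parameter adjustments and model type adjustments made so that existing operation and maintenance models adapt to new data, the incremental model comprising single-model increments and ensemble-model increments.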
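A model parameter adjustment in response to new data can be as small as updating running statistics without revisiting old records. The sketch below uses Welford's online algorithm for mean and variance as an assumed example of such an incremental update; the patent does not prescribe this technique, and the class name `RunningStats` is hypothetical.

```python
class RunningStats:
    """Mean and population variance updated one observation at a time."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the statistics (Welford's method)."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of everything seen so far (0.0 if empty)."""
        return self._m2 / self.n if self.n else 0.0
```

Statistics maintained this way could feed a threshold that adapts as each new log record arrives, without retraining from the full history.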
PCT/CN2019/093812 2018-06-28 2019-06-28 Operation and maintenance system and method WO2020001642A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
KR1020217001839A KR102483025B1 (ko) 2018-06-28 2019-06-28 운영 유지 시스템 및 방법
EP19826453.3A EP3798846B1 (en) 2018-06-28 2019-06-28 Operation and maintenance system and method
US17/256,618 US11947438B2 (en) 2018-06-28 2019-06-28 Operation and maintenance system and method

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810689427.5A CN110659173B (zh) 2018-06-28 2018-06-28 Operation and maintenance system and method
CN201810689427.5 2018-06-28

Publications (1)

Publication Number Publication Date
WO2020001642A1 (zh) 2020-01-02

Family

ID=68985827

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/093812 WO2020001642A1 (zh) 2018-06-28 2019-06-28 Operation and maintenance system and method

Country Status (5)

Country Link
US (1) US11947438B2 (zh)
EP (1) EP3798846B1 (zh)
KR (1) KR102483025B1 (zh)
CN (1) CN110659173B (zh)
WO (1) WO2020001642A1 (zh)

Families Citing this family (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7184099B2 (ja) * 2019-02-01 2022-12-06 日本電気株式会社 運用支援装置、システム、方法及びプログラム
US11341017B1 (en) 2019-04-24 2022-05-24 Snap Inc. Staged release of updates with anomaly monitoring
CN112887119B (zh) * 2019-11-30 2022-09-16 华为技术有限公司 故障根因确定方法及装置、计算机存储介质
CN111259947A (zh) * 2020-01-13 2020-06-09 国网浙江省电力有限公司信息通信分公司 一种基于多模态学习的电力系统故障预警方法和系统
CN111541580A (zh) * 2020-03-23 2020-08-14 广东工业大学 一种应用于工业互联网的自适应异常检测系统
CN111611327A (zh) * 2020-05-28 2020-09-01 孙明松 一种运维数据处理的方法及装置
CN112084055A (zh) * 2020-08-19 2020-12-15 广州小鹏汽车科技有限公司 应用系统的故障定位方法、装置、电子设备及存储介质
CN114330760A (zh) * 2020-09-29 2022-04-12 领值(上海)信息技术有限公司 一种设备运维管理方法及系统
CN112269821A (zh) * 2020-10-30 2021-01-26 内蒙古电力(集团)有限责任公司乌海超高压供电局 一种基于大数据的电力设备状态分析方法
CN112804079B (zh) * 2020-12-10 2023-04-07 北京浪潮数据技术有限公司 云计算平台告警分析方法、装置、设备及存储介质
CN112711757B (zh) * 2020-12-23 2022-09-16 光大兴陇信托有限责任公司 一种基于大数据平台的数据安全集中管控方法及系统
CN112783865A (zh) * 2021-01-29 2021-05-11 杭州优云软件有限公司 一种半监督人机结合的运维故障库生成方法及系统
CN113313280B (zh) * 2021-03-31 2023-09-19 阿里巴巴新加坡控股有限公司 云平台的巡检方法、电子设备及非易失性存储介质
CN113077289B (zh) * 2021-04-12 2022-08-19 上海耶汇市场营销策划有限公司 一种用于产品营销的社交平台运维系统
CN113204199A (zh) * 2021-04-26 2021-08-03 武汉卓尔信息科技有限公司 一种工业设备的远程运维系统及方法
CN113258678A (zh) * 2021-06-03 2021-08-13 长沙理工大学 一种智能配电柜故障抢修系统、方法及装置
CN113268891B (zh) * 2021-06-30 2022-06-03 云智慧(北京)科技有限公司 一种运维系统的建模方法和装置
US11868971B2 (en) * 2021-08-02 2024-01-09 Arch Systems Inc. Method for manufacturing system analysis and/or maintenance
CN113672427A (zh) * 2021-08-26 2021-11-19 北京来也网络科技有限公司 基于rpa及ai的异常处理方法、装置、设备及介质
CN115905417A (zh) * 2021-09-29 2023-04-04 中兴通讯股份有限公司 一种系统异常检测处理方法及装置
TWI806220B (zh) * 2021-11-04 2023-06-21 財團法人資訊工業策進會 異常評估系統與異常評估方法
CN114048365B (zh) * 2021-11-15 2022-10-21 江苏鼎驰电子科技有限公司 一种基于大数据流处理技术的运维监控治理方法
CN114205216B (zh) * 2021-12-07 2024-02-06 中国工商银行股份有限公司 微服务故障的根因定位方法、装置、电子设备和介质
CN114371687A (zh) * 2021-12-30 2022-04-19 苏州胜科设备技术有限公司 一种伺服驱动器快速测试方法
CN114912637B (zh) * 2022-05-21 2023-08-29 重庆大学 人机物知识图谱制造产线运维决策方法及系统、存储介质
CN115051930B (zh) * 2022-05-23 2023-05-12 中电信数智科技有限公司 基于AISecOps结合中台算法的弊端优化方法
CN114875999B (zh) * 2022-05-27 2023-11-21 上海威派格智慧水务股份有限公司 一种用于二次供水系统的泵房运维管理系统
CN114969163B (zh) * 2022-07-21 2022-12-09 北京宏数科技有限公司 一种基于大数据的设备运维方法及系统
CN115695150B (zh) * 2022-11-01 2023-08-08 广州城轨科技有限公司 一种基于分布式异构融合组网设备检测方法及装置
KR102541576B1 (ko) * 2023-02-06 2023-06-14 주식회사 마티아솔루션 머신비전 판정 모델의 서빙 시스템
CN116163943B (zh) * 2023-03-27 2023-09-08 蚌埠市联合压缩机制造有限公司 一种运行状态实时监测的压缩机
CN116187725B (zh) * 2023-04-27 2023-08-04 武汉新威奇科技有限公司 一种用于锻造自动线的锻造设备管理系统
CN116841792B (zh) * 2023-08-29 2023-11-17 北京轻松致远科技有限责任公司 一种应用程序开发故障修复方法
CN117325879B (zh) * 2023-10-07 2024-04-05 盐城工学院 一种四轮分布式电驱动汽车状态评估方法及系统
CN117709755B (zh) * 2024-02-04 2024-05-10 深圳市安达新材科技有限公司 一种基于云计算的光学膜片数据管理系统及方法

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105204978A (zh) * 2015-06-23 2015-12-30 北京百度网讯科技有限公司 基于机器学习的数据中心运行数据分析系统
CN106844138A (zh) * 2016-12-14 2017-06-13 北京奇艺世纪科技有限公司 运维报警系统及方法
CN108038049A (zh) * 2017-12-13 2018-05-15 西安电子科技大学 实时日志控制系统及控制方法、云计算系统及服务器
CN108173671A (zh) * 2016-12-07 2018-06-15 博彦科技股份有限公司 运维方法、装置及系统

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2014115768A (ja) 2012-12-07 2014-06-26 Toshiba Corp ログ判定システム、ログ判定基準構築装置及びログ判定方法
CN103346906B (zh) 2013-06-19 2016-07-13 华南师范大学 一种基于云计算的智能运维方法及系统
US10410135B2 (en) * 2015-05-21 2019-09-10 Software Ag Usa, Inc. Systems and/or methods for dynamic anomaly detection in machine sensor data
US10361919B2 (en) * 2015-11-09 2019-07-23 At&T Intellectual Property I, L.P. Self-healing and dynamic optimization of VM server cluster management in multi-cloud platform
CN106452829B (zh) 2016-01-21 2019-07-19 华南师范大学 一种基于bcc-knn的云计算中心智能运维方法及系统
US10769641B2 (en) * 2016-05-25 2020-09-08 Microsoft Technology Licensing, Llc Service request management in cloud computing systems
CN106095639A (zh) * 2016-05-30 2016-11-09 中国农业银行股份有限公司 一种集群亚健康预警方法及系统
CN106649034B (zh) * 2016-11-22 2020-08-28 北京锐安科技有限公司 一种可视化智能运维方法及平台
CN106600115A (zh) * 2016-11-28 2017-04-26 湖北华中电力科技开发有限责任公司 一种企业信息系统运维智能分析方法
KR101758870B1 (ko) 2017-02-13 2017-07-18 주식회사 온더 마이닝 관리 시스템 및 이를 이용한 마이닝 관리 방법
CN107332685A (zh) * 2017-05-22 2017-11-07 国网安徽省电力公司信息通信分公司 国网云中应用的一种基于大数据运维日志的方法
CN107358300A (zh) 2017-06-19 2017-11-17 北京至信普林科技有限公司 一种基于多平台自主预测的智能运维告警过滤方法及系统
CN107577588B (zh) 2017-09-26 2021-04-09 北京中安智达科技有限公司 一种海量日志数据智能运维系统
KR101856543B1 (ko) 2018-02-26 2018-05-11 주식회사 리앙커뮤니케이션즈 인공지능 기반의 장애 예측 시스템

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3798846A4 *

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112181960A (zh) * 2020-09-18 2021-01-05 杭州优云软件有限公司 一种基于AIOps的智能运维框架系统
CN112181960B (zh) * 2020-09-18 2022-05-31 杭州优云软件有限公司 一种基于AIOps的智能运维框架系统
CN113179173B (zh) * 2020-09-29 2024-03-22 北京速通科技有限公司 一种用于高速公路系统的运维监控系统
CN113179173A (zh) * 2020-09-29 2021-07-27 北京速通科技有限公司 一种用于高速公路系统的运维监控系统
CN112511213B (zh) * 2020-11-18 2022-07-22 四川安迪科技实业有限公司 基于日志分析的缺陷定位方法及系统
CN112511213A (zh) * 2020-11-18 2021-03-16 四川安迪科技实业有限公司 基于日志分析的缺陷定位方法及系统
CN114594737A (zh) * 2020-12-07 2022-06-07 北京福田康明斯发动机有限公司 一种监控发动机装配过程的优化方法及装置
CN112910691A (zh) * 2021-01-19 2021-06-04 中国工商银行股份有限公司 机房故障检测方法及装置
CN112766599A (zh) * 2021-01-29 2021-05-07 广州源创动力科技有限公司 一种基于深度强化学习的智能运维方法
CN113516360A (zh) * 2021-05-16 2021-10-19 中国建材检验认证集团云南合信有限公司 检测机构的检测仪器设备管理信息化系统及管理方法
CN113516360B (zh) * 2021-05-16 2023-06-30 国检测试控股集团云南有限公司 检测机构的检测仪器设备管理信息化系统及管理方法
CN113359664A (zh) * 2021-05-31 2021-09-07 海南文鳐科技有限公司 故障诊断与维护系统、方法、设备及存储介质
CN113651245A (zh) * 2021-08-16 2021-11-16 合肥市春华起重机械有限公司 一种起重机承载力监测系统
CN113651245B (zh) * 2021-08-16 2023-07-21 合肥市春华起重机械有限公司 一种起重机承载力监测系统
CN114880151A (zh) * 2022-04-25 2022-08-09 北京科杰科技有限公司 人工智能运维方法
CN114880151B (zh) * 2022-04-25 2023-01-13 北京科杰科技有限公司 人工智能运维方法
CN114897196A (zh) * 2022-05-11 2022-08-12 山东大卫国际建筑设计有限公司 一种办公建筑供水网络的运行管理方法、设备及介质
CN114897196B (zh) * 2022-05-11 2023-01-13 山东大卫国际建筑设计有限公司 一种办公建筑供水网络的运行管理方法、设备及介质
CN116305699A (zh) * 2023-05-11 2023-06-23 青岛研博数据信息技术有限公司 一种基于全方位感知的管道监督系统
CN116305699B (zh) * 2023-05-11 2023-08-18 青岛研博数据信息技术有限公司 一种基于全方位感知的管道监督系统
CN116760691A (zh) * 2023-07-06 2023-09-15 武昌理工学院 一种基于大数据技术的电信故障排除系统
CN117194201A (zh) * 2023-11-07 2023-12-08 中央军委政治工作部军事人力资源保障中心 一种业务系统的健康度评估及观测方法、装置
CN117670312A (zh) * 2024-01-30 2024-03-08 北京伽睿智能科技集团有限公司 一种远程辅助的设备故障维护系统
CN117670312B (zh) * 2024-01-30 2024-04-26 北京伽睿智能科技集团有限公司 一种远程辅助的设备故障维护系统

Also Published As

Publication number Publication date
US11947438B2 (en) 2024-04-02
EP3798846A4 (en) 2021-07-28
EP3798846A1 (en) 2021-03-31
KR20210019564A (ko) 2021-02-22
KR102483025B1 (ko) 2022-12-29
US20210271582A1 (en) 2021-09-02
EP3798846B1 (en) 2022-09-07
CN110659173A (zh) 2020-01-07
CN110659173B (zh) 2023-05-26

Similar Documents

Publication Publication Date Title
WO2020001642A1 (zh) Operation and maintenance system and method
CN108521339B (zh) 一种基于集群日志的反馈式节点故障处理方法及系统
CN112152830A (zh) 一种智能的故障根因分析方法及系统
WO2022083576A1 (zh) 一种网络功能虚拟化设备运行数据的分析方法及装置
CN111400198B (zh) 一种自适应的软件测试系统
CN109902084B (zh) 一种全自动检测与分析数据质量的系统及方法
CN113542039A (zh) 一种通过ai算法定位5g网络虚拟化跨层问题的方法
CN111259073A (zh) 基于日志、流量和业务访问的业务系统运行状态智能研判系统
CN112769605B (zh) 一种异构多云的运维管理方法及混合云平台
CN110489317B (zh) 基于工作流的云系统任务运行故障诊断方法与系统
US11704186B2 (en) Analysis of deep-level cause of fault of storage management
JPWO2019116418A1 (ja) 障害分析装置、障害分析方法および障害分析プログラム
CN114296975A (zh) 一种分布式系统调用链和日志融合异常检测方法
Xie et al. Logm: Log analysis for multiple components of hadoop platform
CN117390529A (zh) 多因素溯源的数据中台信息管理方法
CN117234785B (zh) 基于人工智能自查询的集控平台错误分析系统
CN116611813B (zh) 一种基于知识图谱的智能运维管理方法及系统
KR20220041600A (ko) 스마트공장 데이터 품질평가 방법
Li et al. MicroSketch: Lightweight and adaptive sketch based performance issue detection and localization in microservice systems
CN111522705A (zh) 一种工业大数据智能运维解决方法
WO2023224764A1 (en) Multi-modality root cause localization for cloud computing systems
CN115080286A (zh) 一种网络设备日志异常的发现方法及装置
CN112948154A (zh) 一种系统异常诊断方法、装置及存储介质
CN117170724A (zh) 用于检测业务异常的ai模型自动化更新方法、装置及设备
CN112968941B (zh) 一种基于边缘计算的数据采集和人机协同标注方法

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 19826453; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE
ENP Entry into the national phase. Ref document number: 2019826453; Country of ref document: EP; Effective date: 20201222
ENP Entry into the national phase. Ref document number: 20217001839; Country of ref document: KR; Kind code of ref document: A