CN118093311A - Intelligent fault healing method, device, equipment, storage medium and product - Google Patents

Intelligent fault healing method, device, equipment, storage medium and product

Info

Publication number
CN118093311A
Authority
CN
China
Prior art keywords
word frequency
log
fault
cure
healing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410187094.1A
Other languages
Chinese (zh)
Inventor
苏龙华
戴建东
杭跃斌
孙彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Group Jiangsu Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Group Jiangsu Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Group Jiangsu Co Ltd
Priority to CN202410187094.1A
Publication of CN118093311A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3065 Monitoring arrangements determined by the means or processing involved in reporting the monitored data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793 Remedial or corrective actions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/1734 Details of monitoring file system events, e.g. by the use of hooks, filter drivers, logs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/1805 Append-only file systems, e.g. using logs or journals to store data
    • G06F16/1815 Journaling file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses an intelligent fault healing method, device, equipment, storage medium and computer program product. The method collects application service log information from application log files; configures a log word frequency analysis strategy according to the application service log information; performs application log word frequency analysis according to the strategy to obtain a word frequency monitoring index; trains a word frequency detection model from the word frequency monitoring index; and performs anomaly detection with the word frequency detection model and fault healing according to the detection result. In this way, the log data of all applications is aggregated and displayed so that potential problems and anomalies can be identified and targeted fault healing can be performed. Anomalies and faults are discovered in time and can be responded to and handled quickly. The need for manual intervention is reduced, operation and maintenance efficiency and accuracy are improved, and system performance and user experience are improved.

Description

Intelligent fault healing method, device, equipment, storage medium and product
Technical Field
The invention relates to the technical field of cloud platforms, and in particular to an intelligent fault healing method, device, equipment, storage medium and computer program product.
Background
When a cloud platform system fails or behaves abnormally, operation and maintenance personnel spend a long time troubleshooting, repairing and recovering it, and must notify the relevant personnel in time. This usually requires shutdown maintenance or affects running services, so service continuity suffers; the work is time-consuming and laborious and is prone to operating errors or omissions. This approach is limited in efficiency, reliability and scalability, and cannot meet modern enterprises' requirements for efficiency, stability and elasticity.
Disclosure of Invention
The main purpose of the invention is to provide an intelligent fault healing method, device, equipment, storage medium and computer program product, aiming to solve the technical problem of low timeliness of fault handling in the prior art.
To achieve the above object, the present invention provides an intelligent fault healing method, comprising the following steps:
configuring a log word frequency analysis strategy according to application service log information of a collected application log file;
performing application log word frequency analysis according to the log word frequency analysis strategy to obtain a word frequency monitoring index;
training according to the word frequency monitoring index to obtain a word frequency detection model;
and performing anomaly detection through the word frequency detection model, and performing fault healing according to the detection result.
Optionally, performing application log word frequency analysis according to the log word frequency analysis strategy to obtain the word frequency monitoring index includes:
inputting the application service log information into a Kafka cluster, and consuming the application logs of the Kafka cluster by means of a stream processing library to obtain consumption data;
determining a log word rule according to the log word frequency analysis strategy;
and generating a log word frequency monitoring index according to the consumption data and the log word rule.
Optionally, generating the log word frequency monitoring index according to the consumption data and the log word rule includes:
comparing the consumption data with the log word rule;
determining compliant log data according to the comparison result;
and storing the compliant log data in a target cluster, and generating the log word frequency monitoring index through a log word frequency statistics service and the cluster data of the target cluster.
Optionally, training according to the word frequency monitoring index to obtain the word frequency detection model includes:
acquiring a historical fault record, and generating a training sample according to the historical fault record and the word frequency monitoring index;
and training according to the training sample to obtain the word frequency detection model.
Optionally, generating the training sample according to the historical fault record and the word frequency monitoring index includes:
determining an original index data time series according to the word frequency monitoring index;
dividing the original index data time series into a plurality of sample windows using a sliding window;
determining abnormal windows according to the historical fault record;
determining normal windows according to the sample windows and the abnormal windows;
and determining the training sample according to the normal windows.
Optionally, performing anomaly detection through the word frequency detection model and performing fault healing according to the detection result includes:
inputting a prediction sample into the word frequency detection model to obtain an output abnormal sample;
and determining a target fault healing action according to the abnormal sample, and performing fault healing by triggering the target fault healing action.
In addition, to achieve the above object, the present invention also provides an intelligent fault healing apparatus, comprising:
a strategy configuration module, configured to configure a log word frequency analysis strategy according to application service log information of a collected application log file;
an index analysis module, configured to perform application log word frequency analysis according to the log word frequency analysis strategy to obtain a word frequency monitoring index;
a model training module, configured to train according to the word frequency monitoring index to obtain a word frequency detection model;
and a fault healing module, configured to perform anomaly detection through the word frequency detection model and perform fault healing according to the detection result.
In addition, to achieve the above object, the present invention also provides an intelligent fault healing device, comprising: a memory, a processor, and an intelligent fault healing program stored on the memory and executable on the processor, the intelligent fault healing program being configured to implement the steps of the intelligent fault healing method described above.
In addition, to achieve the above object, the present invention also provides a storage medium on which an intelligent fault healing program is stored; when executed by a processor, the intelligent fault healing program implements the steps of the intelligent fault healing method described above.
Furthermore, to achieve the above object, the present invention also provides a computer program product comprising an intelligent fault healing program which, when executed by a processor, implements the steps of the intelligent fault healing method described above.
The invention collects application service log information from application log files; configures a log word frequency analysis strategy according to the application service log information; performs application log word frequency analysis according to the strategy to obtain a word frequency monitoring index; trains a word frequency detection model from the word frequency monitoring index; and performs anomaly detection with the word frequency detection model and fault healing according to the detection result. In this way, centralized log management and monitoring are provided: the log data of all applications can be aggregated and displayed so that potential problems and anomalies can be identified and targeted fault healing can be performed. Anomalies and faults are discovered in time and can be responded to and handled quickly. Through application log word frequency analysis and the training of an algorithm model, an automated fault healing process is realized, the need for manual intervention is reduced, operation and maintenance efficiency and accuracy are improved, and system performance and user experience are improved.
Drawings
FIG. 1 is a schematic diagram of the architecture of an intelligent fault cure device of a hardware operating environment in accordance with an embodiment of the present invention;
FIG. 2 is a flow chart of a first embodiment of the intelligent fault cure method of the present invention;
FIG. 3 is a flow chart of a second embodiment of the intelligent fault cure method of the present invention;
Fig. 4 is a block diagram of a first embodiment of the intelligent fault cure device of the present invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Referring to fig. 1, fig. 1 is a schematic structural diagram of an intelligent fault cure device of a hardware running environment according to an embodiment of the present invention.
As shown in fig. 1, the intelligent fault cure device may include: a processor 1001, such as a central processing unit (Central Processing Unit, CPU), a communication bus 1002, a user interface 1003, a network interface 1004, a memory 1005. Wherein the communication bus 1002 is used to enable connected communication between these components. The user interface 1003 may include a Display, an input unit such as a Keyboard (Keyboard), and the optional user interface 1003 may further include a standard wired interface, a wireless interface. The network interface 1004 may optionally include a standard wired interface, a Wireless interface (e.g., a Wireless-Fidelity (WI-FI) interface). The Memory 1005 may be a high-speed random access Memory (Random Access Memory, RAM) or a stable nonvolatile Memory (NVM), such as a disk Memory. The memory 1005 may also optionally be a storage device separate from the processor 1001 described above.
Those skilled in the art will appreciate that the structure shown in fig. 1 is not limiting of the intelligent fault cure device and may include more or fewer components than shown, or may combine certain components, or may be a different arrangement of components.
As shown in fig. 1, an operating system, a network communication module, a user interface module, and an intelligent fault cure program may be included in the memory 1005 as one type of storage medium.
In the intelligent fault cure device shown in fig. 1, the network interface 1004 is mainly used for data communication with a network server; the user interface 1003 is mainly used for data interaction with a user; the processor 1001 and the memory 1005 in the intelligent fault cure device of the present invention may be provided in the intelligent fault cure device, and the intelligent fault cure device calls the intelligent fault cure program stored in the memory 1005 through the processor 1001 and executes the intelligent fault cure method provided by the embodiment of the present invention.
The embodiment of the invention provides an intelligent fault healing method, and referring to fig. 2, fig. 2 is a schematic flow chart of a first embodiment of the intelligent fault healing method of the invention.
In this embodiment, the intelligent fault cure method includes the following steps:
Step S10: configuring a log word frequency analysis strategy according to the application service log information of the collected application log file.
It should be noted that the execution body of this embodiment may be a computing device with data processing, network communication and program running functions, such as a tablet computer, a personal computer or a mobile phone, or another electronic device capable of implementing the above functions, such as an intelligent fault healing device. This embodiment and the following embodiments are described by taking the intelligent fault healing device as an example.
It should be understood that when a system fails or behaves abnormally, operation and maintenance personnel spend a long time troubleshooting, repairing and recovering it, and must notify the relevant personnel in time. This usually requires shutdown maintenance or affects running services, so service continuity suffers; the work is time-consuming and laborious and is prone to operating errors or omissions. This approach is limited in efficiency, reliability and scalability, and cannot meet modern enterprises' requirements for efficiency, stability and elasticity. In the traditional operation and maintenance mode, fault handling generally relies on manual judgment and decision-making and involves complex tasks such as fault perception and process scheduling. Manual handling, however, has low timeliness, which may slow service recovery, and human factors may cause the problem to escalate. Fault healing, that is, automated fault handling, is a leading industry solution to this problem. With automated handling, a preset recovery process makes recovery more reliable. By using word frequency analysis and an adaptive fault healing algorithm, faults can be located and recovered from more quickly, which improves the service availability of the enterprise, reduces dependence on human resources, and allows fault healing to run unattended. The solution can effectively improve operation and maintenance efficiency and service quality, and saves enterprises cost and labor. In the scheme of this embodiment, the log collector Fluentd collects the log data generated by applications; effective application log collection can improve the reliability, stability and security of the system.
Application log word frequency analysis configuration management supports word frequency analysis on phrases containing spaces or on regular expressions, supports specifying additional fields as statistical dimensions, and adds a configuration alias mechanism to make word frequency statistics easier to query. The distributed word frequency analysis stream processing system uses Kafka Streams as its data processing framework. The system analyzes the log data of the target application according to the latest configuration and stores the fields related to the statistical dimensions, together with the aliases, in Elasticsearch, so that the statistical range can be delimited from this information. Word frequency monitoring index data is then generated from the analyzed data. Based on the application log word frequency data and in combination with the application's fault point data, deep learning is introduced for algorithm training, and a cascade model of an autoencoder and an adversarial network is constructed. After model training, the autoencoder and the discriminator in the model can judge whether a prediction sample is abnormal. For a normal sample the reconstruction error computed by the autoencoder is small; a large reconstruction error indicates an abnormal sample. For abnormal samples, the system automatically executes the corresponding fault healing action.
It should be noted that the log collector Fluentd is used to collect, transmit and forward application log files and to implement centralized management and analysis of logs. It supports a variety of input and output plug-ins and can be integrated with a variety of data sources and targets. Fluentd also has flexible data conversion and filtering functions and can process, filter and format log data as required.
It should be appreciated that the log collector Fluentd is deployed as a DaemonSet on each node of the Kubernetes (K8S) cluster.
In a specific implementation, when an application image runs, the corresponding log directory needs to be mounted as a Volume, and the application log is written to the host disk as files.
It should be noted that, the collector engine is responsible for collecting log files on the host disk and transmitting the collected log data to the Kafka cluster in the unified log cluster.
It should be appreciated that the application log word frequency analysis strategy is configured through a page and persisted so that the word frequency engine can use it to analyze the application log.
In implementations, word frequency analysis supports specifying phrases that contain spaces or regular expressions. Specific phrases or regular expressions can be specified as needed to count the word frequency of keywords in the log.
It should be noted that specifying additional fields as statistical dimensions is supported. In addition to the default word frequency statistics, additional fields may be specified as statistical dimensions as needed to analyze the log data in more detail.
It should be appreciated that a configuration alias mechanism is added to facilitate word frequency statistics queries. An alias can be added to a configuration item so that users can query with a more intuitive and understandable name, which improves the readability and maintainability of the configuration.
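The three configuration features described above (phrase or regular-expression matching, additional statistical dimensions, and aliases) can be illustrated with a minimal sketch. The class name WordFrequencyRule, the function count_matches, the record layout with "message" and "pod" fields, and the sample log lines are all hypothetical; the patent does not specify the configuration schema.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class WordFrequencyRule:
    """Hypothetical policy entry: one keyword pattern plus its query alias."""
    alias: str              # friendly name used when querying statistics
    pattern: str            # phrase (may contain spaces) or regular expression
    is_regex: bool = False
    dimensions: tuple = ()  # extra log fields used as statistical dimensions

    def compiled(self):
        # Plain phrases are escaped so that spaces and punctuation match literally.
        return re.compile(self.pattern if self.is_regex else re.escape(self.pattern))


def count_matches(rule, records):
    """Count occurrences of the rule's pattern, grouped by the rule's dimensions."""
    regex = rule.compiled()
    counts = Counter()
    for record in records:
        hits = len(regex.findall(record["message"]))
        if hits:
            key = (rule.alias,) + tuple(record.get(d) for d in rule.dimensions)
            counts[key] += hits
    return counts


# Invented sample records, for illustration only.
records = [
    {"message": "GET /orders failed: access timeout", "pod": "orders-1"},
    {"message": "access timeout after 3 retries; access timeout", "pod": "orders-2"},
    {"message": "request completed in 12 ms status=503", "pod": "orders-1"},
]
phrase_rule = WordFrequencyRule(alias="timeout", pattern="access timeout", dimensions=("pod",))
regex_rule = WordFrequencyRule(alias="5xx", pattern=r"status=5\d\d", is_regex=True)

phrase_counts = count_matches(phrase_rule, records)
regex_counts = count_matches(regex_rule, records)
```

Keying the counter by alias plus dimension values keeps the query side simple: a lookup by alias alone, or by alias and pod, needs no knowledge of the underlying pattern.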
Step S20: performing application log word frequency analysis according to the log word frequency analysis strategy to obtain a word frequency monitoring index.
It should be noted that the distributed word frequency analysis stream processing system uses Kafka Streams as its data processing framework. The system analyzes the log data of the target application according to the latest configuration and stores the fields related to the statistical dimensions, together with the aliases, in Elasticsearch, so that the statistical range can be delimited from this information. Word frequency monitoring index data is then generated from the analyzed data.
Further, in order to accurately obtain the word frequency monitoring index, step S20 includes: inputting the application service log information into a Kafka cluster, and consuming the cluster application log of the Kafka cluster according to a stream processing application program library to obtain consumption data; determining a log word rule according to the log word frequency analysis strategy; and generating a log word frequency monitoring index according to the consumption data and the log word rule.
It should be appreciated that log word frequency analysis is processed using Kafka Streams. Kafka Streams is a library for building real-time stream processing applications that can transform and aggregate streaming data. In log word frequency analysis, the application log is taken as the input stream and, in combination with the configured application log word frequency analysis strategy, is processed and analyzed by Kafka Streams to implement word frequency statistics, so that large-scale log data can be processed in real time, keywords extracted, and word frequencies calculated.
In an implementation, the log word frequency analysis engine, built on Kafka Streams, is deployed on the Kubernetes cluster as a Deployment, with volumeMounts mount information specified.
It should be noted that, the configuration file is installed in ConfigMap, and includes db.
It should be understood that the application log is output to the Kafka cluster after being collected by the collection application service.
It should be noted that, the log word frequency analysis engine reads the log word frequency policy configuration through the timing task and stores the configuration in the memory.
In a specific implementation, the log word frequency analysis engine uses Kafka Streams to consume messages from the Kafka cluster in real time, compares them with the log word frequency strategy configuration, extracts the log data that meets the requirements, and stores it in the Elasticsearch cluster.
It should be understood that the application log word frequency statistics service obtains the application's word frequency configuration and generates monitoring indexes. The generated monitoring indexes are subsequently used for algorithm training and anomaly detection, ultimately achieving application fault healing.
The monitoring index is output through the following procedure:
1) The word frequency statistics service starts a coroutine that periodically queries the component's word frequency configuration to obtain application information.
2) The word frequency statistics service calls an interface to query the current word frequency total.
3) The word frequency statistics service calls an interface to query Elasticsearch for the word frequency increment over the last 5 minutes, calculates the total word frequency, and then exposes a metrics service that outputs the word frequency as a monitoring index.
Formula: total word frequency = current word frequency + word frequency increment.
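The formula above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names refresh_totals and render_metric, the metric name log_word_frequency_total, and the Prometheus-style text line are hypothetical, since the patent does not describe the metrics format.

```python
def refresh_totals(current_totals, increments):
    """Apply total = current + increment for every configured alias."""
    aliases = set(current_totals) | set(increments)
    return {a: current_totals.get(a, 0) + increments.get(a, 0) for a in aliases}


def render_metric(alias, app, total):
    """Format one sample as a Prometheus-style text line (the format is an assumption)."""
    return f'log_word_frequency_total{{alias="{alias}",app="{app}"}} {total}'


# Invented numbers: totals known so far, and increments for the last 5 minutes.
current = {"timeout": 120, "oom": 3}
last_five_minutes = {"timeout": 7}

totals = refresh_totals(current, last_five_minutes)
line = render_metric("timeout", "orders", totals["timeout"])
```

An alias with no new matches in the window simply keeps its previous total, which is why missing increments default to zero.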
Further, in order to accurately generate the log word frequency monitoring index, the step of generating the log word frequency monitoring index according to the consumption data and the log word rule includes: comparing the consumption data with the log word rule; determining compliance log data according to the comparison result; and storing the compliance log data into a target cluster, and generating a log word frequency monitoring index through a log word frequency statistics service and cluster data of the target cluster.
It should be understood that the application log is first input to the Kafka cluster in real time, and the log analysis engine periodically reads the log word frequency configuration data and stores it in memory. Based on Kafka Streams, the log analysis engine consumes the application logs of the Kafka cluster in real time, compares the consumed data with the log word rules, extracts the log data that conforms to the rules, and stores it in the Elasticsearch (ES) cluster; finally, the log word frequency monitoring index is generated from the ES data.
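The compare-and-extract step of this flow can be sketched without any cluster, using an in-memory list in place of the consumed Kafka messages and another list in place of the ES cluster. The function names load_rules and filter_compliant, the two example rules, and the sample log lines are hypothetical and serve only to show the filtering logic.

```python
import re


def load_rules():
    """Stand-in for the periodic read of the word frequency configuration into memory."""
    return {
        "timeout": re.compile(r"access timeout"),
        "oom": re.compile(r"OutOfMemoryError"),
    }


def filter_compliant(messages, rules):
    """Yield (alias, message) for every consumed message that matches a rule."""
    for message in messages:
        for alias, regex in rules.items():
            if regex.search(message):
                yield alias, message


# Invented log lines standing in for messages consumed from the Kafka cluster.
consumed = [
    "WARN access timeout calling billing",
    "INFO request completed",
    "ERROR java.lang.OutOfMemoryError: Java heap space",
]

# The list stands in for the ES cluster that receives compliant log data.
store = list(filter_compliant(consumed, load_rules()))
```

Only the two matching lines reach the store; the informational line is dropped before storage, which keeps the downstream statistics limited to the configured keywords.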
Step S30: training according to the word frequency monitoring index to obtain a word frequency detection model.
In a specific implementation, to obtain the word frequency detection model, historical fault records are also introduced to construct training samples, and training is then performed to obtain the word frequency detection model.
Step S40: performing anomaly detection through the word frequency detection model, and performing fault healing according to the detection result.
After the word frequency detection model is trained, prediction samples are input, abnormal samples are identified, and the corresponding fault healing action is then invoked automatically to achieve automatic fault healing.
Further, in order to achieve automatic detection and fault cure, step S40 includes: inputting the prediction sample into the word frequency detection model to obtain an output abnormal sample; and determining a target fault healing action according to the abnormal sample, and performing fault healing by triggering the target fault healing action.
It should be understood that after model training is completed, the autoencoder and the discriminator in the model can judge whether a prediction sample is abnormal, and intelligent fault healing of the application is performed based on the anomaly detection result.
In a specific implementation, after model training is completed, the autoencoder and the discriminator in the model can judge whether a prediction sample is abnormal. For a normal sample the reconstruction error computed by the autoencoder is small; a large reconstruction error indicates an abnormal sample. Similarly, the encoder can encode a normal sample into a feature vector that confuses the discriminator, so a normal sample is judged true by the discriminator, whereas an abnormal sample is judged false. In the subsequent prediction process, the prediction sample p is therefore input into the model; an anomaly score s1 is calculated from the autoencoder output D(E(p)), and an anomaly score s2 is calculated from the adversarial network output G(E(p)). Finally, the two anomaly scores are combined by weighted average to obtain the final anomaly score s. When s is greater than a given threshold, the prediction sample is judged abnormal and fault healing is triggered; otherwise it is not triggered.
s1 = mse(p, D(E(p)))
s2 = log(1 - G(E(p)))
s = α * s1 + (1 - α) * s2
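The three formulas can be evaluated directly once the model outputs are available. The sketch below takes the reconstruction D(E(p)) and the discriminator output G(E(p)) as plain inputs rather than running a network; the function names, the default α of 0.5, the threshold of 0.0, the small eps clamp that avoids log(0), and all sample values are assumptions made for illustration.

```python
import math


def mse(p, reconstruction):
    """Mean squared error between the sample p and its reconstruction D(E(p))."""
    return sum((a - b) ** 2 for a, b in zip(p, reconstruction)) / len(p)


def anomaly_score(p, reconstruction, disc_output, alpha=0.5, eps=1e-12):
    """s = alpha * s1 + (1 - alpha) * s2, with s1 and s2 as defined above."""
    s1 = mse(p, reconstruction)
    # eps is an added safeguard (not in the source) for disc_output == 1.
    s2 = math.log(max(1.0 - disc_output, eps))
    return alpha * s1 + (1.0 - alpha) * s2


def is_abnormal(score, threshold):
    """A sample is abnormal when its score exceeds the given threshold."""
    return score > threshold


p = [1.0, 2.0, 3.0]
# Well reconstructed and judged "true" by the discriminator: low score.
normal_score = anomaly_score(p, [1.0, 2.0, 3.0], disc_output=0.9)
# Poorly reconstructed and judged "false" by the discriminator: high score.
abnormal_score = anomaly_score(p, [3.0, 4.0, 5.0], disc_output=0.1)
```

Both terms move in the same direction: a larger reconstruction error raises s1, and a discriminator output closer to 0 raises s2 toward its maximum of 0, so the weighted sum grows as the sample looks less normal.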
The system presets a variety of fault healing actions, associates them with the corresponding algorithm models, and triggers a fault healing action according to the anomaly detection result. For example, a corresponding algorithm model is constructed from the keywords related to 'access timeout' in the log word frequency. Anomaly detection is performed on the sample data; when the model predicts that an 'access timeout' will occur in the application at a future point in time, the application's fault healing action is triggered in advance, before the fault occurs.
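The association between algorithm models and preset healing actions can be pictured as a lookup table. Everything named in this sketch is hypothetical: the patent does not list concrete actions or model identifiers, so restart_application, scale_out_application, the model names and the threshold are placeholders, and the actions only return a description instead of touching a real system.

```python
def restart_application(app):
    """Placeholder action: a real system would call its orchestration API here."""
    return f"restart {app}"


def scale_out_application(app):
    """Placeholder action: a real system would add replicas here."""
    return f"scale out {app}"


# Hypothetical association of algorithm models with preset healing actions.
HEALING_ACTIONS = {
    "access_timeout_model": restart_application,
    "queue_backlog_model": scale_out_application,
}


def heal(model_name, app, score, threshold):
    """Trigger the action bound to the model when the anomaly score exceeds the threshold."""
    if score <= threshold:
        return None  # sample judged normal, nothing to do
    action = HEALING_ACTIONS.get(model_name)
    if action is None:
        return None  # no preset action is associated with this model
    return action(app)


triggered = heal("access_timeout_model", "orders", score=1.9, threshold=0.5)
skipped = heal("access_timeout_model", "orders", score=0.1, threshold=0.5)
```

Keeping the mapping in data rather than in branching code means a new fault type only needs a new model and a new table entry.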
This embodiment collects application service log information from application log files; configures a log word frequency analysis strategy according to the application service log information; performs application log word frequency analysis according to the strategy to obtain a word frequency monitoring index; trains a word frequency detection model from the word frequency monitoring index; and performs anomaly detection with the word frequency detection model and fault healing according to the detection result. In this way, centralized log management and monitoring are provided: the log data of all applications can be aggregated and displayed so that potential problems and anomalies can be identified and targeted fault healing can be performed. Anomalies and faults are discovered in time and can be responded to and handled quickly. Through application log word frequency analysis and the training of an algorithm model, an automated fault healing process is realized, the need for manual intervention is reduced, operation and maintenance efficiency and accuracy are improved, and system performance and user experience are improved.
Referring to fig. 3, fig. 3 is a schematic flow chart of a second embodiment of the intelligent fault cure method of the present invention.
Based on the first embodiment, in this embodiment, the step S30 includes:
Step S301: acquiring a historical fault record, and generating a training sample according to the historical fault record and the word frequency monitoring index.
Based on the application log word frequency monitoring index, and in combination with the fault data points of the application, deep learning is introduced to train the algorithm and to construct a cascade model of an autoencoder and a generative adversarial network.
Further, in order to accurately obtain the training samples, step S301 includes: determining an original index data time sequence according to the word frequency monitoring index; dividing the original index data time sequence into a plurality of sample windows in a sliding window mode; determining an abnormal window according to the historical fault record; determining a normal window according to the sample window and the abnormal window; and determining training samples according to the normal window.
It should be understood that the historical real fault points are marked based on the historical fault record. In general, anomaly detection tolerates a time delay within a controllable range, so the original index data time sequence is divided into a plurality of small window samples in a sliding window mode, where the size of the sliding window can be set within a controllable range according to the actual situation. Windows that contain a historical real fault point are then marked as abnormal windows, and all normal windows are combined into the training sample X = {X1, X2, ..., Xn}.
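The window construction can be sketched as follows, assuming the word frequency index is a one-dimensional series and the historical fault record has already been reduced to a list of positions in that series. The function name and the `step` parameter are hypothetical.

```python
import numpy as np

def build_training_windows(series, fault_indices, window_size, step=1):
    """Split a series into sliding windows and separate normal from abnormal.

    A window is abnormal if it covers at least one historical fault point;
    all remaining windows form the training sample X = {X1, ..., Xn}.
    """
    series = np.asarray(series, dtype=float)
    faults = set(fault_indices)
    normal, abnormal = [], []
    for start in range(0, len(series) - window_size + 1, step):
        window = series[start:start + window_size]
        covered = range(start, start + window_size)
        if any(i in faults for i in covered):
            abnormal.append(window)
        else:
            normal.append(window)
    return normal, abnormal
```

For a series of 10 points with one fault at position 5 and a window size of 3, the three windows starting at positions 3, 4 and 5 are abnormal and the other five are kept for training.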
Step S302: training according to the training sample to obtain a word frequency detection model.
In the first step, the training samples X are input into the model, and the encoder extracts from each training sample x a feature vector E(x), which serves as the input to both the decoder and the discriminator. In the second step, the decoder restores the feature vector to the output D(E(x)), which is compared with the original input to calculate the reconstruction loss LOSS1, used to update the parameters of the encoder and the decoder. The feature vector from the first step is also input to the discriminator to produce the output G(E(x)), and a vector z sampled from a Gaussian mixture distribution is input to the discriminator to produce the output G(z). These two outputs are used to calculate LOSS2, which updates the discriminator parameters, and LOSS3, which updates the encoder parameters.
Wherein,
LOSS1 = mse(x, D(E(x)))
LOSS2 = -log G(z) - log(1 - G(E(x)))
LOSS3 = log(1 - G(E(x)))
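The three losses can be evaluated numerically as in the sketch below. It only computes the loss values from outputs that are assumed to be given; the gradient-based parameter updates would require an automatic differentiation framework and are omitted. The function name and the clipping constant `eps` are hypothetical.

```python
import numpy as np

def cascade_losses(x, reconstruction, g_real, g_encoded, eps=1e-12):
    """Evaluate LOSS1, LOSS2 and LOSS3 for one batch.

    x              -- original input
    reconstruction -- D(E(x)), the decoder output
    g_real         -- G(z), discriminator output for sampled vectors z
    g_encoded      -- G(E(x)), discriminator output for encoded inputs
    """
    x = np.asarray(x, dtype=float)
    reconstruction = np.asarray(reconstruction, dtype=float)
    # Clip probabilities so that the logarithms stay finite.
    g_real = np.clip(np.asarray(g_real, dtype=float), eps, 1.0 - eps)
    g_encoded = np.clip(np.asarray(g_encoded, dtype=float), eps, 1.0 - eps)
    # LOSS1: reconstruction error, updates encoder and decoder.
    loss1 = float(np.mean((x - reconstruction) ** 2))
    # LOSS2: discriminator loss, treats z as real and E(x) as fake.
    loss2 = float(np.mean(-np.log(g_real)) + np.mean(-np.log(1.0 - g_encoded)))
    # LOSS3: encoder loss, minimized when the discriminator accepts E(x) as real.
    loss3 = float(np.mean(np.log(1.0 - g_encoded)))
    return loss1, loss2, loss3
```

Note that LOSS3 is the negative of the second term of LOSS2, which is what makes the encoder and the discriminator adversaries.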
The embodiment obtains a history fault record and generates a training sample according to the history fault record and the word frequency monitoring index; and training according to the training sample to obtain a word frequency detection model. In this way, the automated process of fault cure is achieved by applying log word frequency analysis and training of an algorithm model.
In addition, the embodiment of the invention also provides a storage medium, wherein the storage medium is stored with an intelligent fault cure program, and the intelligent fault cure program realizes the steps of the intelligent fault cure method when being executed by a processor.
In addition, the embodiment of the invention also provides a computer program product, which comprises an intelligent fault cure program, wherein the intelligent fault cure program realizes the steps of the intelligent fault cure method when being executed by a processor.
The specific implementation manner of the computer program product of the present invention is basically the same as that of the above embodiments of the intelligent fault cure method, and will not be repeated here.
Referring to fig. 4, fig. 4 is a block diagram showing the construction of a first embodiment of the intelligent fault cure device of the present invention.
As shown in fig. 4, the intelligent fault cure device according to the embodiment of the present invention includes:
The policy configuration module 10 is configured to configure a log word frequency analysis strategy according to application service log information of the collected application log file.
The index analysis module 20 is configured to perform application log word frequency analysis according to the log word frequency analysis strategy to obtain a word frequency monitoring index.
The model training module 30 is configured to train to obtain a word frequency detection model according to the word frequency monitoring index.
The fault cure module 40 is configured to perform anomaly detection through the word frequency detection model and to perform fault cure according to the detection result.
In this embodiment, application service log information of an application log file is collected; configuring a log word frequency analysis strategy according to the application service log information; performing application log word frequency analysis according to the log word frequency analysis strategy to obtain word frequency monitoring indexes; training according to the word frequency monitoring index to obtain a word frequency detection model; and performing abnormality detection through the word frequency detection model, and performing fault cure according to the detection result. In this way, centralized log management and monitoring functions are provided, log data of all application programs can be summarized and displayed to identify potential problems and anomalies, and targeted fault cure can be performed. Abnormality and fault can be found in time, and response and processing can be fast. By applying log word frequency analysis and training of an algorithm model, an automatic fault healing process is realized, the requirement of manual intervention is reduced, the operation and maintenance efficiency and accuracy are improved, and the performance and user experience of the system are improved.
In an embodiment, the index analysis module 20 is further configured to input the application service log information into a Kafka cluster, and consume a cluster application log of the Kafka cluster according to a stream processing application program library to obtain consumption data; determining a log word rule according to the log word frequency analysis strategy; and generating a log word frequency monitoring index according to the consumption data and the log word rule.
In an embodiment, the index analysis module 20 is further configured to compare the consumption data with the log word rule; determining compliance log data according to the comparison result; and storing the compliance log data into a target cluster, and generating a log word frequency monitoring index through a log word frequency statistics service and cluster data of the target cluster.
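The comparison and counting steps can be sketched with the standard library alone. The sketch makes several assumptions that are not stated in the embodiment: the consumption data is represented as in-memory (timestamp, message) pairs instead of being read from a Kafka cluster, the log word rule is a list of keywords, a record is compliant if it contains at least one keyword, and the index is a per-minute count.

```python
from collections import Counter
from datetime import datetime

def filter_compliant(records, keywords):
    """Keep records whose message contains at least one rule keyword."""
    return [(ts, msg) for ts, msg in records if any(k in msg for k in keywords)]

def word_frequency_index(records, keywords):
    """Count keyword occurrences per minute over the compliant records."""
    counts = Counter()
    for ts, msg in filter_compliant(records, keywords):
        bucket = ts.replace(second=0, microsecond=0)  # truncate to the minute
        for k in keywords:
            n = msg.count(k)
            if n:
                counts[(bucket, k)] += n
    return counts

# Stand-in for consumed log data; these messages are invented for illustration.
records = [
    (datetime(2024, 2, 19, 10, 0, 5), "access timeout on /api"),
    (datetime(2024, 2, 19, 10, 0, 40), "access timeout; retry after access timeout"),
    (datetime(2024, 2, 19, 10, 1, 2), "request ok"),
]
index = word_frequency_index(records, ["access timeout"])
```

The resulting per-minute counts form the kind of time sequence that the sliding-window step of the training procedure takes as input.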
In an embodiment, the model training module 30 is further configured to obtain a historical fault record, and generate a training sample according to the historical fault record and the word frequency monitoring index; and training according to the training sample to obtain a word frequency detection model.
In one embodiment, the model training module 30 is further configured to determine a time sequence of original indicator data according to the word frequency monitoring indicator; dividing the original index data time sequence into a plurality of sample windows in a sliding window mode; determining an abnormal window according to the historical fault record; determining a normal window according to the sample window and the abnormal window; and determining training samples according to the normal window.
In an embodiment, the fault cure module 40 is further configured to input a prediction sample into the word frequency detection model to obtain an output abnormal sample; and determining a target fault healing action according to the abnormal sample, and performing fault healing by triggering the target fault healing action.
Other embodiments or specific implementation manners of the intelligent fault cure device of the present invention may refer to the above method embodiments, and will not be described herein.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. read-only memory/random-access memory, magnetic disk, optical disk), comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method according to the embodiments of the present invention.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the invention. Any equivalent structure or equivalent process transformation made using the contents of this description and the drawings, or any direct or indirect application in other related technical fields, is likewise covered by the scope of patent protection of the present invention.

Claims (10)

1. An intelligent fault cure method, characterized in that the intelligent fault cure method comprises:
configuring a log word frequency analysis strategy according to application service log information of an acquired application log file;
performing application log word frequency analysis according to the log word frequency analysis strategy to obtain a word frequency monitoring index;
training according to the word frequency monitoring index to obtain a word frequency detection model; and
performing anomaly detection through the word frequency detection model, and performing fault cure according to the detection result.
2. The intelligent fault cure method according to claim 1, wherein the performing application log word frequency analysis according to the log word frequency analysis strategy to obtain a word frequency monitoring index comprises:
inputting the application service log information into a Kafka cluster, and consuming the cluster application log of the Kafka cluster according to a stream processing application program library to obtain consumption data;
determining a log word rule according to the log word frequency analysis strategy; and
generating a log word frequency monitoring index according to the consumption data and the log word rule.
3. The intelligent fault cure method according to claim 2, wherein the generating a log word frequency monitoring index according to the consumption data and the log word rule comprises:
comparing the consumption data with the log word rule;
determining compliance log data according to the comparison result; and
storing the compliance log data into a target cluster, and generating a log word frequency monitoring index through a log word frequency statistics service and cluster data of the target cluster.
4. The intelligent fault cure method according to claim 1, wherein the training according to the word frequency monitoring index to obtain a word frequency detection model comprises:
acquiring a historical fault record, and generating a training sample according to the historical fault record and the word frequency monitoring index; and
training according to the training sample to obtain a word frequency detection model.
5. The intelligent fault cure method according to claim 4, wherein the generating a training sample according to the historical fault record and the word frequency monitoring index comprises:
determining an original index data time sequence according to the word frequency monitoring index;
dividing the original index data time sequence into a plurality of sample windows in a sliding window mode;
determining an abnormal window according to the historical fault record;
determining a normal window according to the sample window and the abnormal window; and
determining training samples according to the normal window.
6. The intelligent fault cure method according to claim 1, wherein the performing anomaly detection through the word frequency detection model and performing fault cure according to the detection result comprises:
inputting a prediction sample into the word frequency detection model to obtain an output abnormal sample; and
determining a target fault healing action according to the abnormal sample, and performing fault cure by triggering the target fault healing action.
7. An intelligent fault cure device, characterized in that the intelligent fault cure device comprises:
a strategy configuration module, configured to configure a log word frequency analysis strategy according to application service log information of an acquired application log file;
an index analysis module, configured to perform application log word frequency analysis according to the log word frequency analysis strategy to obtain a word frequency monitoring index;
a model training module, configured to train according to the word frequency monitoring index to obtain a word frequency detection model; and
a fault cure module, configured to perform anomaly detection through the word frequency detection model and to perform fault cure according to the detection result.
8. Intelligent fault cure equipment, the equipment comprising: a memory, a processor, and an intelligent fault cure program stored on the memory and executable on the processor, the intelligent fault cure program being configured to implement the steps of the intelligent fault cure method of any one of claims 1 to 6.
9. A storage medium having stored thereon an intelligent fault cure program which, when executed by a processor, implements the steps of the intelligent fault cure method of any one of claims 1 to 6.
10. A computer program product comprising an intelligent fault cure program which, when executed by a processor, implements the steps of the intelligent fault cure method of any one of claims 1 to 6.
CN202410187094.1A 2024-02-19 2024-02-19 Intelligent fault healing method, device, equipment, storage medium and product Pending CN118093311A (en)


Publications (1)

Publication Number Publication Date
CN118093311A true CN118093311A (en) 2024-05-28

Family

ID=91162565


Country Status (1)

Country Link
CN (1) CN118093311A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination