CN118093311A - Intelligent fault healing method, device, equipment, storage medium and product - Google Patents

Intelligent fault healing method, device, equipment, storage medium and product

Info

Publication number
CN118093311A
Authority
CN
China
Prior art keywords
word frequency
log
fault
healing
cure
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410187094.1A
Other languages
Chinese (zh)
Inventor
苏龙华
戴建东
杭跃斌
孙彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Group Jiangsu Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Group Jiangsu Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and China Mobile Group Jiangsu Co Ltd
Priority to CN202410187094.1A
Publication of CN118093311A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3065 Monitoring arrangements determined by the means or processing involved in reporting the monitored data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793 Remedial or corrective actions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/1734 Details of monitoring file system events, e.g. by the use of hooks, filter drivers, logs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/1805 Append-only file systems, e.g. using logs or journals to store data
    • G06F16/1815 Journaling file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses an intelligent fault healing method, device, equipment, storage medium and computer program product. The method comprises: collecting application service log information from application log files; configuring a log word frequency analysis strategy according to the application service log information; performing application log word frequency analysis according to the log word frequency analysis strategy to obtain word frequency monitoring indicators; training a word frequency detection model on the word frequency monitoring indicators; and performing anomaly detection with the word frequency detection model and performing fault healing according to the detection result. In this way, the log data of all applications can be aggregated and displayed, potential problems and anomalies can be identified, and targeted fault healing can be performed. Anomalies and faults are discovered promptly and can be responded to and handled quickly. The need for manual intervention is reduced, operation and maintenance efficiency and accuracy are improved, and system performance and user experience are enhanced.

Description

Intelligent fault healing method, device, equipment, storage medium and product

Technical Field

The present invention relates to the field of cloud platform technology, and in particular to an intelligent fault healing method, device, equipment, storage medium and computer program product.

Background Art

When a cloud platform system fails or behaves abnormally, operation and maintenance personnel must spend considerable time troubleshooting, repairing and restoring the system, and must notify the relevant personnel promptly. This usually requires downtime for maintenance or otherwise affects business operations, which harms business continuity. It consumes time and effort and is prone to operational errors and omissions. This approach is limited in efficiency, reliability and scalability, and cannot meet the needs of modern enterprises for efficiency, stability and elasticity.

Summary of the Invention

The main purpose of the present invention is to provide an intelligent fault healing method, device, equipment, storage medium and computer program product, aiming to solve the technical problem that fault handling in the prior art is not timely.

To achieve the above object, the present invention provides an intelligent fault healing method, which comprises the following steps:

configuring a log word frequency analysis strategy according to application service log information collected from application log files;

performing application log word frequency analysis according to the log word frequency analysis strategy to obtain a word frequency monitoring indicator;

training a word frequency detection model according to the word frequency monitoring indicator;

performing anomaly detection with the word frequency detection model, and performing fault healing according to the detection result.

Optionally, performing application log word frequency analysis according to the log word frequency analysis strategy to obtain a word frequency monitoring indicator includes:

inputting the application service log information into a Kafka cluster, and consuming the cluster application logs of the Kafka cluster through a stream processing application library to obtain consumption data;

determining log word rules according to the log word frequency analysis strategy;

generating a log word frequency monitoring indicator according to the consumption data and the log word rules.

Optionally, generating a log word frequency monitoring indicator according to the consumption data and the log word rules includes:

comparing the consumption data with the log word rules;

determining compliant log data according to the comparison result;

storing the compliant log data in a target cluster, and generating the log word frequency monitoring indicator through a log word frequency statistics service and the cluster data of the target cluster.

Optionally, training a word frequency detection model according to the word frequency monitoring indicator includes:

obtaining historical fault records, and generating training samples according to the historical fault records and the word frequency monitoring indicator;

training the word frequency detection model on the training samples.

Optionally, generating training samples according to the historical fault records and the word frequency monitoring indicator includes:

determining a time series of raw indicator data according to the word frequency monitoring indicator;

dividing the time series of raw indicator data into multiple sample windows using a sliding window;

determining abnormal windows according to the historical fault records;

determining normal windows according to the sample windows and the abnormal windows;

determining the training samples according to the normal windows.
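The window-based sample generation described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the representation of historical fault records as index intervals, and the rule that a window counts as abnormal when it overlaps any fault interval are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def sliding_windows(series: Sequence[float], size: int,
                    step: int = 1) -> List[Tuple[int, List[float]]]:
    """Split a time series into (start_index, window) pairs of fixed size."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    return [(start, list(series[start:start + size]))
            for start in range(0, len(series) - size + 1, step)]


def normal_windows(series: Sequence[float], size: int,
                   fault_intervals: Sequence[Tuple[int, int]],
                   step: int = 1) -> List[List[float]]:
    """Keep only the windows that do not overlap a recorded fault interval.

    fault_intervals holds hypothetical (begin, end) index pairs, end exclusive,
    derived from historical fault records.
    """
    kept = []
    for start, window in sliding_windows(series, size, step):
        end = start + size
        overlaps = any(start < f_end and f_begin < end
                       for f_begin, f_end in fault_intervals)
        if not overlaps:
            kept.append(window)
    return kept


# Word frequency per interval; indices 4 and 5 fall inside a recorded fault.
series = [3, 4, 3, 5, 40, 42, 4, 3, 4, 5]
training_samples = normal_windows(series, size=3, fault_intervals=[(4, 6)])
```

Training only on normal windows matches the later use of reconstruction error: a model that has seen only normal behavior should reconstruct abnormal windows poorly.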

Optionally, performing anomaly detection with the word frequency detection model and performing fault healing according to the detection result includes:

inputting prediction samples into the word frequency detection model to obtain the abnormal samples it outputs;

determining a target fault healing action according to the abnormal samples, and performing fault healing by triggering the target fault healing action.

In addition, to achieve the above object, the present invention further provides an intelligent fault healing device, which comprises:

a strategy configuration module, configured to configure a log word frequency analysis strategy according to application service log information collected from application log files;

an indicator analysis module, configured to perform application log word frequency analysis according to the log word frequency analysis strategy to obtain a word frequency monitoring indicator;

a model training module, configured to train a word frequency detection model according to the word frequency monitoring indicator;

a fault healing module, configured to perform anomaly detection with the word frequency detection model and to perform fault healing according to the detection result.

In addition, to achieve the above object, the present invention further provides intelligent fault healing equipment, which includes: a memory, a processor, and an intelligent fault healing program stored in the memory and executable on the processor, the intelligent fault healing program being configured to implement the steps of the intelligent fault healing method described above.

In addition, to achieve the above object, the present invention further provides a storage medium on which an intelligent fault healing program is stored; when executed by a processor, the intelligent fault healing program implements the steps of the intelligent fault healing method described above.

In addition, to achieve the above object, the present invention further provides a computer program product that includes an intelligent fault healing program; when executed by a processor, the intelligent fault healing program implements the steps of the intelligent fault healing method described above.

The present invention collects application service log information from application log files; configures a log word frequency analysis strategy according to the application service log information; performs application log word frequency analysis according to the log word frequency analysis strategy to obtain a word frequency monitoring indicator; trains a word frequency detection model according to the word frequency monitoring indicator; and performs anomaly detection with the word frequency detection model and fault healing according to the detection result. This provides centralized log management and monitoring: the log data of all applications can be aggregated and displayed, potential problems and anomalies can be identified, and targeted fault healing can be performed. Anomalies and faults are discovered promptly and can be responded to and handled quickly. Through application log word frequency analysis and the training of an algorithm model, fault healing is automated, which reduces the need for manual intervention, improves operation and maintenance efficiency and accuracy, and enhances system performance and user experience.

Brief Description of the Drawings

FIG. 1 is a schematic structural diagram of the intelligent fault healing equipment in the hardware operating environment involved in an embodiment of the present invention;

FIG. 2 is a schematic flow chart of a first embodiment of the intelligent fault healing method of the present invention;

FIG. 3 is a schematic flow chart of a second embodiment of the intelligent fault healing method of the present invention;

FIG. 4 is a structural block diagram of a first embodiment of the intelligent fault healing device of the present invention.

The realization of the purpose, the functional features and the advantages of the present invention are further explained below in conjunction with the embodiments and with reference to the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are only intended to explain the present invention and are not intended to limit it.

Refer to FIG. 1, which is a schematic structural diagram of the intelligent fault healing equipment in the hardware operating environment involved in an embodiment of the present invention.

As shown in FIG. 1, the intelligent fault healing equipment may include: a processor 1001, such as a central processing unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 provides the connections and communication between these components. The user interface 1003 may include a display and an input unit such as a keyboard; optionally, the user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a Wireless-Fidelity (Wi-Fi) interface). The memory 1005 may be a high-speed random access memory (RAM) or a stable non-volatile memory (NVM), such as disk storage. Optionally, the memory 1005 may also be a storage device independent of the processor 1001.

Those skilled in the art will appreciate that the structure shown in FIG. 1 does not limit the intelligent fault healing equipment, which may include more or fewer components than shown, combine certain components, or arrange the components differently.

As shown in FIG. 1, the memory 1005, as a storage medium, may include an operating system, a network communication module, a user interface module, and an intelligent fault healing program.

In the intelligent fault healing equipment shown in FIG. 1, the network interface 1004 is mainly used for data communication with a network server, and the user interface 1003 is mainly used for data interaction with the user. The processor 1001 and the memory 1005 may be arranged in the intelligent fault healing equipment, which calls the intelligent fault healing program stored in the memory 1005 through the processor 1001 and executes the intelligent fault healing method provided by the embodiments of the present invention.

An embodiment of the present invention provides an intelligent fault healing method. Refer to FIG. 2, which is a schematic flow chart of a first embodiment of the intelligent fault healing method of the present invention.

In this embodiment, the intelligent fault healing method includes the following steps:

Step S10: configuring a log word frequency analysis strategy according to application service log information collected from application log files.

It should be noted that the execution subject of this embodiment may be a computing device with data processing, network communication and program execution capabilities, such as a tablet computer, a personal computer or a mobile phone, or another electronic device or intelligent fault healing equipment capable of the same functions. This embodiment and the following embodiments are described using the intelligent fault healing equipment as an example.

It should be understood that when a system fails or behaves abnormally, operation and maintenance personnel must spend considerable time troubleshooting, repairing and restoring the system, and must notify the relevant personnel promptly. This usually requires downtime for maintenance or otherwise affects business operations, which harms business continuity. It consumes time and effort and is prone to operational errors and omissions. This approach is limited in efficiency, reliability and scalability, and cannot meet the needs of modern enterprises for efficiency, stability and elasticity. In traditional operation and maintenance, fault handling usually depends on manual judgment and decision-making and involves complex tasks such as fault perception and traffic scheduling. Manual handling is slow, which can delay service recovery, and human error can make a problem worse. Fault healing, that is, automated fault handling, is the industry-leading answer to this problem. With automation, a predefined recovery procedure makes the recovery process more reliable. Using word frequency analysis and an adaptive fault healing algorithm, faults can be located and recovered from more quickly, which improves service availability, reduces dependence on human resources, and allows fault healing to run unattended. Such a solution improves operation and maintenance efficiency and service quality and saves cost and labor.

In the scheme of this embodiment, the log collector Fluentd collects the log data generated by applications; effective application log collection can improve the reliability, stability and security of the system. Configuration management for application log word frequency analysis supports specifying phrases containing spaces or regular expressions for word frequency analysis, supports specifying additional fields as statistical dimensions, and adds a configuration alias mechanism to make word frequency statistics easier to query. The distributed word frequency analysis stream processing system uses Kafka Streams as its data processing framework. It analyzes the log data of the target application with the latest configuration and stores the fields and aliases related to the statistical dimensions in Elasticsearch, so that the statistical scope can be delimited from this information. Word frequency monitoring indicator data is generated from the analyzed data.

Based on the application log word frequency data, combined with the application's fault point data, deep learning is used to train a cascaded model composed of an autoencoder and a generative adversarial network. After training, both the autoencoder and the discriminator in the model can judge whether a prediction sample is abnormal. For a normal sample, the reconstruction error computed by the autoencoder is small; a sample with a large reconstruction error is abnormal. For abnormal samples, the system automatically performs the corresponding fault healing action.
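The reconstruction-error criterion can be illustrated with a small sketch. It does not implement the autoencoder or the generative adversarial network from the text; a stand-in reconstructor that returns the mean profile of the normal training windows takes the model's place, and the mean squared error metric, the threshold value and all function names are assumptions made for the example.

```python
from typing import Callable, List, Sequence

Reconstructor = Callable[[Sequence[float]], Sequence[float]]


def reconstruction_error(sample: Sequence[float],
                         reconstruct: Reconstructor) -> float:
    """Mean squared error between a sample and its reconstruction."""
    rebuilt = reconstruct(sample)
    return sum((x - y) ** 2 for x, y in zip(sample, rebuilt)) / len(sample)


def is_abnormal(sample: Sequence[float], reconstruct: Reconstructor,
                threshold: float) -> bool:
    """Flag a sample whose reconstruction error exceeds the threshold."""
    return reconstruction_error(sample, reconstruct) > threshold


def mean_profile_reconstructor(normal: List[List[float]]) -> Reconstructor:
    """Stand-in for a trained autoencoder: always returns the mean normal window."""
    count = len(normal)
    profile = [sum(column) / count for column in zip(*normal)]
    return lambda sample: profile


normal = [[3, 4, 3], [4, 3, 5], [4, 3, 4], [3, 4, 5]]
reconstruct = mean_profile_reconstructor(normal)
THRESHOLD = 5.0  # hypothetical value; in practice it would be tuned on held-out data

quiet_window = [4, 4, 4]
burst_window = [40, 42, 4]
```

With these inputs the quiet window reconstructs closely and stays below the threshold, while the burst window produces a large error and would trigger the corresponding fault healing action.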

It should be noted that the log collector Fluentd collects, transmits and forwards application log files so that logs can be managed and analyzed centrally. It supports a variety of input and output plug-ins and can be integrated with many data sources and destinations. Fluentd also provides flexible data transformation and filtering, so log data can be processed, filtered and formatted as needed.

It should be understood that the log collector Fluentd is deployed as a DaemonSet on every node of the Kubernetes (K8S) cluster.

In a specific implementation, a running application image must mount the corresponding log directory as a Volume and write its application logs as files to the host disk.

It should be noted that the collector engine collects the log files on the host disk and transmits the collected log data to the Kafka cluster in the unified log cluster.

It should be understood that the application log word frequency analysis strategy is configured through a web page and stored persistently so that the word frequency engine can use it to analyze application logs.

In a specific implementation, phrases containing spaces or regular expressions can be specified for word frequency analysis. Specific phrases or regular expressions can be specified as needed to count the frequency of keywords in the logs.

It should be noted that additional fields can be specified as statistical dimensions. Beyond the default word frequency statistics, additional fields can be specified as statistical dimensions as needed, so that log data can be analyzed in more detail.

It should be understood that a configuration alias mechanism is added to make word frequency statistics easier to query. Aliases can be added to configuration items so that users can query with more intuitive names, which improves the readability and maintainability of the configuration.
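A word frequency strategy with these three features (phrase or regular expression, extra statistical dimensions, alias) might be modeled as below. This is an illustrative sketch only: the `WordFrequencyRule` class, its field names, and the `message` and `pod` log fields are hypothetical and do not come from the patent.

```python
import re
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass
class WordFrequencyRule:
    alias: str                 # readable name used when querying statistics
    pattern: str               # phrase (may contain spaces) or regular expression
    is_regex: bool = False     # False: treat the pattern as a literal phrase
    dimensions: List[str] = field(default_factory=list)  # extra grouping fields

    def compiled(self) -> "re.Pattern[str]":
        """Compile the rule, escaping literal phrases so spaces and dots match as-is."""
        return re.compile(self.pattern if self.is_regex else re.escape(self.pattern))


def count_matches(rule: WordFrequencyRule,
                  records: Iterable[Dict[str, str]]) -> Counter:
    """Count pattern occurrences, grouped by the rule's dimension values."""
    regex = rule.compiled()
    counts: Counter = Counter()
    for record in records:
        hits = len(regex.findall(record.get("message", "")))
        if hits:
            key: Tuple[str, ...] = tuple(record.get(d, "") for d in rule.dimensions)
            counts[key] += hits
    return counts


records = [
    {"message": "connection refused by db", "pod": "a"},
    {"message": "retry: connection refused; connection refused", "pod": "b"},
    {"message": "request timed out after 300 ms", "pod": "a"},
]
phrase_rule = WordFrequencyRule("conn_refused", "connection refused", dimensions=["pod"])
regex_rule = WordFrequencyRule("timeout", r"timed out after \d+ ms", is_regex=True)
phrase_counts = count_matches(phrase_rule, records)
regex_counts = count_matches(regex_rule, records)
```

Escaping literal phrases keeps a phrase such as `connection refused` from being misread as a regular expression, while the alias gives each rule a stable name for later queries.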

步骤S20:根据所述日志词频分析策略进行应用日志词频分析,得到词频监控指标。Step S20: Performing application log word frequency analysis according to the log word frequency analysis strategy to obtain a word frequency monitoring index.

需要说明的是,分布式词频分析流处理系统,使用KafkaStream作为数据处理框架。该系统以最新配置分析目标应用的日志数据,并将统计维度相关字段和别名存储到Elasticsearch中,以便根据这些信息来圈定统计范围。根据分析的数据,生成词频监控指标数据。It should be noted that the distributed word frequency analysis stream processing system uses KafkaStream as the data processing framework. The system analyzes the log data of the target application with the latest configuration, and stores the relevant fields and aliases of the statistical dimensions in Elasticsearch so that the statistical scope can be defined based on this information. Based on the analyzed data, word frequency monitoring indicator data is generated.

进一步的,为了准确的得到词频监控指标,步骤S20包括:将所述应用服务日志信息输入到Kafka集群,并根据流处理应用程序库消费所述Kafka集群的集群应用日志,得到消费数据;根据所述日志词频分析策略确定日志词规则;根据所述消费数据和所述日志词规则生成日志词频监控指标。Furthermore, in order to accurately obtain the word frequency monitoring index, step S20 includes: inputting the application service log information into the Kafka cluster, and consuming the cluster application log of the Kafka cluster according to the stream processing application library to obtain consumption data; determining the log word rule according to the log word frequency analysis strategy; generating the log word frequency monitoring index according to the consumption data and the log word rule.

应理解的是,日志词频分析使用Kafka Stream进行处理。Kafka Stream是一个用于构建实时流处理应用程序的库,可以对流数据进行转换和聚合操作。在日志词频分析中,将应用日志作为输入流,结合应用日志词频分析策略的配置,通过Kafka Stream进行处理和分析,实现词频统计的功能,可以实时处理大规模的日志数据,提取关键词并计算词频。It should be understood that log word frequency analysis uses Kafka Stream for processing. Kafka Stream is a library for building real-time stream processing applications that can transform and aggregate stream data. In log word frequency analysis, application logs are used as input streams, combined with the configuration of application log word frequency analysis strategies, and processed and analyzed through Kafka Stream to implement word frequency statistics. It can process large-scale log data in real time, extract keywords and calculate word frequencies.

在具体实施中,日志词频分析引擎Kafka-Stream以Deployment方式部署在Kubernetes集群上,并指定volumeMounts挂载信息。In the specific implementation, the log word frequency analysis engine Kafka-Stream is deployed on the Kubernetes cluster in Deployment mode, and the volumeMounts mounting information is specified.

需要说明的是,配置文件通过ConfigMap形式挂载,包括db.properties和config.properties文件,其中指定了db、Kafka和elasticsearch集群的地址信息。It should be noted that the configuration files are mounted in the form of ConfigMap, including db.properties and config.properties files, which specify the address information of the db, Kafka, and elasticsearch clusters.

应理解的是,应用日志经过采集应用服务采集后,输出到Kafka集群中。It should be understood that the application logs are collected by the collection application service and then output to the Kafka cluster.

需要说明的是,日志词频分析引擎通过定时任务读取日志词频策略配置,并存储在内存中。It should be noted that the log word frequency analysis engine reads the log word frequency strategy configuration through a scheduled task and stores it in memory.

在具体实施中,日志词频分析引擎使用Kafka Stream技术,实时消费Kafka集群的消息,并和日志词频策略配置进行比较,提取出符合要求的日志数据,并存储到Elasticsearch集群中。In the specific implementation, the log word frequency analysis engine uses Kafka Stream technology to consume messages from the Kafka cluster in real time, compare them with the log word frequency strategy configuration, extract log data that meets the requirements, and store them in the Elasticsearch cluster.

It should be understood that the application log word frequency statistics service obtains each application's word frequency configuration and generates monitoring indicators. These indicators are later used for algorithm training and anomaly detection, and ultimately for healing application faults.

It should be noted that the monitoring indicators are output through the following process:
1) The word frequency statistics service starts a coroutine that periodically queries the component's word frequency configuration and obtains the application information.
2) The word frequency statistics service calls an interface to query the current total word frequency.
3) The word frequency statistics service calls an interface to query Elasticsearch for the word frequency increment over the last 5 minutes, computes the new total with the formula below, and then exposes a metrics service that outputs the total as a monitoring indicator.
Formula: total word frequency = current word frequency + word frequency increment.
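The formula in step 3 is a single addition, as the sketch below shows. The Prometheus-style text line and the metric name `log_word_frequency_total` are assumptions made for illustration; the source only states that a metrics service is exposed.

```python
def total_word_frequency(current_total: int, increment: int) -> int:
    """Total word frequency = current word frequency + word frequency increment."""
    return current_total + increment


def render_metric(app: str, keyword: str, total: int) -> str:
    """Render one metric line (hypothetical metric name and label set)."""
    return f'log_word_frequency_total{{app="{app}",keyword="{keyword}"}} {total}'


# Example: 120 occurrences so far, 5 more found in the last 5 minutes.
total = total_word_frequency(120, 5)
line = render_metric("order-service", "access timeout", total)
```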

Further, to generate the log word frequency monitoring indicators accurately, the step of generating log word frequency monitoring indicators according to the consumption data and the log word rules includes: comparing the consumption data with the log word rules; determining compliant log data according to the comparison result; storing the compliant log data in a target cluster; and generating the log word frequency monitoring indicators through the log word frequency statistics service and the cluster data of the target cluster.

It should be understood that application logs are first written to the Kafka cluster in real time, while the log analysis engine periodically reads the log word frequency configuration and keeps it in memory. The log analysis engine consumes the application logs from the Kafka cluster in real time with Kafka-Stream, compares the consumed data with the log word rules, extracts the log data that matches the rules, and stores it in the ES (Elasticsearch) cluster. Finally, the log word frequency monitoring indicators are generated from the ES data.

Step S30: Train a word frequency detection model according to the word frequency monitoring indicators.

In a specific implementation, historical fault records are also used to construct the training samples, on which the word frequency detection model is then trained.

Step S40: Perform anomaly detection with the word frequency detection model, and perform fault healing according to the detection result.

It should be noted that once the word frequency detection model has been trained, prediction samples are fed into it to find abnormal samples, and the corresponding fault healing action is then invoked automatically.

Further, to achieve automatic detection and fault healing, step S40 includes: inputting a prediction sample into the word frequency detection model to obtain the abnormal samples it outputs; determining a target fault healing action according to the abnormal sample; and performing fault healing by triggering the target fault healing action.

It should be understood that after training, both the autoencoder and the discriminator in the model can judge whether a prediction sample is abnormal, and intelligent fault healing of the application is performed based on the anomaly detection result.

In a specific implementation, after training, both the autoencoder and the discriminator can judge whether a prediction sample is abnormal. A normal sample yields a small reconstruction error from the autoencoder, whereas a large error indicates an abnormal sample. Likewise, the encoder maps a normal sample to a feature vector that can fool the discriminator, so the discriminator judges it as real, whereas an abnormal sample is judged as fake. In the prediction phase, the prediction sample p is therefore input into the model; the anomaly score s1 is computed from the autoencoder output D(E(p)), and the anomaly score s2 from the adversarial network output G(E(p)). The two scores are then combined by a weighted average into the final anomaly score s. If s exceeds a given threshold, the prediction sample is judged abnormal and fault healing is triggered; otherwise it is not.

s1 = mse(p, D(E(p)))

s2 = log(1 - G(E(p)))

s = α*s1 + (1-α)*s2
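As a minimal numeric sketch of the three formulas above, the function below takes the reconstruction D(E(p)) and the discriminator output G(E(p)) as plain inputs instead of running a trained network. The default α = 0.5 and the small clamp `eps` that keeps the logarithm finite are assumptions, not values given in the source.

```python
import math


def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def anomaly_score(p, reconstruction, disc_output, alpha=0.5, eps=1e-12):
    """s = alpha * s1 + (1 - alpha) * s2, with s1 and s2 as defined above.

    p              -- prediction sample (sequence of numbers)
    reconstruction -- D(E(p)), the autoencoder output for p
    disc_output    -- G(E(p)), the discriminator output in [0, 1]
    alpha, eps     -- assumed values, chosen only for illustration
    """
    s1 = mse(p, reconstruction)
    g = min(max(disc_output, 0.0), 1.0 - eps)  # avoid log(0)
    s2 = math.log(1.0 - g)
    return alpha * s1 + (1.0 - alpha) * s2


def is_anomalous(score, threshold):
    """Fault healing is triggered only when the score exceeds the threshold."""
    return score > threshold


# Well reconstructed and judged real: low score.
normal = anomaly_score([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], 0.9)
# Poorly reconstructed and judged fake: high score.
abnormal = anomaly_score([1.0, 2.0, 3.0], [3.0, 0.0, 5.0], 0.1)
```

Note that s2 is never positive: it approaches 0 when the discriminator judges the sample fake and becomes strongly negative when it judges the sample real, so both terms push abnormal samples toward a higher s.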

It should be noted that the system presets several fault healing actions and associates each with a corresponding algorithm model; healing actions are triggered according to the anomaly detection result. For example, an algorithm model is built for keywords related to "access timeout" in the log word frequency. Anomaly detection on the sample data then predicts the future point in time at which the application will hit an "access timeout", and the application's fault healing action is triggered before the fault occurs.
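One way to picture the association between models and preset healing actions is a lookup table keyed by the keyword each model was built for. The sketch below is only an illustration: the action names (`scale_out`, `restart_instance`), the "out of memory" keyword, and the registry layout are hypothetical, since the source does not say which actions are preset.

```python
from typing import Callable, Dict, Optional

HealingAction = Callable[[str], str]


def restart_instance(app: str) -> str:
    """Hypothetical healing action: request a restart of the application."""
    return f"restart requested for {app}"


def scale_out(app: str) -> str:
    """Hypothetical healing action: request additional instances."""
    return f"scale-out requested for {app}"


# Hypothetical registry: keyword of the associated model -> preset action.
ACTIONS: Dict[str, HealingAction] = {
    "access timeout": scale_out,
    "out of memory": restart_instance,
}


def trigger_healing(app: str, keyword: str, score: float,
                    threshold: float) -> Optional[str]:
    """Invoke the preset action only when the anomaly score exceeds the threshold."""
    if score <= threshold:
        return None
    action = ACTIONS.get(keyword)
    return action(app) if action is not None else None


result = trigger_healing("order-service", "access timeout", score=1.9, threshold=1.0)
```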

This embodiment collects application service log information from application log files; configures a log word frequency analysis strategy according to that information; performs application log word frequency analysis according to the strategy to obtain word frequency monitoring indicators; trains a word frequency detection model on those indicators; and performs anomaly detection with the model and fault healing according to the detection result. This provides centralized log management and monitoring: the log data of all applications can be aggregated and displayed to identify potential problems and anomalies and to heal faults in a targeted way. Anomalies and faults are detected promptly and can be responded to and handled quickly. Application log word frequency analysis combined with trained algorithm models automates fault healing, reducing the need for manual intervention, improving operations efficiency and accuracy, and improving system performance and user experience.

Referring to FIG. 3, FIG. 3 is a schematic flowchart of the second embodiment of the intelligent fault healing method of the present invention.

Based on the first embodiment above, in this embodiment step S30 includes:

Step S301: Obtain historical fault records, and generate training samples according to the historical fault records and the word frequency monitoring indicators.

It should be noted that, based on the application log word frequency monitoring indicators and the application's fault data points, deep learning is used to train the algorithm and build a cascaded model of an autoencoder and a generative adversarial network.

Further, to obtain the training samples accurately, step S301 includes: determining the original indicator time series according to the word frequency monitoring indicators; splitting the original indicator time series into multiple sample windows with a sliding window; determining the abnormal windows according to the historical fault records; determining the normal windows according to the sample windows and the abnormal windows; and determining the training samples according to the normal windows.

It should be understood that the real historical fault points are marked based on the historical fault records. Anomaly detection generally tolerates a delay within a controllable range, so the original indicator time series is cut into many small window samples with a sliding window whose size is chosen, according to the actual situation, within that range. Windows containing a real historical fault point are marked as abnormal windows, and all normal windows are combined into the training sample set X = {X1, X2, ..., Xn}.
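The windowing and labelling described above can be sketched as follows. The sketch assumes that fault points are given as indices into the indicator time series and that the window step is 1; the source fixes neither, and the function names are hypothetical.

```python
def sliding_windows(series, size, step=1):
    """Return (start_index, window) pairs covering the series."""
    return [(i, series[i:i + size])
            for i in range(0, len(series) - size + 1, step)]


def normal_windows(series, fault_indices, size, step=1):
    """Keep only the windows that contain no historical fault point."""
    faults = set(fault_indices)
    return [window
            for start, window in sliding_windows(series, size, step)
            if not any(start <= f < start + size for f in faults)]


# Indicator values sampled at regular intervals; index 3 is a recorded fault.
series = [1, 2, 3, 4, 5, 6]
training_samples = normal_windows(series, fault_indices=[3], size=2)
```

With a window size of 2, the two windows that cover index 3 are marked abnormal and dropped, leaving `[[1, 2], [2, 3], [5, 6]]` as the training samples.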

Step S302: Train the word frequency detection model on the training samples.

In a specific implementation, in the first step the training samples X are input into the model, and the encoder extracts from each sample x a feature vector E(x), which is fed to both the decoder and the discriminator. In the second step, the decoder reconstructs the input from the feature vector and outputs D(E(x)); comparing this with the original input gives the reconstruction loss LOSS1, which is used to update the encoder and decoder parameters. The feature vector from the first step is also fed to the discriminator, producing G(E(x)), and a vector z sampled from a Gaussian mixture distribution is fed to the discriminator, producing G(z). From these two outputs, LOSS2 is computed to update the discriminator parameters, and LOSS3 is then computed to update the encoder parameters.

where:

LOSS1 = mse(x, D(E(x)))

LOSS2 = -log G(z) - log(1 - G(E(x)))

LOSS3 = log(1 - G(E(x)))
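The three losses can be evaluated numerically as in the sketch below, which treats the network outputs D(E(x)), G(E(x)), and G(z) as given numbers rather than training a model. The clamp `eps` that keeps the logarithms finite is an assumption and not part of the source.

```python
import math


def _clamp(v, eps=1e-12):
    """Keep a probability strictly inside (0, 1) so log() stays finite (assumed)."""
    return min(max(v, eps), 1.0 - eps)


def loss1(x, reconstruction):
    """LOSS1 = mse(x, D(E(x))): updates the encoder and the decoder."""
    return sum((a - b) ** 2 for a, b in zip(x, reconstruction)) / len(x)


def loss2(g_z, g_ex):
    """LOSS2 = -log G(z) - log(1 - G(E(x))): updates the discriminator."""
    return -math.log(_clamp(g_z)) - math.log(1.0 - _clamp(g_ex))


def loss3(g_ex):
    """LOSS3 = log(1 - G(E(x))): updates the encoder."""
    return math.log(1.0 - _clamp(g_ex))


# Discriminator rates the prior sample z at 0.9 and the encoded sample at 0.1.
l1 = loss1([1.0, 2.0], [1.0, 2.0])
l2 = loss2(0.9, 0.1)
l3 = loss3(0.1)
```

LOSS2 falls as the discriminator rates samples from the prior as real and encoded samples as fake, while LOSS3 falls as the encoder gets its feature vectors rated as real, which is the usual adversarial tension between the two updates.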

This embodiment obtains historical fault records, generates training samples according to the historical fault records and the word frequency monitoring indicators, and trains the word frequency detection model on those samples. In this way, application log word frequency analysis combined with trained algorithm models automates fault healing.

In addition, an embodiment of the present invention further provides a storage medium storing an intelligent fault healing program which, when executed by a processor, implements the steps of the intelligent fault healing method described above.

In addition, an embodiment of the present invention further provides a computer program product comprising an intelligent fault healing program which, when executed by a processor, implements the steps of the intelligent fault healing method described above.

The specific implementation of the computer program product of the present invention is essentially the same as the embodiments of the intelligent fault healing method above and is not repeated here.

Referring to FIG. 4, FIG. 4 is a structural block diagram of the first embodiment of the intelligent fault healing device of the present invention.

As shown in FIG. 4, the intelligent fault healing device provided by the embodiment of the present invention includes:

A strategy configuration module 10, configured to configure a log word frequency analysis strategy according to the application service log information collected from application log files.

An indicator analysis module 20, configured to perform application log word frequency analysis according to the log word frequency analysis strategy to obtain word frequency monitoring indicators.

A model training module 30, configured to train a word frequency detection model according to the word frequency monitoring indicators.

A fault healing module 40, configured to perform anomaly detection with the word frequency detection model and to perform fault healing according to the detection result.

In this embodiment, application service log information is collected from application log files; a log word frequency analysis strategy is configured according to that information; application log word frequency analysis is performed according to the strategy to obtain word frequency monitoring indicators; a word frequency detection model is trained on those indicators; and anomaly detection is performed with the model, with fault healing performed according to the detection result. This provides centralized log management and monitoring: the log data of all applications can be aggregated and displayed to identify potential problems and anomalies and to heal faults in a targeted way. Anomalies and faults are detected promptly and can be responded to and handled quickly. Application log word frequency analysis combined with trained algorithm models automates fault healing, reducing the need for manual intervention, improving operations efficiency and accuracy, and improving system performance and user experience.

In one embodiment, the indicator analysis module 20 is further configured to input the application service log information into a Kafka cluster and consume the cluster application logs of the Kafka cluster with a stream processing application library to obtain consumption data; determine log word rules according to the log word frequency analysis strategy; and generate log word frequency monitoring indicators according to the consumption data and the log word rules.

In one embodiment, the indicator analysis module 20 is further configured to compare the consumption data with the log word rules; determine compliant log data according to the comparison result; store the compliant log data in a target cluster; and generate the log word frequency monitoring indicators through the log word frequency statistics service and the cluster data of the target cluster.

In one embodiment, the model training module 30 is further configured to obtain historical fault records; generate training samples according to the historical fault records and the word frequency monitoring indicators; and train the word frequency detection model on the training samples.

In one embodiment, the model training module 30 is further configured to determine the original indicator time series according to the word frequency monitoring indicators; split the original indicator time series into multiple sample windows with a sliding window; determine the abnormal windows according to the historical fault records; determine the normal windows according to the sample windows and the abnormal windows; and determine the training samples according to the normal windows.

In one embodiment, the fault healing module 40 is further configured to input a prediction sample into the word frequency detection model to obtain the abnormal samples it outputs; determine a target fault healing action according to the abnormal sample; and perform fault healing by triggering the target fault healing action.

For other embodiments or specific implementations of the intelligent fault healing device of the present invention, refer to the method embodiments above; they are not repeated here.

It should be noted that, in this document, the terms "include", "comprise", and any other variants are intended to cover non-exclusive inclusion, so that a process, method, article, or system that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to that process, method, article, or system. Without further limitation, an element introduced by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or system that includes it.

The serial numbers of the embodiments of the present invention above are for description only and do not indicate the relative merits of the embodiments.

From the description of the implementations above, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. On this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied as a software product. The computer software product is stored in a storage medium (such as a read-only memory/random access memory, a magnetic disk, or an optical disc) and includes a number of instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the embodiments of the present invention.

The above are only preferred embodiments of the present invention and do not thereby limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included in the patent protection scope of the present invention.

Claims (10)

1. An intelligent fault healing method, characterized in that the intelligent fault healing method comprises:
configuring a log word frequency analysis strategy according to application service log information of a collected application log file;
performing application log word frequency analysis according to the log word frequency analysis strategy to obtain a word frequency monitoring indicator;
training a word frequency detection model according to the word frequency monitoring indicator;
and performing anomaly detection through the word frequency detection model, and performing fault healing according to the detection result.
2. The intelligent fault healing method according to claim 1, wherein performing application log word frequency analysis according to the log word frequency analysis strategy to obtain the word frequency monitoring indicator comprises:
inputting the application service log information into a Kafka cluster, and consuming the cluster application log of the Kafka cluster with a stream processing application library to obtain consumption data;
determining a log word rule according to the log word frequency analysis strategy;
and generating a log word frequency monitoring indicator according to the consumption data and the log word rule.
3. The intelligent fault healing method according to claim 2, wherein generating the log word frequency monitoring indicator according to the consumption data and the log word rule comprises:
comparing the consumption data with the log word rule;
determining compliant log data according to the comparison result;
and storing the compliant log data in a target cluster, and generating the log word frequency monitoring indicator through a log word frequency statistics service and cluster data of the target cluster.
4. The intelligent fault healing method according to claim 1, wherein training the word frequency detection model according to the word frequency monitoring indicator comprises:
acquiring a historical fault record, and generating a training sample according to the historical fault record and the word frequency monitoring indicator;
and training the word frequency detection model according to the training sample.
5. The intelligent fault healing method according to claim 4, wherein generating the training sample according to the historical fault record and the word frequency monitoring indicator comprises:
determining an original indicator data time series according to the word frequency monitoring indicator;
dividing the original indicator data time series into a plurality of sample windows in a sliding window manner;
determining an abnormal window according to the historical fault record;
determining a normal window according to the sample windows and the abnormal window;
and determining the training sample according to the normal window.
6. The intelligent fault healing method according to claim 1, wherein performing anomaly detection through the word frequency detection model and performing fault healing according to the detection result comprises:
inputting a prediction sample into the word frequency detection model to obtain an output abnormal sample;
and determining a target fault healing action according to the abnormal sample, and performing fault healing by triggering the target fault healing action.
7. An intelligent fault healing device, characterized in that the intelligent fault healing device comprises:
a strategy configuration module, configured to configure a log word frequency analysis strategy according to application service log information of a collected application log file;
an indicator analysis module, configured to perform application log word frequency analysis according to the log word frequency analysis strategy to obtain a word frequency monitoring indicator;
a model training module, configured to train a word frequency detection model according to the word frequency monitoring indicator;
and a fault healing module, configured to perform anomaly detection through the word frequency detection model and perform fault healing according to the detection result.
8. Intelligent fault healing equipment, the equipment comprising: a memory, a processor, and an intelligent fault healing program stored on the memory and executable on the processor, the intelligent fault healing program being configured to implement the steps of the intelligent fault healing method of any one of claims 1 to 6.
9. A storage medium having stored thereon an intelligent fault healing program which, when executed by a processor, implements the steps of the intelligent fault healing method of any one of claims 1 to 6.
10. A computer program product comprising an intelligent fault healing program which, when executed by a processor, implements the steps of the intelligent fault healing method of any one of claims 1 to 6.
CN202410187094.1A 2024-02-19 2024-02-19 Intelligent fault healing method, device, equipment, storage medium and product Pending CN118093311A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410187094.1A CN118093311A (en) 2024-02-19 2024-02-19 Intelligent fault healing method, device, equipment, storage medium and product


Publications (1)

Publication Number Publication Date
CN118093311A true CN118093311A (en) 2024-05-28

Family

ID=91162565

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410187094.1A Pending CN118093311A (en) 2024-02-19 2024-02-19 Intelligent fault healing method, device, equipment, storage medium and product

Country Status (1)

Country Link
CN (1) CN118093311A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination