CN108038049A - Real-time logs control system and control method, cloud computing system and server - Google Patents

Info

Publication number
CN108038049A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711333074.7A
Other languages
Chinese (zh)
Other versions
CN108038049B (en)
Inventor
裴庆祺
赵伟伟
王磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3452 Performance evaluation by statistical analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3466 Performance evaluation by tracing or monitoring
    • G06F11/3476 Data logging

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention belongs to the technical field of cloud computing and discloses a real-time log control system and control method, a cloud computing system and a server. Log events are analyzed; error information is classified, filtered and aggregated and then extracted into sequences. A fault model is trained, the probability that a sequence is a fault sequence and the probability that it is a non-fault sequence are calculated, and Bayesian classification theory is applied to obtain the result and make a prediction. Compared with extensive rule matching, this improves the speed of judgment. Fault prediction research is of great significance for reducing the burden of network management and maintenance and for reducing the losses caused by network faults.

Description

Real-time log control system and control method, cloud computing system and server

Technical field

The invention belongs to the technical field of cloud computing, and in particular relates to a real-time log control system and control method, a cloud computing system and a server.

Background

With the rapid development of computer technology, cloud computing has become one of the most important fields of computing, and cloud computing services have reached into everyone's life and work. By computing over real-time data, machine learning algorithms can predict in advance the faults that may occur in a cloud computing system and so reserve time for fault response, while the processing capacity of the cluster can be scaled out elastically to keep up with growing data volumes and user demand. Real-time processing of massive log data, and mining that data for system state and fault prediction, is a promising direction for development and application.

In summary, the problems of the prior art are as follows. First, existing fault prediction models mostly assume by default that the state duration follows an exponential distribution, whereas in practice the change of the fault state probability does not follow an exponential distribution. Second, they discretize the probability of the fault state detection values, which can have unintended effects on experimental analysis in a big-data environment. The invention therefore treats both the state duration distribution and the state observation probability distribution as continuous, assuming a Weibull distribution; the improved prediction model can raise the probability values of diagnosis and prediction.

Summary of the invention

In view of the problems in the prior art, the invention provides a real-time log control system and control method, a cloud computing system and a server.

The invention is implemented as a real-time log control method that analyzes log events, classifies, filters and aggregates error information, extracts it into sequences, trains a fault model, calculates the probability that a sequence is a fault sequence and the probability that it is a non-fault sequence, and uses Bayesian classification theory to obtain the result and make a prediction.

Further, the real-time log control method specifically includes:

Step 1: collect the log file data on each node of the distributed system, and send newly generated log data to the collector in real time by means of incremental checking;

Step 2: delete redundant events by removing events of the same type reported at the same location within a given period, where a time threshold defines the time window used for event filtering; delete redundant events in the log by removing similar events reported from several different locations within a given period; and save the data stream to a time-series database. Similarity is judged with Sim(D1, D2):

$$\mathrm{Sim}(D_1, D_2) = \cos\theta = \frac{\sum_{k} W_{1k} W_{2k}}{\sqrt{\sum_{k} W_{1k}^{2}}\,\sqrt{\sum_{k} W_{2k}^{2}}}$$

where D1 and D2 denote two sequences and W1k and W2k denote the vector components of D1 and D2. The similarity is the cosine of the angle between the two vectors; the larger Sim(D1, D2) is, the more similar the two are;

Step 3: when each record is stored in the data table, use SQL statements to split the record into timestamp, process ID, log level, process module, delimiter and message;

Step 4: use SQL statements to persist the processed, standard-formatted data;

Step 5: extract the log fault sequences;

Step 6: compute the clustering criterion using the sequence likelihood as the metric, and group fault-related events with a hierarchical clustering algorithm, where:

S = [s_i] denotes a state sequence of length L, and b_{s_i}(o_i) is the probability matrix of the observations in state s_i(k) under the initial state probability vector π = [π_i];

Step 7: combine the improved HSMM with a Bayesian network (BayesNet) to predict faults from real-time log data;

A standard HSMM is described by the state transition probability matrix G(t) = [g_ij(t)] and by the probability matrix B = b_i(k) of the observations in state s_i(k) under the initial state probability vector π = [π_i]. The first improvement makes the state duration probability distribution continuous: the state duration is treated as a continuous variable and assumed to follow a Weibull distribution, so that the state duration probability distribution f_i(l) of a state is:

$$f_i(l) = \alpha\beta\,(\alpha l)^{\beta-1}\, e^{-(\alpha l)^{\beta}}$$

where α and β are the scale parameter and the shape parameter of the Weibull distribution, respectively;

The second improvement makes the probability distribution of the state monitoring values continuous. It is likewise assumed to follow a Weibull distribution, with state detection value probability distribution function ξ_i(θ), where α_i and β_i are the Weibull parameters of each state stage. Together these changes define the improved HSMM model;

Step 8: train the fault model and the non-fault model, each with its own parameters. The goal is to evaluate whether a given observation sequence O = [o_1, o_2, ..., o_l] is a fault-related sequence: the sequence likelihood is computed under each model, and the sequence is then classified as fault-free or fault-prone using Bayesian decision theory;

Step 9: predict the fault result:

When a sequence is marked as a fault-related event sequence, the system issues a fault prediction. The decision takes into account the cost of wrongly judging a fault-related sequence to be fault-unrelated, the fault probability P(F), and the logarithm of the sequence likelihood.
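Steps 8 and 9 can be illustrated with a minimal sketch of a cost-sensitive Bayes decision on two log-likelihoods. The function name, the zero cost for correct decisions, and the false-alarm cost are assumptions made for illustration; the text above only names the cost of a missed fault, the prior P(F) and the log-likelihood.

```python
import math


def is_fault_related(loglik_fault: float,
                     loglik_nonfault: float,
                     prior_fault: float,
                     cost_missed_fault: float = 1.0,
                     cost_false_alarm: float = 1.0) -> bool:
    """Return True when predicting a fault has the lower expected cost.

    Assumes correct decisions cost nothing (an illustrative choice), so a
    fault is predicted when
        c_miss * P(F) * L_F  >  c_alarm * P(not F) * L_N,
    which is evaluated in log space for numerical stability.
    """
    if not 0.0 < prior_fault < 1.0:
        raise ValueError("prior_fault must lie strictly between 0 and 1")
    if cost_missed_fault <= 0 or cost_false_alarm <= 0:
        raise ValueError("costs must be positive")
    threshold = (math.log(cost_false_alarm / cost_missed_fault)
                 + math.log((1.0 - prior_fault) / prior_fault))
    return loglik_fault - loglik_nonfault > threshold
```

Raising `cost_missed_fault` lowers the threshold, so rare faults can still trigger a warning even when the prior P(F) is small.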

Further, extracting the log fault sequences specifically includes:

First, extract the error event sequence: use SQL statements to extract the ERROR-level records according to log level, keeping the timestamp and the text message;

Second, merge similar error events: apply the Levenshtein edit distance algorithm to the event sequence and merge error events that are highly similar; the minimum edit distance contains the minimum edit distances of its subproblems:

$$d[i,j] = \min\bigl(d[i-1,j] + 1,\; d[i,j-1] + 1,\; d[i-1,j-1] + c(x_i, y_j)\bigr), \qquad c(x_i, y_j) = \begin{cases} 0, & x_i = y_j \\ 1, & x_i \neq y_j \end{cases}$$

where d[i-1,j]+1 corresponds to inserting a letter into the target log and d[i,j-1]+1 to deleting a letter from the matched log; when x_i = y_j no modification is needed, so the cost equals that of the previous step d[i-1,j-1], and otherwise it is d[i-1,j-1]+1; d[i,j] is the smallest of the three;

Third, classify the error events: after the merge in the previous step, group similar error events by the keywords in their text messages, assign each group an ID, and store the result in the database;

Fourth, extract the sequences: in chronological order, extract the events within a period before a fault occurs and set them as a fault-related event sequence; a fault lead time is defined, and the current fault event is the related fault event. A non-fault-related event sequence is a sequence of events within a time interval in which the system did not fail.

Another object of the invention is to provide a real-time log control system implementing the real-time log control method. The real-time log control system includes a log information processing module and a log fault analysis module.

Further, the log information processing module includes:

a log information collection unit, which collects the log file data on each node of the distributed system; the collection function should allow the log files to be monitored to be customized, and sends newly generated log data to the collector in real time by incremental checking;

a log information filtering unit, which removes redundancy from the data and filters it;

a log information standard formatting unit, which formats the processed log information into a standard data format;

a log storage unit, which persists the processed, standard-formatted data.

Further, the log fault analysis module includes:

a log event sequence extraction unit;

a fault-related event clustering unit, which uses events to train a small hidden semi-Markov model in advance and computes the sequence likelihood;

a fault prediction unit, which uses the hidden semi-Markov model and Bayesian decision theory to determine whether a sequence is a fault-related sequence;

a fault result judgment and output unit, which, when a sequence is judged to be fault-related, has the system emit a fault warning stream and output a state fault warning.

The log event sequence extraction unit further includes:

an error event record extraction unit, which extracts the ERROR-level records according to log level, keeping the timestamp, process module and text message;

a similar error event merging unit, which applies the Levenshtein edit distance algorithm to the error event sequence and merges highly similar error events;

an error event classification unit, which applies the Levenshtein edit distance algorithm to the event sequence, groups similar error events and assigns IDs;

a fault-related sequence extraction unit, which extracts, in chronological order, the events within a period before a fault and sets them as fault precursor events.

Another object of the invention is to provide a cloud computing system using the real-time log control method.

Current fault prediction research falls into three main classes of method: fault detection models based on log frequency, fault detection models based on message frequency, and fault detection models based on state transitions.

The invention collects log information in real time while the system is running and clusters it. By analyzing event logs with machine learning algorithms and models, it predicts faults the system may suffer in the future, so that faults can be investigated and located ahead of time during operation, which improves operation and maintenance efficiency and helps prevent emergency failures. By analyzing log events, the invention classifies, filters and aggregates all error information, extracts it into sequences, trains a fault model, calculates the probability that a sequence is a fault sequence and the probability that it is a non-fault sequence, and uses Bayesian classification theory to obtain the result and make a prediction.

The effectiveness of the method is judged mainly by three parameters: precision, recall and the F-measure. Precision is the proportion of all predictions that are correct, recall is the proportion of all faults that are correctly predicted, and the F-measure is a combined measure of precision and recall;

The prediction cases are shown in Table 1:

Predicted result \ Actual result | System fault        | System normal
System fault                     | True Positive (TP)  | False Positive (FP)
System normal                    | False Negative (FN) | True Negative (TN)

Table 1 Prediction cases

The prediction effectiveness parameters are given in Table 2:

Table 2 Effectiveness parameter expressions
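As a hedged illustration of the three effectiveness parameters, the sketch below computes them from the Table 1 counts, assuming Table 2 uses the standard definitions of precision, recall and the balanced F-measure. The function name and the convention of returning 0.0 for an empty denominator are illustrative choices.

```python
from typing import Tuple


def prediction_metrics(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, f_measure) from confusion-matrix counts.

    precision = TP / (TP + FP)
    recall    = TP / (TP + FN)
    F-measure = 2 * precision * recall / (precision + recall)
    Any ratio with a zero denominator is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f_measure = 2.0 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```

True negatives do not enter any of the three values, which is why they remain informative even when faults are rare.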

System experiments lead to the conclusion that the present system achieves higher precision than the unimproved version.

Description of drawings

Fig. 1 is a schematic structural diagram of the real-time log control system provided by an embodiment of the invention;

In the figure: 1. log information processing module; 1-1. log information collection unit; 1-2. log information filtering unit; 1-3. log information standard formatting unit; 1-4. log storage unit; 2. log fault analysis module; 2-1. log event sequence extraction unit; 2-1-1. error event record extraction unit; 2-1-2. similar error event merging unit; 2-1-3. error event classification unit; 2-1-4. fault-related sequence extraction unit; 2-2. fault-related event clustering unit; 2-3. fault prediction unit; 2-4. fault result judgment and output unit.

Fig. 2 is a flowchart of the real-time log control method provided by an embodiment of the invention.

Fig. 3 is an implementation flowchart of the real-time log control method provided by an embodiment of the invention.

Fig. 4 is a schematic diagram of fault sequence extraction provided by an embodiment of the invention.

Detailed description

To make the objects, technical solutions and advantages of the invention clearer, the invention is described in further detail below with reference to embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it.

The application principle of the invention is described in detail below with reference to the accompanying drawings.

As shown in Fig. 1, the real-time log control system provided by an embodiment of the invention includes a log information processing module 1 and a log fault analysis module 2.

The log information processing module 1 includes:

Log information collection unit 1-1: collects the log file data on each node of the distributed system. The collection function should allow the log files to be monitored to be customized, and sends newly generated log data to the collector in real time by incremental checking.

Log information filtering unit 1-2: removes redundancy from the data and filters it.

Log information standard formatting unit 1-3: formats the processed log information into a standard data format, for example as timestamp, process ID, log level, process module, delimiter and message. Log levels fall into several categories: ERROR, WARNING, TRACE, INFO, DEBUG, CRITICAL and AUDIT. The earlier a level appears in this list, the higher it ranks, and a higher rank indicates a more important event.

Log storage unit 1-4: persists the processed, standard-formatted data so that it can be extracted and analyzed later.

The log fault analysis module 2 includes:

Log event sequence extraction unit 2-1;

Fault-related event clustering unit 2-2: uses events to train a small hidden semi-Markov model (HSMM) in advance and computes the sequence likelihood, that is, the likelihood that the trained model generates the given observation sequence;

Fault prediction unit 2-3: uses the hidden semi-Markov model and Bayesian decision theory to determine whether a sequence is a fault-related sequence;

Fault result judgment and output unit 2-4: when a sequence is judged to be fault-related, the system emits a fault warning stream and outputs a state fault warning.

The log event sequence extraction unit 2-1 further includes:

Error event record extraction unit 2-1-1: extracts the ERROR-level records according to log level, keeping information such as the timestamp, process module and text message;

Similar error event merging unit 2-1-2: applies the Levenshtein edit distance algorithm to the error event sequence and merges highly similar error events;

Error event classification unit 2-1-3: applies the Levenshtein edit distance algorithm to the event sequence, groups similar error events and assigns IDs;

Fault-related sequence extraction unit 2-1-4: extracts, in chronological order, the events within a period before a fault and sets them as fault precursor events.

As shown in Fig. 2, the real-time log control method provided by an embodiment of the invention includes the following steps:

S201: analyze the log events, classify, filter and aggregate all error information, and extract it into sequences;

S202: train the fault model, calculate the probability that a sequence is a fault sequence and the probability that it is a non-fault sequence, and use Bayesian classification theory to obtain the result and make a prediction.

The application principle of the invention is further described below with reference to the accompanying drawings.

Compared with extensive rule matching on fault keywords, the invention uses an improved HSMM (hidden semi-Markov model) and Bayes decision theory to compute directly the probability that an error sequence is a fault sequence, which improves the speed of judgment.

As shown in Fig. 3, the specific steps of the real-time log control method provided by an embodiment of the invention are as follows:

1. Log information processing

Step 1, log information collection

The system should be able to collect the log file data on each node of the distributed system. The collection function should allow the log files to be monitored to be customized and, by incremental checking, send newly generated log data to the collector in real time.
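Incremental checking can be sketched as remembering a byte offset per monitored file and reading only what has been appended since the last poll. This is a minimal illustration under assumed names (`IncrementalLogReader`, `poll`); the transport to the collector and the polling schedule are left out.

```python
from pathlib import Path
from typing import Dict, List


class IncrementalLogReader:
    """Returns only the complete lines appended to a file since the last poll."""

    def __init__(self) -> None:
        # Byte offset already consumed, keyed by file path.
        self._offsets: Dict[str, int] = {}

    def poll(self, path: str) -> List[str]:
        file = Path(path)
        offset = self._offsets.get(path, 0)
        if file.stat().st_size < offset:
            offset = 0  # file was truncated or rotated, so start over
        with file.open("rb") as handle:
            handle.seek(offset)
            data = handle.read()
        # Consume up to the last newline; a trailing partial line waits
        # for the next poll (rfind returns -1 when there is no newline,
        # which makes the slice empty).
        complete = data[: data.rfind(b"\n") + 1]
        self._offsets[path] = offset + len(complete)
        return complete.decode("utf-8", errors="replace").splitlines()
```

Each call to `poll` would then forward the returned lines to the collector; holding back partial lines avoids shipping a record that is still being written.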

Step 2, log information filtering

There are two methods: temporal filtering and spatial filtering. When the system detects an anomaly, it keeps emitting a stream of warning messages until the fault actually occurs. Likewise, once a fault has occurred, the fault message may appear in the log many times before the problem is resolved.

Temporal filtering deletes redundant events by removing events of the same type reported at the same location within a given period, where a time threshold defines the time window used for event filtering. Spatial filtering deletes redundant events in the log by removing similar events reported from several different locations within a given period, and saves the data stream to a time-series database, which saves space and improves efficiency. Similarity is usually judged with Sim(D1, D2):

$$\mathrm{Sim}(D_1, D_2) = \cos\theta = \frac{\sum_{k} W_{1k} W_{2k}}{\sqrt{\sum_{k} W_{1k}^{2}}\,\sqrt{\sum_{k} W_{2k}^{2}}}$$

where D1 and D2 denote two sequences and W1k and W2k denote the vector components of D1 and D2. The similarity is the cosine of the angle between the two vectors; the larger Sim(D1, D2) is, the more similar the two are.
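The two filters can be sketched as follows. The `LogEvent` fields, the whitespace tokenization and the raw term-frequency weights are assumptions made for illustration; the text does not specify how the vector components W1k and W2k are computed.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class LogEvent:
    timestamp: float  # seconds
    location: str     # node or component that reported the event
    event_type: str
    message: str = ""


def temporal_filter(events: Iterable[LogEvent], window: float) -> List[LogEvent]:
    """Drop an event when the same (location, type) was seen within `window` seconds."""
    last_seen: Dict[Tuple[str, str], float] = {}
    kept: List[LogEvent] = []
    for event in sorted(events, key=lambda e: e.timestamp):
        key = (event.location, event.event_type)
        previous = last_seen.get(key)
        if previous is None or event.timestamp - previous > window:
            kept.append(event)
        # Updated on every event, so a continuous burst stays suppressed.
        last_seen[key] = event.timestamp
    return kept


def cosine_similarity(text1: str, text2: str) -> float:
    """Cosine of the angle between term-frequency vectors of two messages."""
    d1, d2 = Counter(text1.lower().split()), Counter(text2.lower().split())
    dot = sum(weight * d2.get(term, 0) for term, weight in d1.items())
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # an empty message is treated as dissimilar
    return dot / (norm1 * norm2)
```

A spatial filter could then drop an event when `cosine_similarity` with a recently kept event from another location exceeds a chosen threshold; that threshold is a tuning parameter not given in the text.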

Step 3, log format standardization.

When each record is stored in the data table, SQL statements split the record into timestamp, process ID, log level, process module, delimiter, message and so on.
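The field split can be illustrated with a small parser. This sketch swaps in a Python regular expression in place of the SQL statements mentioned above, and the line layout it expects (`timestamp pid LEVEL module [separator] message`) is a hypothetical format chosen for the example.

```python
import re
from typing import Dict, Optional

# Hypothetical layout: "2017-12-13 10:15:30.123 4321 ERROR compute.manager [-] text"
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}(?:\.\d+)?)\s+"
    r"(?P<pid>\d+)\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<module>\S+)\s+"
    r"(?P<separator>\[[^\]]*\])\s+"
    r"(?P<message>.*)$"
)


def parse_log_line(line: str) -> Optional[Dict[str, str]]:
    """Split one raw log line into named fields, or return None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None
```

Lines that do not match return `None`, so the caller can decide whether to discard them or store them unparsed.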

Step 4, log storage.

SQL statements persist the processed, standard-formatted data so that it can be extracted and analyzed later.

2. Log fault analysis:

A probabilistic causal relationship is established between fault symptoms and system states. The hidden semi-Markov model and the Bayesian network are trained using the prior probabilities of fault occurrence. During diagnosis, the posterior probabilities of the various system states given the fault symptoms are solved from the prior probabilities, which expresses the joint probability distribution of the variables intuitively and also gives the probability that each feature causes a fault.

Step 1, extract the log fault sequences.

First, extract the error event sequence: use SQL statements to extract the ERROR-level records according to log level, keeping information such as the timestamp and the text message;
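A minimal sketch of this extraction with Python's built-in `sqlite3` module follows. The table name `log_records` and its columns are assumptions for illustration; the text does not name the database or its schema.

```python
import sqlite3
from typing import List, Tuple


def create_schema(conn: sqlite3.Connection) -> None:
    """Create an illustrative table for standard-formatted log records."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS log_records (
            timestamp TEXT NOT NULL,   -- ISO-like text sorts chronologically
            pid       INTEGER,
            level     TEXT NOT NULL,
            module    TEXT,
            message   TEXT
        )
        """
    )


def extract_error_events(conn: sqlite3.Connection) -> List[Tuple[str, str]]:
    """Return (timestamp, message) for every ERROR-level record, oldest first."""
    cursor = conn.execute(
        "SELECT timestamp, message FROM log_records "
        "WHERE level = ? ORDER BY timestamp",
        ("ERROR",),
    )
    return cursor.fetchall()
```

Passing the level as a bound parameter keeps the query reusable for other levels without string concatenation.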

Second, merge similar error events: apply the Levenshtein edit distance algorithm to the event sequence from the previous step and merge error events that are highly similar;

The algorithm uses dynamic programming. The problem has optimal substructure: the minimum edit distance contains the minimum edit distances of its subproblems;

$$d[i,j] = \min\bigl(d[i-1,j] + 1,\; d[i,j-1] + 1,\; d[i-1,j-1] + c(x_i, y_j)\bigr), \qquad c(x_i, y_j) = \begin{cases} 0, & x_i = y_j \\ 1, & x_i \neq y_j \end{cases}$$

where d[i-1,j]+1 corresponds to inserting a letter into the target log and d[i,j-1]+1 to deleting a letter from the matched log; when x_i = y_j no modification is needed, so the cost equals that of the previous step d[i-1,j-1], and otherwise it is d[i-1,j-1]+1; d[i,j] is the smallest of the three;
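The recurrence described above can be written as a short dynamic program that keeps only two rows of the table. The `similarity` helper, which normalizes the distance by the longer message length, is an illustrative assumption for deciding when two error messages are similar enough to merge.

```python
def levenshtein(x: str, y: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(y) + 1))  # row d[0, *]
    for i, xi in enumerate(x, start=1):
        current = [i]  # d[i, 0]
        for j, yj in enumerate(y, start=1):
            cost = 0 if xi == yj else 1
            current.append(min(
                previous[j] + 1,         # d[i-1, j] + 1
                current[j - 1] + 1,      # d[i, j-1] + 1
                previous[j - 1] + cost,  # d[i-1, j-1] (+1 on mismatch)
            ))
        previous = current
    return previous[-1]


def similarity(x: str, y: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(x), len(y))
    return 1.0 if longest == 0 else 1.0 - levenshtein(x, y) / longest
```

Two messages could then be merged when `similarity` exceeds a chosen threshold; the threshold value is not given in the text.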

Third, classify the error events: after the merge in the previous step, group similar error events by the keywords in their text messages, assign each group an ID, and store the result in the database;

Fourth, extract the sequences: in chronological order, extract the events within a period before a fault occurs and set them as a fault-related event sequence; a fault lead time is defined, and the current fault event is the related fault event. A non-fault-related event sequence is a sequence of events within a time interval in which the system did not fail, as shown in Fig. 4:
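The windowing in the fourth step can be sketched as below. The parameter names `data_window` and `lead_time`, and the choice to end each window one lead time before the fault, are assumptions made for illustration; the corresponding symbols are not legible in the text.

```python
from bisect import bisect_left, bisect_right
from typing import List, Sequence, Tuple

Event = Tuple[float, int]  # (timestamp in seconds, error event ID)


def fault_related_sequences(events: Sequence[Event],
                            fault_times: Sequence[float],
                            data_window: float,
                            lead_time: float) -> List[List[Event]]:
    """For each fault, collect events in [t - lead_time - data_window, t - lead_time]."""
    ordered = sorted(events)
    timestamps = [timestamp for timestamp, _ in ordered]
    sequences: List[List[Event]] = []
    for fault_time in fault_times:
        end = fault_time - lead_time
        start = end - data_window
        low = bisect_left(timestamps, start)   # first event at or after start
        high = bisect_right(timestamps, end)   # one past the last event at or before end
        sequences.append(ordered[low:high])
    return sequences
```

Non-fault sequences could be drawn with the same function by passing reference times at which no fault occurred within the following lead time.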

步骤2,故障相关事件聚类。Step 2, clustering of fault-related events.

实际中,会有多种的故障相关事件序列可能导致同一种的系统故障,而这多种故障相关事件序列的特征是不同的,故需要进行聚类。In practice, there will be a variety of fault-related event sequences that may lead to the same system fault, and the characteristics of these various fault-related event sequences are different, so clustering is required.

聚类标准可根据序列的似然值作为度量值来计算,最后采用层次聚类算法实现故障相关事件分组,其中:The clustering criteria can be based on the likelihood value of the sequence Calculated as a metric, and finally a hierarchical clustering algorithm is used to group fault-related events, where:

$S = [s_i]$ denotes a state sequence of length $L$, and $b_{s_i}(o_i)$ is the observation probability matrix for state $s_i(k)$ under the initial state probability vector $\pi = [\pi_i]$.
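The grouping step can be illustrated with a small agglomerative (hierarchical) clustering routine. The text does not say how sequence likelihoods are turned into pairwise dissimilarities, nor which linkage is used. The sketch therefore takes a precomputed symmetric dissimilarity matrix as a hypothetical input and uses average linkage as an assumption.

```python
def agglomerative(dist, num_clusters):
    """Average-linkage agglomerative clustering over a symmetric matrix.

    dist[i][j] is an assumed dissimilarity between sequences i and j,
    for example one derived from sequence log-likelihoods.
    Returns a list of clusters, each a sorted list of sequence indices.
    """
    if num_clusters < 1:
        raise ValueError("num_clusters must be at least 1")
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Mean pairwise dissimilarity between the members of two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > num_clusters:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = linkage(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]  # merge the closest pair
        del clusters[q]
    return [sorted(c) for c in clusters]
```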

Step 3, training and building the prediction model.

The prediction model is the key to network fault prediction, and the constructed features directly affect its performance. The method combines a hidden semi-Markov model (HSMM) with a Bayesian network (Bayes Net) to predict faults from real-time log data.

A standard HSMM is specified by the state transition probability matrix $G(t) = [g_{ij}(t)]$ and the observation probability matrix $B = b_i(k)$ for state $s_i(k)$ under the initial state probability vector $\pi = [\pi_i]$, and is defined as $\lambda = (\pi, G(t), B)$.

The HSMM is improved as follows. The state duration probability distribution is made continuous: the state duration is treated as a continuous variable and assumed to follow a Weibull distribution, so the state duration probability distribution $f_i(l)$ of a state is:

$$f_i(l) = \alpha\beta(\alpha l)^{\beta-1} e^{-(\alpha l)^{\beta}};$$

where $\alpha$ and $\beta$ are the scale parameter and the shape parameter of the Weibull distribution, respectively;
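The density can be evaluated directly. The sketch below uses the parameterization of the formula in the text; the function name is an assumption. With $\beta = 1$ it reduces to the exponential density $\alpha e^{-\alpha l}$, which gives a quick sanity check.

```python
import math


def weibull_duration_pdf(l: float, alpha: float, beta: float) -> float:
    """Weibull density alpha*beta*(alpha*l)**(beta-1) * exp(-(alpha*l)**beta)."""
    if l <= 0 or alpha <= 0 or beta <= 0:
        raise ValueError("l, alpha and beta must all be positive")
    scaled = alpha * l
    return alpha * beta * scaled ** (beta - 1) * math.exp(-(scaled ** beta))
```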

The probability distribution of the state monitoring values is likewise made continuous and assumed to follow a Weibull distribution, so the probability distribution function $\xi_i(\theta)$ of the state monitoring value is:

$$\xi_i(\theta) = P(y_t = \theta \mid q_t = x_i) = \alpha_i\beta_i(\alpha_i\theta)^{\beta_i-1} e^{-(\alpha_i\theta)^{\beta_i}};$$

where $\alpha_i$ and $\beta_i$ are the Weibull parameters of each state stage. The improved HSMM model can therefore be described as [expression missing in source].

Step 4, fault prediction.

Separate fault and non-fault models are trained, each with its own parameters. The goal is to evaluate whether a given observation sequence (error sequence) $O = [o_1, o_2, \ldots, o_l]$ is a fault-related sequence. The sequence likelihood is first computed under each model, and the sequence is then classified as fault-free or fault-prone using Bayes decision theory.

Step 5, fault result pre-judgment:

When the decision inequality of this step holds, the sequence is marked as a fault-related event sequence and the system issues a fault prediction. In that inequality, the cost term denotes the cost of wrongly judging a fault-related sequence to be fault-unrelated, $P(F)$ denotes the probability of a fault, and the log terms are logarithms of the sequence likelihoods; working with logarithms prevents numerical underflow when the sequence likelihoods are very small. In this way every sequence can be judged and a fault prediction made.
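The inequality itself is not reproduced in the text, so the sketch below shows only the standard minimum-risk Bayes rule in the log domain, under the simplifying assumption that correct decisions cost nothing. The function and parameter names are hypothetical. A sequence is flagged when $\log P(O \mid \text{fault}) - \log P(O \mid \text{non-fault})$ exceeds $\log(c_{\text{false alarm}} / c_{\text{miss}}) + \log\bigl((1 - P(F)) / P(F)\bigr)$.

```python
import math


def predict_fault(loglik_fault, loglik_nonfault, prior_fault,
                  cost_miss=1.0, cost_false_alarm=1.0):
    """Minimum-risk Bayes decision on two sequence log-likelihoods.

    loglik_fault / loglik_nonfault: log-likelihood of the sequence under the
    fault and non-fault models. prior_fault: P(F). The two costs are
    assumptions; correct decisions are taken to cost nothing.
    """
    if not 0.0 < prior_fault < 1.0:
        raise ValueError("prior_fault must lie strictly between 0 and 1")
    if cost_miss <= 0 or cost_false_alarm <= 0:
        raise ValueError("costs must be positive")
    threshold = (math.log(cost_false_alarm / cost_miss)
                 + math.log((1.0 - prior_fault) / prior_fault))
    return loglik_fault - loglik_nonfault > threshold
```

Raising `cost_miss` lowers the threshold, so the predictor warns more readily when missed faults are expensive.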

The above is only a preferred embodiment of the present invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (9)

1. A real-time log control method, characterized in that the method analyzes logged events, classifies, filters, and aggregates the error information, extracts it into sequences, trains fault models, computes the probability that a sequence is a fault sequence and the probability that it is a non-fault sequence, and uses Bayesian classification theory to obtain a result and make a prediction.

2. The real-time log control method according to claim 1, characterized in that the method specifically comprises:

Step 1: collect the log file data on each node of the distributed system, and send newly generated log data to the collector in real time through incremental checking;

Step 2: remove redundant events by deleting events of the same type reported at the same location within a given time period, with a time threshold defining the time window used for event filtering; remove redundant events from the log by deleting similar events reported from multiple different locations within a given time period, and save the data stream to a time-series database; similarity is judged by $\mathrm{Sim}(D_1, D_2)$:

$$\mathrm{Sim}(D_1, D_2) = \cos\theta = \frac{\sum_{k=1}^{n} W_{1k} W_{2k}}{\sqrt{\left(\sum_{k=1}^{n} W_{1k}^{2}\right)\left(\sum_{k=1}^{n} W_{2k}^{2}\right)}};$$

where $D_1$ and $D_2$ denote two sequences and $W_{1k}$, $W_{2k}$ are the vector components of $D_1$ and $D_2$; the similarity is the cosine of the angle between the two vectors, and the larger $\mathrm{Sim}(D_1, D_2)$ is, the more similar the two are;

Step 3: when each record is stored in the data table, use SQL statements to split it into timestamp, process ID, log level, process module, delimiter, and message;

Step 4: use SQL statements to persist the processed, standard-formatted data;

Step 5: extract the log fault sequences;

Step 6: the clustering criterion uses the likelihood of each sequence as the metric, and a hierarchical clustering algorithm groups the fault-related events, where $S = [s_i]$ denotes a state sequence of length $L$, and $b_{s_i}(o_i)$ is the observation probability matrix for state $s_i(k)$ under the initial state probability vector $\pi = [\pi_i]$;

Step 7: combine a hidden semi-Markov model (HSMM) with a Bayesian network (Bayes Net) to predict faults from real-time log data;

a standard HSMM is specified by the state transition probability matrix $G(t) = [g_{ij}(t)]$ and the observation probability matrix $B = b_i(k)$ for state $s_i(k)$ under the initial state probability vector $\pi = [\pi_i]$, and is defined as $\lambda = (\pi, G(t), B)$; the state duration probability distribution is made continuous: the state duration is treated as a continuous variable and assumed to follow a Weibull distribution, so the state duration probability distribution $f_i(l)$ of a state is:

$$f_i(l) = \alpha\beta(\alpha l)^{\beta-1} e^{-(\alpha l)^{\beta}};$$

where $\alpha$ and $\beta$ are the scale parameter and the shape parameter of the Weibull distribution, respectively;

the probability distribution of the state monitoring values is likewise made continuous and assumed to follow a Weibull distribution, so the probability distribution function $\xi_i(\theta)$ of the state monitoring value is:

$$\xi_i(\theta) = P(y_t = \theta \mid q_t = x_i) = \alpha_i\beta_i(\alpha_i\theta)^{\beta_i-1} e^{-(\alpha_i\theta)^{\beta_i}};$$

where $\alpha_i$ and $\beta_i$ are the Weibull parameters of each state stage; the improved HSMM model can be described as [expression missing in source];

Step 8: separate fault and non-fault models are trained, each with its own parameters; the goal is to evaluate whether a given observation sequence $O = [o_1, o_2, \ldots, o_l]$ is a fault-related sequence; the sequence likelihood is computed under each model, and the sequence is then classified as fault-free or fault-prone using Bayes decision theory;

Step 9, fault result pre-judgment: when the decision inequality holds, the sequence is marked as a fault-related event sequence and the system issues a fault prediction; in that inequality, the cost term denotes the cost of wrongly judging a fault-related sequence to be fault-unrelated, $P(F)$ denotes the probability of a fault, and the log terms are logarithms of the sequence likelihoods.

3. The real-time log control method according to claim 2, characterized in that extracting the log fault sequences specifically comprises:

The first step, extracting the error event sequence: use SQL statements to extract the ERROR-level records according to log level, keeping the timestamp and the text message;

The second step, merging similar error events: apply the Levenshtein edit distance algorithm to the event sequence and merge error events with high similarity; the minimum edit distance contains the minimum edit distances of its subproblems:

$$d[i,j] = \min\bigl(d[i-1,j] + 1,\; d[i,j-1] + 1,\; d[i-1,j-1] + c(x_i, y_j)\bigr), \qquad c(x_i, y_j) = \begin{cases} 0, & x_i = y_j \\ 1, & x_i \neq y_j \end{cases}$$

where $d[i-1,j] + 1$ corresponds to inserting one character into the target log and $d[i,j-1] + 1$ to deleting one character from the matched log; when $x_i = y_j$ no modification is needed, so the cost equals that of the previous step, $d[i-1,j-1]$; otherwise the cost is $d[i-1,j-1] + 1$; $d[i,j]$ is the smallest of these three terms;

The third step, error event classification: after the error events have been merged in the previous step, similar error events are grouped according to the keywords in their text messages, each group is assigned an ID, and the result is saved in the database;

The fourth step, sequence extraction: in chronological order, extract the events that fall within a time window of given length before a fault occurs and set them as the fault-related event sequence, together with a fault lead time preceding the fault; the current fault event is the associated fault event; a non-fault-related event sequence is a sequence of events taken from a time interval in which the system did not fail.

4. A real-time log control system for the real-time log control method according to claim 1, characterized in that the real-time log control system comprises a log information processing module and a log fault analysis module.

5. The real-time log control system according to claim 4, characterized in that the log fault analysis module comprises:

a log information collection unit, configured to collect the log file data on each node of the distributed system; the log collection function should allow the monitored log files to be customized, and sends newly generated log data to the collector in real time through incremental checking;

a log information filtering unit, configured to de-duplicate and filter the data;

a log information standard formatting unit, configured to convert the processed log information into a standard data format;

a log storage unit, configured to persist the processed, standard-formatted data.

6. The real-time log control system according to claim 4, characterized in that the log fault analysis module comprises:

a log event sequence extraction unit;

a fault-related event clustering unit, configured to train a small hidden semi-Markov model on the events in advance and compute the sequence likelihoods;

a fault prediction unit, which uses the hidden semi-Markov model and Bayesian classification theory to determine whether a sequence is a fault-related sequence;

a fault result judgment output unit: when a sequence is determined to be fault-related, the system emits a fault warning stream and outputs a status fault alert.

7. The real-time log control system according to claim 6, characterized in that the log event sequence extraction unit further comprises:

an error event record extraction unit, which extracts the ERROR-level records according to log level, keeping the timestamp, process module, and text message;

a similar error event merging unit, which applies the Levenshtein edit distance algorithm to the error event sequence and merges error events with high similarity;

an error event classification unit, which applies the Levenshtein edit distance algorithm to the event sequence, groups similar error events, and assigns IDs;

a fault-related sequence extraction unit, which extracts, in chronological order, the events in a period of time before a fault and sets them as fault precursor events.

8. A cloud computing system using the real-time log control method according to any one of claims 1 to 3.

9. A cloud computing server using the real-time log control method according to any one of claims 1 to 3.
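The cosine similarity used for filtering in step 2 of claim 2 can be sketched as follows. This is a minimal illustration that assumes each event has already been turned into a numeric term-weight vector of equal length; the function name is hypothetical, and how the weights $W_{1k}$, $W_{2k}$ are derived is not specified in the claims.

```python
import math


def cosine_similarity(w1, w2):
    """Cosine of the angle between two equal-length weight vectors."""
    if len(w1) != len(w2):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(w1, w2))
    norm = math.sqrt(sum(a * a for a in w1) * sum(b * b for b in w2))
    # A zero vector has no direction; treat it as dissimilar to everything.
    return dot / norm if norm else 0.0
```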
CN201711333074.7A 2017-12-13 2017-12-13 Real-time log control system and control method, cloud computing system and server Active CN108038049B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711333074.7A CN108038049B (en) 2017-12-13 2017-12-13 Real-time log control system and control method, cloud computing system and server

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711333074.7A CN108038049B (en) 2017-12-13 2017-12-13 Real-time log control system and control method, cloud computing system and server

Publications (2)

Publication Number Publication Date
CN108038049A true CN108038049A (en) 2018-05-15
CN108038049B CN108038049B (en) 2021-11-09

Family

ID=62102328

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711333074.7A Active CN108038049B (en) 2017-12-13 2017-12-13 Real-time log control system and control method, cloud computing system and server

Country Status (1)

Country Link
CN (1) CN108038049B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080004904A1 (en) * 2006-06-30 2008-01-03 Tran Bao Q Systems and methods for providing interoperability among healthcare devices
CN102968556A (en) * 2012-11-08 2013-03-13 重庆大学 Probability distribution-based distribution network reliability judgment method
CN103761173A (en) * 2013-12-28 2014-04-30 华中科技大学 Log based computer system fault diagnosis method and device
CN103825272A (en) * 2014-03-18 2014-05-28 国家电网公司 Reliability determination method for power distribution network with distributed wind power based on analytical method
CN104361169A (en) * 2014-11-12 2015-02-18 武汉科技大学 Method for monitoring reliability of modeling based on decomposition method
CN104537487A (en) * 2014-12-25 2015-04-22 云南电网公司电力科学研究院 Assessment method of operating dynamic risk of electric transmission and transformation equipment
CN104778370A (en) * 2015-04-20 2015-07-15 北京交通大学 Risk analyzing method based on Monte-Carlo simulation solution dynamic fault tree model
CN105095918A (en) * 2015-09-07 2015-11-25 上海交通大学 Multi-robot system fault diagnosis method
CN105653444A (en) * 2015-12-23 2016-06-08 北京大学 Internet log data-based software defect failure recognition method and system
CN105893208A (en) * 2016-03-31 2016-08-24 城云科技(杭州)有限公司 Cloud computing platform system fault prediction method based on hidden semi-Markov models
CN107423205A (en) * 2017-07-11 2017-12-01 北京明朝万达科技股份有限公司 A kind of system failure method for early warning and system for anti-data-leakage system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
FELIX SALFNER 等: "Using Hidden Semi-Markov Models for Effective Online Failure Prediction", 《26TH IEEE INTERNATIONAL SYMPOSIUM ON RELIABLE DISTRIBUTED SYSTEMS》 *

Cited By (57)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110598871A (en) * 2018-05-23 2019-12-20 中国移动通信集团浙江有限公司 Method and system for flexibly controlling service flow under micro-service architecture
CN110647446B (en) * 2018-06-26 2023-02-21 中兴通讯股份有限公司 Log fault association and prediction method, device, equipment and storage medium
CN110647446A (en) * 2018-06-26 2020-01-03 中兴通讯股份有限公司 Log fault association and prediction method, device, equipment and storage medium
KR102483025B1 (en) * 2018-06-28 2022-12-29 지티이 코포레이션 Operational maintenance systems and methods
KR20210019564A (en) * 2018-06-28 2021-02-22 지티이 코포레이션 Operation maintenance system and method
CN110659173B (en) * 2018-06-28 2023-05-26 中兴通讯股份有限公司 Operation and maintenance system and method
US11947438B2 (en) 2018-06-28 2024-04-02 Xi'an Zhongxing New Software Co., Ltd. Operation and maintenance system and method
WO2020001642A1 (en) * 2018-06-28 2020-01-02 中兴通讯股份有限公司 Operation and maintenance system and method
CN110659173A (en) * 2018-06-28 2020-01-07 中兴通讯股份有限公司 Operation and maintenance system and method
WO2020000763A1 (en) * 2018-06-29 2020-01-02 平安科技(深圳)有限公司 Network risk monitoring method and apparatus, computer device and storage medium
CN109063017A (en) * 2018-07-12 2018-12-21 广州市闲愉凡生信息科技有限公司 Data persistence distribution method of cloud computing platform
CN109218407B (en) * 2018-08-14 2022-10-25 平安普惠企业管理有限公司 Code management and control method based on log monitoring technology and terminal equipment
CN109218407A (en) * 2018-08-14 2019-01-15 平安普惠企业管理有限公司 Code management-control method and terminal device based on log monitoring technology
CN109343990A (en) * 2018-09-25 2019-02-15 江苏润和软件股份有限公司 A kind of cloud computing system method for detecting abnormality based on deep learning
CN109460478A (en) * 2018-11-06 2019-03-12 北京京航计算通讯研究所 System interface timing knowledge analysis method based on fine granularity Feature Semantics network
CN109460362A (en) * 2018-11-06 2019-03-12 北京京航计算通讯研究所 System interface timing knowledge analysis system based on fine granularity Feature Semantics network
CN109885456A (en) * 2019-02-20 2019-06-14 武汉大学 A multi-type fault event prediction method and device based on system log clustering
CN112084105A (en) * 2019-06-13 2020-12-15 中兴通讯股份有限公司 Log file monitoring and early warning method, device, equipment and storage medium
CN110704221A (en) * 2019-09-02 2020-01-17 西安交通大学 Data center fault prediction method based on data enhancement
CN110704221B (en) * 2019-09-02 2020-10-27 西安交通大学 A data-enhanced fault prediction method for data centers
CN111444156A (en) * 2020-04-20 2020-07-24 南阳理工学院 A fault diagnosis method based on cloud computing
CN111444156B (en) * 2020-04-20 2023-01-24 南阳理工学院 Fault diagnosis method based on cloud computing
CN111585799A (en) * 2020-04-29 2020-08-25 杭州迪普科技股份有限公司 Network fault prediction model establishing method and device
CN111858263A (en) * 2020-06-12 2020-10-30 苏州浪潮智能科技有限公司 Log analysis-based fault prediction method, system and device
CN111858263B (en) * 2020-06-12 2022-08-02 苏州浪潮智能科技有限公司 Log analysis-based fault prediction method, system and device
CN111858526A (en) * 2020-06-19 2020-10-30 国网福建省电力有限公司信息通信分公司 Fault time and space prediction method and system based on information system log
CN111858526B (en) * 2020-06-19 2022-08-16 国网福建省电力有限公司信息通信分公司 Failure time space prediction method and system based on information system log
CN111881153A (en) * 2020-07-24 2020-11-03 北京金山云网络技术有限公司 Data processing method and device, electronic equipment and machine-readable storage medium
CN111881011A (en) * 2020-07-31 2020-11-03 网易(杭州)网络有限公司 Log management method, platform, server and storage medium
CN112000502A (en) * 2020-08-11 2020-11-27 杭州安恒信息技术股份有限公司 Method, device, electronic device and storage medium for processing massive error logs
CN112738088A (en) * 2020-12-28 2021-04-30 上海观安信息技术股份有限公司 Behavior sequence anomaly detection method and system based on unsupervised algorithm
CN112738088B (en) * 2020-12-28 2023-03-21 上海观安信息技术股份有限公司 Behavior sequence anomaly detection method and system based on unsupervised algorithm
CN112800666A (en) * 2021-01-18 2021-05-14 上海派拉软件股份有限公司 Log behavior analysis training method and identity security risk prediction method
CN112416732B (en) * 2021-01-20 2021-06-01 国能信控互联技术有限公司 Hidden Markov model-based data acquisition operation anomaly detection method
CN112416732A (en) * 2021-01-20 2021-02-26 国能信控互联技术有限公司 Hidden Markov model-based data acquisition operation anomaly detection method
CN112988440A (en) * 2021-02-23 2021-06-18 山东英信计算机技术有限公司 System fault prediction method and device, electronic equipment and storage medium
CN112988440B (en) * 2021-02-23 2023-08-01 山东英信计算机技术有限公司 System fault prediction method and device, electronic equipment and storage medium
CN113806178A (en) * 2021-09-22 2021-12-17 中国建设银行股份有限公司 Cluster node fault detection method and device
CN113806178B (en) * 2021-09-22 2024-06-28 中国建设银行股份有限公司 Cluster node fault detection method and device
CN114169651A (en) * 2022-02-14 2022-03-11 中国空气动力研究与发展中心计算空气动力研究所 Active prediction method for supercomputer operation failure based on application similarity
CN114169651B (en) * 2022-02-14 2022-04-19 中国空气动力研究与发展中心计算空气动力研究所 Active prediction method for supercomputer operation failure based on application similarity
CN114676105A (en) * 2022-03-29 2022-06-28 国家电网有限公司信息通信分公司 Log data preprocessing method and device
WO2023231192A1 (en) * 2022-05-31 2023-12-07 中电信数智科技有限公司 Srv6-based intelligent network and device fault prediction method and system
CN115033889A (en) * 2022-06-22 2022-09-09 中国电信股份有限公司 Illegal privilege escalation detection method and device, storage medium and computer equipment
CN115033889B (en) * 2022-06-22 2023-10-31 中国电信股份有限公司 Illegal privilege escalation detection method and device, storage medium and computer equipment
CN115426276A (en) * 2022-08-22 2022-12-02 神华准格尔能源有限责任公司 Monitoring method for strip mine 5G major equipment and cloud server
CN115426276B (en) * 2022-08-22 2024-03-12 神华准格尔能源有限责任公司 Method for monitoring 5G major equipment of strip mine and cloud server
CN116192612A (en) * 2023-04-23 2023-05-30 成都新西旺自动化科技有限公司 System fault monitoring and early warning system and method based on log analysis
WO2024241491A1 (en) * 2023-05-23 2024-11-28 三菱電機株式会社 Information processing device, analysis system, analysis method, and program
JP7504307B1 (en) 2023-05-23 2024-06-21 三菱電機株式会社 Information processing device, analysis system, analysis method, and program
CN116520817B (en) * 2023-07-05 2023-08-29 贵州宏信达高新科技有限责任公司 ETC system running state real-time monitoring system and method based on expressway
CN116520817A (en) * 2023-07-05 2023-08-01 贵州宏信达高新科技有限责任公司 ETC system running state real-time monitoring system and method based on expressway
CN117348586B (en) * 2023-10-11 2024-02-27 江苏云涌电子科技股份有限公司 Event sequence record SOE implementation method based on energy storage EMS system
CN117348586A (en) * 2023-10-11 2024-01-05 江苏云涌电子科技股份有限公司 Event sequence record SOE implementation method based on energy storage EMS system
CN118192502A (en) * 2024-03-26 2024-06-14 南京依维柯汽车有限公司 Vehicle fault diagnosis system and method
CN118740604A (en) * 2024-07-18 2024-10-01 南京财经大学 A cloud application fault location method and device based on knowledge analysis
CN118740604B (en) * 2024-07-18 2025-01-28 南京财经大学 A cloud application fault location method and device based on knowledge analysis

Also Published As

Publication number Publication date
CN108038049B (en) 2021-11-09

Similar Documents

Publication Publication Date Title
CN108038049A (en) Real-time logs control system and control method, cloud computing system and server
CN110322048B (en) Fault early warning method for production logistics conveying equipment
CN108536123B (en) Fault diagnosis method for on-board train control equipment based on long short-term memory neural network
CN106844161B (en) Abnormity monitoring and predicting method and system in calculation system with state flow
CN110570012B (en) A Storm-based fault warning method and system for power plant production equipment
CN101950327B (en) Equipment state prediction method based on fault tree information
CN107358347A (en) Equipment cluster health state evaluation method based on industrial big data
CN114048870A (en) An abnormal monitoring method of power system based on intelligent mining of log features
CN106504116A (en) Stability assessment method based on the association between power grid operation and transient stability margin index
CN111435366A (en) Equipment fault diagnosis method and device and electronic equipment
CN110134566A (en) A method for monitoring information system performance in cloud environment based on tag technology
CN115809183A (en) Method for discovering and handling Xinchuang (IT application innovation) terminal faults based on knowledge graph
CN103761173A (en) Log based computer system fault diagnosis method and device
CN105893208A (en) Cloud computing platform system fault prediction method based on hidden semi-Markov models
CN110399278B (en) Alarm fusion system and method based on data center anomaly monitoring
CN104777827A (en) Method for diagnosing fault of high-speed railway signal system vehicle-mounted equipment
CN113204914B (en) Flight data abnormity interpretation method based on multi-flight data characterization modeling
CN108763048B (en) A method for early warning and reliability evaluation of hard disk failure based on particle filter
CN111581056B (en) Software engineering database maintenance and early warning system based on artificial intelligence
CN113485878B (en) Multi-data center fault detection method
CN109469919B (en) Power station air preheater ash blocking monitoring method based on weight clustering
CN115665787A (en) A low-overhead AMF network intelligent fault diagnosis method based on machine learning
CN115758908A (en) An online prediction method of alarms in the case of alarm floods based on deep learning
CN114676791A (en) Electric power system alarm information processing method based on fuzzy evidence reasoning
CN112118127B (en) Service reliability guarantee method based on fault similarity

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant