CN108038049A - Real-time logs control system and control method, cloud computing system and server - Google Patents
- Publication number
- CN108038049A (CN201711333074.7A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/34—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
- G06F11/3452—Performance evaluation by statistical analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/34—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
- G06F11/3466—Performance evaluation by tracing or monitoring
- G06F11/3476—Data logging
Abstract
The invention belongs to the technical field of cloud computing and discloses a real-time log control system and control method, a cloud computing system and a server. By analyzing logged events, the error messages are classified, filtered and aggregated and then extracted into sequences; a fault model is trained, the probability that a sequence is a fault sequence and the probability that it is a non-fault sequence are computed, and Bayesian classification theory is used to obtain the result and make a prediction. Compared with matching against a large number of rules, this improves the speed of judgment. Fault prediction research is of great significance for reducing the burden of network management and maintenance and for reducing the losses caused by network faults.
Description
Technical field
The invention belongs to the technical field of cloud computing, and in particular relates to a real-time log control system and control method, a cloud computing system and a server.
Background
With the rapid development of computer technology, cloud computing has become one of the most important fields of computing, and cloud services reach into everyone's life and work. By computing over real-time data, machine learning algorithms can predict faults that may occur in a cloud computing system in advance and so reserve time to respond to them, while also supporting elastic horizontal scaling of the cluster's processing capacity to keep up with growing data volumes and user demand. Processing massive log data in real time, and mining the system state and fault predictions from that data, has good development and application prospects.
In summary, existing fault prediction models have two problems. First, the state duration distribution is mostly assumed to be exponential by default, whereas in practice the change in fault state probability does not follow an exponential distribution. Second, the probabilities of the fault state observation values are discretized, which can have unexpected effects on experimental analysis in a big-data environment. The present invention therefore treats both the state duration distribution and the state observation probability distribution as continuous, assuming a Weibull distribution for each; the improved prediction model can raise the probability values of diagnosis and prediction.
Summary of the invention
In view of the problems of the prior art, the invention provides a real-time log control system and control method, a cloud computing system and a server.
The invention is implemented as a real-time log control method which, by analyzing logged events, classifies, filters and aggregates the error messages, extracts them into sequences, trains a fault model, computes the probability that a sequence is a fault sequence and the probability that it is a non-fault sequence, and uses Bayesian classification theory to obtain the result and make a prediction.
Further, the real-time log control method specifically comprises:
Step 1: collect the log file data on each node of the distributed system and, through incremental checking, send newly generated log data to the collector in real time;
Step 2: remove events of the same type reported at the same location within a given period, thereby deleting redundant events, where a time threshold defines the time window used for event filtering; remove similar events reported from several different locations within a given period, thereby deleting redundant events from the log, and save the data stream to a time-series database; similarity is judged with Sim(D_1, D_2):
$$\mathrm{Sim}(D_1, D_2) = \cos\theta = \frac{\sum_{k} W_{1k} W_{2k}}{\sqrt{\sum_{k} W_{1k}^{2}}\,\sqrt{\sum_{k} W_{2k}^{2}}}$$
where D_1 and D_2 denote two sequences and W_{1k}, W_{2k} denote the vector components of D_1 and D_2; the similarity is the cosine of the angle between the two vectors, and a larger Sim(D_1, D_2) means the two are more similar;
Step 3: when each record is stored in the data table, use SQL statements to split it into timestamp, process ID, log level, process module, separator and message;
Step 4: use SQL statements to persist the processed, standard-formatted data;
Step 5: extract the log fault sequences;
Step 6: the clustering criterion uses the sequence likelihood as the distance measure, and a hierarchical clustering algorithm groups the fault-related events. In the likelihood formula (not reproduced in the source text), S = [s_i] denotes a state sequence of length L, and b_{s_i}(o_i) is the probability of the observation o_i in state s_i, with initial state probability vector π = [π_i];
Step 7: combine the improved HSMM with a Bayesian network (BayesNet) to predict faults from real-time log data;
A standard HSMM is specified by the state transition probability matrix G(t) = [g_ij(t)], the observation probability matrix B = [b_i(k)] of the states s_i, and the initial state probability vector π = [π_i]. The improvement first makes the state duration probability distribution continuous: the state duration is treated as a continuous variable assumed to follow a Weibull distribution, so the state duration probability distribution f_i(l) of a state is:
$$f_i(l) = \alpha\beta(\alpha l)^{\beta-1} e^{-(\alpha l)^{\beta}}$$
where α and β are the scale parameter and the shape parameter of the Weibull distribution, respectively;
The probability distribution of the state monitoring values is likewise made continuous and assumed to follow a Weibull distribution, with state observation probability distribution function ξ_i(θ) [formula image not reproduced in the source text],
where α_i and β_i are the Weibull parameters of each state stage; the improved HSMM is described by a corresponding model tuple [not reproduced in the source text];
Step 8: train the failure and non-failure models to obtain their parameters. The goal is to decide whether a given observation sequence O = [o_1, o_2, ..., o_l] is a fault-related sequence: the sequence likelihood under each classification model is computed, and the sequence is then classified as failure-prone or not by Bayes decision theory;
Step 9: judge the fault result in advance:
When the decision condition holds [formula image not reproduced in the source text], the sequence is marked as a fault-related event sequence and the system issues a fault prediction. In that condition, the cost term denotes the cost of wrongly judging a fault-related sequence to be fault-unrelated, P(F) denotes the probability of a fault, and the log term denotes the logarithm of the sequence likelihood.
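The Weibull state duration density given in step 7 can be evaluated directly. The following is a minimal sketch under stated assumptions: the function names `weibull_pdf` and `weibull_cdf` are illustrative and not part of the patent, and the parameterization follows the formula as written, in which α multiplies the duration (so β = 1 reduces to an exponential distribution with rate α).

```python
import math


def weibull_pdf(l: float, alpha: float, beta: float) -> float:
    """Density f(l) = alpha*beta*(alpha*l)**(beta-1) * exp(-(alpha*l)**beta).

    Durations l <= 0 are given density 0.0 as a simplification
    (this also avoids the singularity at l = 0 when beta < 1).
    """
    if alpha <= 0 or beta <= 0:
        raise ValueError("alpha and beta must be positive")
    if l <= 0:
        return 0.0
    z = alpha * l
    return alpha * beta * z ** (beta - 1) * math.exp(-(z ** beta))


def weibull_cdf(l: float, alpha: float, beta: float) -> float:
    """Probability that a state lasts at most l: 1 - exp(-(alpha*l)**beta)."""
    if alpha <= 0 or beta <= 0:
        raise ValueError("alpha and beta must be positive")
    if l <= 0:
        return 0.0
    return 1.0 - math.exp(-((alpha * l) ** beta))
```

With β > 1 the hazard of leaving a state grows with the time already spent in it, which an exponential duration model cannot express; this is one way to read the motivation for replacing the exponential default.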
Further, extracting the log fault sequences specifically comprises:
First, extract the error event sequence: use SQL statements to extract the records at the ERROR level according to the log level, keeping the timestamp and the text message;
Second, merge similar error events: apply the Levenshtein edit distance algorithm to the event sequence and merge error events that are highly similar; the minimum edit distance contains the minimum edit distances of its subproblems:
$$d[i,j] = \min\bigl\{\, d[i-1,j] + 1,\; d[i,j-1] + 1,\; d[i-1,j-1] + c(x_i, y_j) \,\bigr\}, \qquad c(x_i, y_j) = \begin{cases} 0 & x_i = y_j \\ 1 & x_i \neq y_j \end{cases}$$
where d[i-1, j] + 1 corresponds to inserting a character into the target log message and d[i, j-1] + 1 to deleting a character from the matched log message; when x_i = y_j no modification is needed, so the cost equals that of the previous step d[i-1, j-1], otherwise it is d[i-1, j-1] + 1; d[i, j] is the smallest of these three terms;
Third, classify the error events: after the error events have been merged in the previous step, group similar error events by the keywords in their text, assign each group an ID, and save it in the database;
Fourth, extract the sequences: in chronological order, extract the events within a period of time before a failure occurs and take them as the fault-related event sequence; a failure lead time is defined, and the current failure event is the associated failure event. A non-fault-related event sequence is a sequence of events within a time interval in which the system did not fail.
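The merging of similar error events in the second step above rests on the Levenshtein dynamic program. The sketch below is illustrative only: the normalized similarity, the greedy merge strategy, the threshold of 0.8 and the names `similarity` and `merge_similar` are assumptions, since the text does not say how "highly similar" is decided.

```python
from typing import List


def levenshtein(x: str, y: str) -> int:
    """Edit distance via d[i][j] = min(insert, delete, substitute-or-match)."""
    prev = list(range(len(y) + 1))  # row d[0][*]: build y from an empty string
    for i, xc in enumerate(x, 1):
        cur = [i]  # d[i][0]: delete the first i characters of x
        for j, yc in enumerate(y, 1):
            cost = 0 if xc == yc else 1
            cur.append(min(prev[j] + 1,          # d[i-1][j] + 1
                           cur[j - 1] + 1,       # d[i][j-1] + 1
                           prev[j - 1] + cost))  # d[i-1][j-1] + c(x_i, y_j)
        prev = cur
    return prev[-1]


def similarity(x: str, y: str) -> float:
    """Assumed normalization: 1 minus distance over the longer length."""
    longest = max(len(x), len(y))
    return 1.0 if longest == 0 else 1.0 - levenshtein(x, y) / longest


def merge_similar(messages: List[str], threshold: float = 0.8) -> List[str]:
    """Keep one representative per group of messages above the threshold."""
    representatives: List[str] = []
    for message in messages:
        if not any(similarity(message, r) >= threshold for r in representatives):
            representatives.append(message)
    return representatives
```

Keeping only two rows of the table bounds memory by the length of the shorter dimension, which matters when many log messages are compared pairwise.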
Another object of the invention is to provide a real-time log control system implementing the real-time log control method, the system comprising a log information processing module and a log fault analysis module.
Further, the log information processing module comprises:
a log information collection unit, for collecting the log file data on each node of the distributed system; the collection function should allow the monitored log files to be configured and, through incremental checking, sends newly generated log data to the collector in real time;
a log information filtering unit, for removing redundancy from the data and filtering it;
a log information standard formatting unit, for formatting the processed log information into a standard data format;
a log storage unit, for persistently storing the processed, standard-formatted data.
Further, the log fault analysis module comprises:
a log event sequence extraction unit;
a fault-related event clustering unit, for training a small hidden semi-Markov model in advance from the events and computing the sequence likelihood;
a fault prediction unit, which uses the hidden semi-Markov model and Bayesian classification theory to determine whether a sequence is a fault-related sequence;
a fault result judgment and output unit: when a sequence is judged to be fault-related, the system emits a fault warning stream and outputs a status fault alert.
The log event sequence extraction unit further comprises:
an error event record extraction unit, which extracts the records at the ERROR level according to the log level, keeping the timestamp, process module and text message;
a similar error event merging unit, which applies the Levenshtein edit distance algorithm to the error event sequence and merges error events that are highly similar;
an error event classification unit, which applies the Levenshtein edit distance algorithm to the event sequence, groups similar error events and assigns them IDs;
a fault-related sequence extraction unit, which extracts, in chronological order, the events within a period of time before a failure and takes them as the events preceding the failure.
Another object of the invention is to provide a cloud computing system using the real-time log control method.
Current fault prediction research falls into three main classes of methods: fault detection models based on log frequency, fault detection models based on message frequency, and fault detection models based on state transitions.
The invention collects log information in real time while the system is running and clusters it. By analyzing the event logs with machine learning algorithms and models, it predicts faults that may occur in the future, so that faults can be investigated and located in advance during operation, improving operation and maintenance efficiency and preventing emergency failures. By analyzing logged events, the invention classifies, filters and aggregates all error messages, extracts them into sequences, trains a fault model, computes the probability that a sequence is a fault sequence and the probability that it is a non-fault sequence, and uses Bayesian classification theory to obtain the result and make a prediction.
The effectiveness of the method is judged mainly by three parameters: precision, recall and the F-measure. Precision is the proportion of all predictions that are correct, recall is the proportion of all failures that are correctly predicted, and the F-measure is a combined measure of precision and recall;
The prediction cases are listed in Table 1:
Table 1 Prediction cases
The prediction effectiveness parameters are given in Table 2:
Table 2 Expressions of the effectiveness parameters
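The three effectiveness parameters described above can be computed from the counts of prediction outcomes. This sketch assumes the usual definitions (precision = TP / (TP + FP), recall = TP / (TP + FN), F-measure as their harmonic mean); the tables themselves are not reproduced in the source text, so the formulas and the name `prediction_metrics` are assumptions consistent with the verbal definitions rather than a copy of Table 2.

```python
from typing import Tuple


def prediction_metrics(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, f_measure) from outcome counts.

    tp: failures that were predicted (true positives)
    fp: predictions with no failure following (false positives)
    fn: failures that were not predicted (false negatives)
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

For example, 8 correct warnings, 2 false warnings and 2 missed failures give a precision and a recall of 0.8 each.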
System experiments yielded the following data, from which it can be seen that the improved system is more accurate than the system before the improvement.
Description of drawings
Fig. 1 is a schematic structural diagram of the real-time log control system provided by an embodiment of the invention;
In the figure: 1, log information processing module; 1-1, log information collection unit; 1-2, log information filtering unit; 1-3, log information standard formatting unit; 1-4, log storage unit; 2, log fault analysis module; 2-1, log event sequence extraction unit; 2-1-1, error event record extraction unit; 2-1-2, similar error event merging unit; 2-1-3, error event classification unit; 2-1-4, fault-related sequence extraction unit; 2-2, fault-related event clustering unit; 2-3, fault prediction unit; 2-4, fault result judgment and output unit.
Fig. 2 is a flowchart of the real-time log control method provided by an embodiment of the invention.
Fig. 3 is an implementation flowchart of the real-time log control method provided by an embodiment of the invention.
Fig. 4 is a schematic diagram of fault sequence extraction provided by an embodiment of the invention.
Detailed description
To make the objects, technical solutions and advantages of the invention clearer, the invention is described in further detail below with reference to embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and do not limit it.
The application principle of the invention is described in detail below with reference to the drawings.
As shown in Fig. 1, the real-time log control system provided by an embodiment of the invention comprises a log information processing module 1 and a log fault analysis module 2.
The log information processing module 1 comprises:
Log information collection unit 1-1: collects the log file data on each node of the distributed system; the collection function should allow the monitored log files to be configured and, through incremental checking, sends newly generated log data to the collector in real time.
Log information filtering unit 1-2: removes redundancy from the data and filters it.
Log information standard formatting unit 1-3: formats the processed log information into a standard data format, for example: timestamp, process ID, log level, process module, separator, message. The log levels fall into several classes: ERROR, WARNING, TRACE, INFO, DEBUG, CRITICAL and AUDIT; the earlier a level appears in this list, the higher it ranks, and a higher rank indicates a more important event.
Log storage unit 1-4: persistently stores the processed, standard-formatted data for later extraction and analysis.
The log fault analysis module 2 comprises:
Log event sequence extraction unit 2-1;
Fault-related event clustering unit 2-2: trains a small hidden semi-Markov model (HSMM) in advance from the events and computes the sequence likelihood, that is, the likelihood that the trained model generates the given observation sequence;
Fault prediction unit 2-3: uses the hidden semi-Markov model and Bayesian classification theory to determine whether a sequence is a fault-related sequence;
Fault result judgment and output unit 2-4: when a sequence is judged to be fault-related, the system emits a fault warning stream and outputs a status fault alert.
The log event sequence extraction unit 2-1 further comprises:
Error event record extraction unit 2-1-1: extracts the records at the ERROR level according to the log level, keeping information such as the timestamp, process module and text message;
Similar error event merging unit 2-1-2: applies the Levenshtein edit distance algorithm to the error event sequence and merges error events that are highly similar;
Error event classification unit 2-1-3: applies the Levenshtein edit distance algorithm to the event sequence, groups similar error events and assigns them IDs;
Fault-related sequence extraction unit 2-1-4: extracts, in chronological order, the events within a period of time before a failure and takes them as the events preceding the failure.
As shown in Fig. 2, the real-time log control method provided by an embodiment of the invention comprises the following steps:
S201: by analyzing logged events, classify, filter and aggregate all error messages and extract them into sequences;
S202: train a fault model, compute the probability that a sequence is a fault sequence and the probability that it is a non-fault sequence, and use Bayesian classification theory to obtain the result and make a prediction.
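The Bayesian classification in S202 can be illustrated with a two-class minimum-cost decision on log-likelihoods. This is a sketch under stated assumptions, not the decision formula of the patent (which is not reproduced in the source text): correct decisions are assumed to cost nothing, the log-likelihoods are assumed to come from already trained failure and non-failure models, and all names and default costs are hypothetical.

```python
import math


def is_failure_sequence(loglik_failure: float,
                        loglik_nonfailure: float,
                        prior_failure: float,
                        cost_miss: float = 1.0,
                        cost_false_alarm: float = 1.0) -> bool:
    """Predict a failure when doing so has the lower expected cost.

    Predicting "failure" is cheaper when
        cost_miss * P(o|F) * P(F) > cost_false_alarm * P(o|not F) * P(not F),
    which in log form is
        log P(o|F) - log P(o|not F)
            > log(cost_false_alarm / cost_miss) + log(P(not F) / P(F)).
    """
    if not 0.0 < prior_failure < 1.0:
        raise ValueError("prior_failure must lie strictly between 0 and 1")
    threshold = (math.log(cost_false_alarm / cost_miss)
                 + math.log((1.0 - prior_failure) / prior_failure))
    return loglik_failure - loglik_nonfailure > threshold
```

Working in log space avoids underflow for long sequences, and raising `cost_miss` lowers the threshold so that rarer failures still trigger a warning.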
The application principle of the invention is further described below with reference to the drawings.
Rather than matching a large number of rules against fault keywords, the invention uses an improved HSMM (hidden semi-Markov model) and Bayes decision theory (Bayesian classification theory) to compute directly the probability that an error sequence is a fault sequence, which improves the speed of judgment.
As shown in Fig. 3, the specific steps of the real-time log control method provided by an embodiment of the invention are as follows:
1. Log information processing
Step 1: log information collection
The system should be able to collect the log file data on each node of the distributed system. The collection function should allow the monitored log files to be configured and, through incremental checking, send newly generated log data to the collector in real time.
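Incremental checking, as described above, amounts to remembering how far each log file has already been read and forwarding only what was appended since. The sketch below is one possible reading; the class name `IncrementalLogReader`, the byte-offset bookkeeping and the handling of truncated files are assumptions, and sending the lines to the collector is left out.

```python
import os
from typing import List


class IncrementalLogReader:
    """Returns only the complete lines appended since the previous call."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.offset = 0  # byte position up to which the file was consumed

    def read_new_lines(self) -> List[str]:
        if not os.path.exists(self.path):
            return []
        if os.path.getsize(self.path) < self.offset:
            self.offset = 0  # file was truncated or rotated: start over
        with open(self.path, "rb") as handle:
            handle.seek(self.offset)
            chunk = handle.read()
        end = chunk.rfind(b"\n")
        if end == -1:
            return []  # no complete line has been written yet
        complete = chunk[: end + 1]
        self.offset += len(complete)  # a trailing partial line is left for later
        return complete.decode("utf-8", errors="replace").splitlines()
```

Calling `read_new_lines()` periodically for each configured file yields the newly generated records; a partially written last line is held back until its newline arrives.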
Step 2: log information filtering
There are two methods: temporal filtering and spatial filtering. When the system detects an anomaly, it keeps emitting a stream of warning messages until the failure occurs. Likewise, once the system has failed, the failure message may appear in the log many times before the problem is resolved.
Temporal filtering removes events of the same type reported at the same location within a given period, thereby deleting redundant events; a time threshold defines the time window used for event filtering. Spatial filtering removes similar events reported from several different locations within a given period, thereby deleting redundant events from the log, and the data stream is saved to a time-series database, which saves space and improves efficiency. Similarity is usually judged with Sim(D_1, D_2):
$$\mathrm{Sim}(D_1, D_2) = \cos\theta = \frac{\sum_{k} W_{1k} W_{2k}}{\sqrt{\sum_{k} W_{1k}^{2}}\,\sqrt{\sum_{k} W_{2k}^{2}}}$$
where D_1 and D_2 denote two sequences and W_{1k}, W_{2k} denote the vector components of D_1 and D_2; the similarity is the cosine of the angle between the two vectors, and a larger Sim(D_1, D_2) means the two are more similar.
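Both filters described above can be sketched in a few lines. The following is an illustration under stated assumptions: events are taken to be `(timestamp, location, event_type)` tuples sorted by time, messages are turned into term-count vectors by whitespace splitting, and the function names are hypothetical. The text does not specify how the vectors W are built, so the bag-of-words choice is an assumption.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Event = Tuple[float, str, str]  # (timestamp in seconds, location, event type)


def temporal_filter(events: List[Event], window: float) -> List[Event]:
    """Drop an event if the same (location, type) occurred within `window`.

    The last-seen time is refreshed even for dropped events, so an
    uninterrupted burst collapses to its first event.
    """
    last_seen: Dict[Tuple[str, str], float] = {}
    kept: List[Event] = []
    for timestamp, location, event_type in events:
        key = (location, event_type)
        previous = last_seen.get(key)
        last_seen[key] = timestamp
        if previous is not None and timestamp - previous <= window:
            continue
        kept.append((timestamp, location, event_type))
    return kept


def term_vector(message: str) -> Counter:
    """Assumed vectorization: lower-cased whitespace tokens with counts."""
    return Counter(message.lower().split())


def cosine_similarity(d1: Counter, d2: Counter) -> float:
    """Sim(D1, D2): dot product divided by the product of the norms."""
    dot = sum(weight * d2.get(term, 0) for term, weight in d1.items())
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)
```

A spatial filter would then compare events from different locations inside the same time window and drop those whose cosine similarity exceeds a chosen threshold.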
Step 3: log format standardization.
When each record is stored in the data table, SQL statements are used to split it into fields such as timestamp, process ID, log level, process module, separator and message.
Step 4: log storage.
SQL statements are used to persist the processed, standard-formatted data for later extraction and analysis.
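Steps 3 and 4 can be illustrated together: split each raw line into the six fields and persist the result with SQL. Everything specific in this sketch is an assumption: the line layout matched by `LINE_RE` (a space-separated record with a bracketed separator), the table name `log_events` and the use of SQLite are chosen for illustration, since the text names neither a concrete log format nor a database. The splitting is done here with a regular expression before insertion rather than inside SQL.

```python
import re
import sqlite3
from typing import Dict, Iterable, List, Optional, Tuple

# Assumed layout: "<date> <time> <pid> <LEVEL> <module> [<separator>] <message>"
LINE_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}(?:\.\d+)?)\s+"
    r"(?P<pid>\d+)\s+(?P<level>[A-Z]+)\s+(?P<module>\S+)\s+"
    r"(?P<separator>\[[^\]]*\])\s+(?P<message>.*)$"
)


def parse_line(line: str) -> Optional[Dict[str, object]]:
    """Split one log line into its fields, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record: Dict[str, object] = dict(match.groupdict())
    record["pid"] = int(match.group("pid"))
    return record


def store(conn: sqlite3.Connection, records: Iterable[Dict[str, object]]) -> None:
    """Persist parsed records in a table with one column per field."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS log_events ("
        "timestamp TEXT, pid INTEGER, level TEXT, "
        "module TEXT, separator TEXT, message TEXT)"
    )
    conn.executemany(
        "INSERT INTO log_events VALUES "
        "(:timestamp, :pid, :level, :module, :separator, :message)",
        list(records),
    )
    conn.commit()


def error_events(conn: sqlite3.Connection) -> List[Tuple[str, str]]:
    """Select timestamp and message of ERROR-level records, oldest first."""
    cursor = conn.execute(
        "SELECT timestamp, message FROM log_events "
        "WHERE level = 'ERROR' ORDER BY timestamp"
    )
    return cursor.fetchall()
```

The `error_events` query is the kind of statement the later error-extraction step relies on: it keeps only the timestamp and the text message of ERROR records.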
2. Log fault analysis
A probabilistic causal relationship is established between fault symptoms and system states. The hidden semi-Markov model and the Bayesian network are trained with the prior probabilities of fault occurrence; during diagnosis, the posterior probabilities of the various system states given the fault symptoms are obtained from the priors, the joint probability distribution of the variables is expressed directly, and the probability that each feature causes a fault is computed.
Step 1: extract the log fault sequences.
First, extract the error event sequence: use SQL statements to extract the records at the ERROR level according to the log level, keeping information such as the timestamp and the text message;
Second, merge similar error events: apply the Levenshtein edit distance algorithm to the event sequence from the previous step and merge error events that are highly similar;
The algorithm uses dynamic programming; the problem has optimal substructure, because the minimum edit distance contains the minimum edit distances of its subproblems:
$$d[i,j] = \min\bigl\{\, d[i-1,j] + 1,\; d[i,j-1] + 1,\; d[i-1,j-1] + c(x_i, y_j) \,\bigr\}, \qquad c(x_i, y_j) = \begin{cases} 0 & x_i = y_j \\ 1 & x_i \neq y_j \end{cases}$$
where d[i-1, j] + 1 corresponds to inserting a character into the target log message and d[i, j-1] + 1 to deleting a character from the matched log message; when x_i = y_j no modification is needed, so the cost equals that of the previous step d[i-1, j-1], otherwise it is d[i-1, j-1] + 1; d[i, j] is the smallest of these three terms;
Third, classify the error events: after the error events have been merged in the previous step, group similar error events by the keywords in their text, assign each group an ID, and save it in the database;
Fourth, extract the sequences: in chronological order, extract the events within a period of time before a failure occurs and take them as the fault-related event sequence; a failure lead time is defined, and the current failure event is the associated failure event. A non-fault-related event sequence is a sequence of events within a time interval in which the system did not fail, as shown in Fig. 4.
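The fourth step above can be sketched as a window query over time-ordered events. The symbols for the window length and the lead time are not reproduced in the source text, so this sketch makes an explicit assumption: the data window of length `data_window` ends `lead_time` before the failure, which leaves the lead time available for reacting to a warning. The function name and argument names are hypothetical.

```python
from bisect import bisect_left, bisect_right
from typing import List, Tuple

Event = Tuple[float, str]  # (timestamp, error event ID)


def failure_sequence(events: List[Event], failure_time: float,
                     data_window: float, lead_time: float) -> List[Event]:
    """Events in [failure_time - lead_time - data_window, failure_time - lead_time].

    `events` must be sorted by timestamp.
    """
    end = failure_time - lead_time
    start = end - data_window
    times = [timestamp for timestamp, _ in events]
    low = bisect_left(times, start)   # first event at or after the window start
    high = bisect_right(times, end)   # one past the last event at or before the end
    return events[low:high]
```

Non-fault-related sequences can be drawn with the same function by passing a reference time from an interval in which no failure occurred.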
Step 2: cluster the fault-related events.
In practice, several different fault-related event sequences may lead to the same system fault, and these sequences have different characteristics, so they need to be clustered.
The clustering criterion uses the likelihood of a sequence as the measure, and a hierarchical clustering algorithm is then used to group the fault-related events, where:
$S=[s_i]$ denotes a state sequence of length $L$, $\pi=[\pi_i]$ is the initial state probability vector, and $b_{s_i}(o_i)$ is the probability of the observation $o_i$ in state $s_i$.
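A likelihood-driven hierarchical clustering can be sketched as below. Several points are assumptions: the input is a matrix `loglik[i][j]` holding the log-likelihood of sequence `i` under a model fitted to sequence `j`, the dissimilarity is the negated symmetric average of those values, and average linkage is used. The numbers in the example are made up for illustration.

```python
def dissimilarity_matrix(loglik):
    """Symmetric dissimilarity from cross log-likelihoods (assumed form)."""
    n = len(loglik)
    return [
        [0.0 if i == j else -(loglik[i][j] + loglik[j][i]) / 2.0 for j in range(n)]
        for i in range(n)
    ]


def agglomerative(dist, num_clusters):
    """Average-linkage agglomerative clustering on a dissimilarity matrix."""
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Mean pairwise dissimilarity between members of the two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > max(num_clusters, 1):
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                value = linkage(clusters[p], clusters[q])
                if best is None or value < best[0]:
                    best = (value, p, q)
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]  # merge the closest pair
        del clusters[q]
    return [sorted(c) for c in clusters]


# Made-up cross log-likelihoods for four sequences.
loglik = [
    [-1.0, -2.0, -9.0, -8.0],
    [-2.5, -1.0, -9.5, -8.5],
    [-9.0, -10.0, -1.0, -1.5],
    [-8.0, -9.0, -2.0, -1.0],
]
groups = agglomerative(dissimilarity_matrix(loglik), num_clusters=2)
```

Sequences that explain each other well (high mutual likelihood) end up in the same group, so each group can later be given its own model.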
Step 3: train and build the prediction model.
The prediction model is the key to network fault prediction, and the constructed features directly affect its performance. Here a hidden semi-Markov model (HSMM) is combined with a Bayesian network (Bayes Net) to predict faults from real-time log data.
A standard HSMM is defined by the state transition probability matrix $G(t)=[g_{ij}(t)]$, the initial state probability vector $\pi=[\pi_i]$, and the observation probability matrix $B=[b_i(k)]$, where $b_i(k)$ is the probability of the observation $k$ in state $s_i$.
The HSMM is improved in the following respects. First, the state duration probability distribution is made continuous: the state duration is treated as a continuous variable and assumed to follow a Weibull distribution, so the state duration probability distribution $f_i(l)$ of a state is:
$$f_i(l)=\alpha\beta(\alpha l)^{\beta-1}e^{-(\alpha l)^{\beta}};$$
where $\alpha$ and $\beta$ are the scale parameter and the shape parameter of the Weibull distribution, respectively;
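The density above can be evaluated directly. The sketch below implements it as written, together with the matching cumulative distribution $1-e^{-(\alpha l)^{\beta}}$, whose derivative is the density; the function names and the parameter values in the check are illustrative assumptions.

```python
import math


def weibull_duration_pdf(l, alpha, beta):
    """f(l) = alpha * beta * (alpha * l)**(beta - 1) * exp(-(alpha * l)**beta).
    The value at l <= 0 is set to 0 for simplicity."""
    if l <= 0:
        return 0.0
    return alpha * beta * (alpha * l) ** (beta - 1) * math.exp(-((alpha * l) ** beta))


def weibull_duration_cdf(l, alpha, beta):
    """Probability that the state duration is at most l."""
    if l <= 0:
        return 0.0
    return 1.0 - math.exp(-((alpha * l) ** beta))


def integrate_pdf(upper, alpha, beta, steps=4000):
    """Midpoint-rule integral of the density over [0, upper]."""
    width = upper / steps
    return sum(
        weibull_duration_pdf((k + 0.5) * width, alpha, beta) for k in range(steps)
    ) * width


area = integrate_pdf(4.0, alpha=0.5, beta=1.5)
```

With $\beta=1$ the density reduces to the exponential distribution $\alpha e^{-\alpha l}$, which is a quick sanity check on the formula.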
Second, the probability distribution of the state monitoring values is made continuous: it is likewise assumed to follow a Weibull distribution, with probability distribution function $\xi_i(\theta)$, where $\alpha_i$ and $\beta_i$ are the Weibull parameters of each state stage. The improved HSMM is therefore characterised by Weibull-distributed state durations and Weibull-distributed monitoring values.
Step 4: fault prediction.
Assume that the fault model and the non-fault model have been trained, i.e. their parameters are known. The goal is to evaluate whether a given observation sequence (error sequence) $O=[o_1,o_2,\ldots,o_l]$ is a fault-related sequence. The sequence likelihood under each model is computed first, and the sequence is then classified as fault-prone or fault-free using Bayesian decision theory.
Step 5: predict the fault outcome.
When the decision inequality holds, the sequence is labelled as a fault-related event sequence and the system issues a fault prediction. The inequality involves the cost of wrongly judging a fault-related sequence to be fault-unrelated, the probability of a fault $P(F)$, and the logarithm of the sequence likelihood; taking the logarithm prevents numerical underflow when the sequence likelihood is very small. In this way each sequence can be judged and a fault prediction made.
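A decision of this kind can be sketched as a cost-weighted log-likelihood ratio test. The exact inequality used in the patent is not reproduced in the text, so the rule below is an assumption: it is the textbook minimum-risk Bayes rule with zero cost for correct decisions, and the names `cost_miss` and `cost_false_alarm` are hypothetical.

```python
import math


def predict_failure(loglik_failure, loglik_nonfailure, prior_failure,
                    cost_miss, cost_false_alarm):
    """Return True when the sequence is judged fault-related.

    Assumed rule, in logarithms:
      cost_miss * P(F) * P(O|F) > cost_false_alarm * P(not F) * P(O|not F)
    """
    if not 0.0 < prior_failure < 1.0:
        raise ValueError("prior_failure must lie strictly between 0 and 1")
    if cost_miss <= 0 or cost_false_alarm <= 0:
        raise ValueError("costs must be positive")
    threshold = (
        math.log(cost_false_alarm / cost_miss)
        + math.log((1.0 - prior_failure) / prior_failure)
    )
    return loglik_failure - loglik_nonfailure > threshold


# Log-likelihoods stay finite where raw likelihoods would underflow to zero.
raw_likelihood = math.exp(-1200.0)

warn = predict_failure(-120.0, -125.0, prior_failure=0.05,
                       cost_miss=10.0, cost_false_alarm=1.0)
quiet = predict_failure(-125.0, -124.0, prior_failure=0.05,
                        cost_miss=10.0, cost_false_alarm=1.0)
```

Raising `cost_miss` relative to `cost_false_alarm` lowers the threshold, so the system warns more readily at the price of more false alarms.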
The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (9)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711333074.7A CN108038049B (en) | 2017-12-13 | 2017-12-13 | Real-time log control system and control method, cloud computing system and server |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108038049A (en) | 2018-05-15 |
CN108038049B (en) | 2021-11-09 |
Family
ID=62102328
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711333074.7A (Active, CN108038049B (en)) | Real-time log control system and control method, cloud computing system and server | 2017-12-13 | 2017-12-13 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108038049B (en) |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080004904A1 (en) * | 2006-06-30 | 2008-01-03 | Tran Bao Q | Systems and methods for providing interoperability among healthcare devices |
CN102968556A (en) * | 2012-11-08 | 2013-03-13 | 重庆大学 | Probability distribution-based distribution network reliability judgment method |
CN103761173A (en) * | 2013-12-28 | 2014-04-30 | 华中科技大学 | Log based computer system fault diagnosis method and device |
CN103825272A (en) * | 2014-03-18 | 2014-05-28 | 国家电网公司 | Reliability determination method for power distribution network with distributed wind power based on analytical method |
CN104361169A (en) * | 2014-11-12 | 2015-02-18 | 武汉科技大学 | Method for monitoring reliability of modeling based on decomposition method |
CN104537487A (en) * | 2014-12-25 | 2015-04-22 | 云南电网公司电力科学研究院 | Assessment method of operating dynamic risk of electric transmission and transformation equipment |
CN104778370A (en) * | 2015-04-20 | 2015-07-15 | 北京交通大学 | Risk analyzing method based on Monte-Carlo simulation solution dynamic fault tree model |
CN105095918A (en) * | 2015-09-07 | 2015-11-25 | 上海交通大学 | Multi-robot system fault diagnosis method |
CN105653444A (en) * | 2015-12-23 | 2016-06-08 | 北京大学 | Internet log data-based software defect failure recognition method and system |
CN105893208A (en) * | 2016-03-31 | 2016-08-24 | 城云科技(杭州)有限公司 | Cloud computing platform system fault prediction method based on hidden semi-Markov models |
CN107423205A (en) * | 2017-07-11 | 2017-12-01 | 北京明朝万达科技股份有限公司 | A kind of system failure method for early warning and system for anti-data-leakage system |
Non-Patent Citations (1)
Title |
---|
FELIX SALFNER 等: "Using Hidden Semi-Markov Models for Effective Online Failure Prediction", 《26TH IEEE INTERNATIONAL SYMPOSIUM ON RELIABLE DISTRIBUTED SYSTEMS》 * |
Also Published As
Publication number | Publication date |
---|---|
CN108038049B (en) | 2021-11-09 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||