CN108038049A

CN108038049A - Real-time logs control system and control method, cloud computing system and server

Info

Publication number: CN108038049A
Application number: CN201711333074.7A
Authority: CN
Inventors: 裴庆祺; 赵伟伟; 王磊
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2017-12-13
Filing date: 2017-12-13
Publication date: 2018-05-15
Anticipated expiration: 2037-12-13
Also published as: CN108038049B

Abstract

The invention belongs to the technical field of cloud computing and discloses a real-time log control system and control method, a cloud computing system and a server. By analyzing logged events, the method classifies, filters and aggregates error information and extracts it into sequences. It then trains fault models, calculates the probability that a sequence is a fault sequence and the probability that it is a non-fault sequence, and uses Bayesian classification theory to obtain a result and make a prediction. Compared with matching against a large number of rules, this improves the speed of judgment. Fault prediction research is of great significance for reducing the burden of network management and maintenance and the losses caused by network faults.

Description

Real-time log control system and control method, cloud computing system and server

Technical field

The invention belongs to the technical field of cloud computing, and in particular relates to a real-time log control system and control method, a cloud computing system and a server.

Background art

With the rapid development of computer technology, cloud computing has become one of the most important fields of computing, and cloud computing services have become part of everyone's life and work. By computing over real-time data with machine learning algorithms, faults that may occur in a cloud computing system can be predicted in advance, leaving time to respond to them; the processing capacity of the cluster can also be scaled out elastically to cope with growing data volumes and user demand. Processing massive log data in real time and mining it for system status and fault prediction therefore has good development and application prospects.

In summary, the problems in the prior art are as follows. In existing fault prediction models, on the one hand, the state duration distribution is mostly assumed to be exponential by default, whereas in practice the state probabilities of faults do not change exponentially. On the other hand, the probabilities of the fault state observation values are discretized, which has unexpected effects on experimental analysis in a big data environment. The present disclosure therefore treats both the state duration distribution and the state observation probability distribution as continuous, assuming a Weibull distribution for each; the improved prediction model can raise the probability values of diagnosis and prediction.

Summary of the invention

In view of the problems in the prior art, the present invention provides a real-time log control system and control method, a cloud computing system and a server.

The present invention is implemented as a real-time log control method that, by analyzing logged events, classifies, filters and aggregates error information and extracts it into sequences, trains fault models, calculates the probability that a sequence is a fault sequence and the probability that it is a non-fault sequence, and uses Bayesian classification theory to obtain a result and make a prediction.

Further, the real-time log control method specifically includes:

Step 1: collect the log file data on each node of the distributed system, and send newly generated log data to the collector in real time by means of incremental checks;

Step 2: remove redundant events by deleting events of the same type reported at the same location within a certain period, where a time threshold defines the time window used for event filtering; remove redundant events from the log by deleting similar events reported by several different locations within a certain period, and save the data stream to a time-series database; similarity is judged with Sim(D₁, D₂):

$$\mathrm{Sim}(D_1, D_2) = \cos\theta = \frac{\sum_{k=1}^{n} W_{1k} W_{2k}}{\sqrt{\left(\sum_{k=1}^{n} W_{1k}^{2}\right)\left(\sum_{k=1}^{n} W_{2k}^{2}\right)}}$$

where D₁ and D₂ denote two sequences and W_1k and W_2k denote the components of the vectors for D₁ and D₂; the similarity is the cosine of the angle between the two vectors, and the larger Sim(D₁, D₂) is, the more similar the two sequences are;

Step 3: when each record is stored in the data table, use SQL statements to split the record into timestamp, process ID, log level, process module, separator and message;

Step 4: use SQL statements to persist the processed, standard-formatted data;

Step 5: extract the log fault sequences;

Step 6: compute the clustering criterion using the likelihood value of each sequence as the metric, and group fault-related events with a hierarchical clustering algorithm; in the sequence likelihood,

S = [s_i] denotes a state sequence of length L, and b_si(o_i) is the probability matrix of the observation values in state s_i(k) under the initial state probability vector π = [π_i];

Step 7: combine an improved HSMM with a Bayesian network (BayesNet) to predict faults from real-time log data;

A standard HSMM is defined by the state transition probability matrix G(t) = [g_ij(t)], the initial state probability vector π = [π_i], and the probability matrix B = [b_i(k)] of the observation values of state s_i(k). The state duration probability distribution is made continuous: the state duration is treated as a continuous variable and assumed to follow a Weibull distribution, so the state duration probability distribution f_i(l) of a state is:

$$f_i(l) = \alpha\beta(\alpha l)^{\beta-1} e^{-(\alpha l)^{\beta}}$$

where α and β are the scale parameter and the shape parameter of the Weibull distribution, respectively;

The probability distribution of the state observation values is also made continuous and is likewise assumed to follow a Weibull distribution, with probability distribution function ξ_i(θ) of the state observation values,

where α_i and β_i are the Weibull parameters of each state stage; the improved HSMM is thus characterized by these Weibull-distributed durations and observation values;

Step 8: train the fault model and the non-fault model to obtain their parameters; the goal is to evaluate whether a given observation sequence O = [o₁, o₂, ..., o_l] is a fault-related sequence; the sequence likelihood is computed under each model, and the sequence is then classified as fault-free or fault-prone using Bayesian decision theory;

Step 9: preliminary judgment of the fault result:

when the decision condition holds, a sequence is marked as a fault-related event sequence and the system issues a fault prediction; the condition involves the cost of wrongly judging a fault-related sequence as fault-unrelated, the fault probability P(F), and the logarithm of the sequence likelihood.

Further, extracting the log fault sequences specifically includes:

First, extract the error event sequence: use SQL statements to extract the ERROR-level records according to log level, keeping the timestamp and text message;

Second, merge similar error events: apply the Levenshtein edit distance algorithm to the event sequence and merge error events that are highly similar; the minimum edit distance contains the minimum edit distances of its subproblems:

$$d_{i,j} = \min\left(d_{i-1,j} + 1,\; d_{i,j-1} + 1,\; d_{i-1,j-1} + \delta_{ij}\right), \qquad \delta_{ij} = \begin{cases} 0, & x_i = y_j \\ 1, & x_i \neq y_j \end{cases}$$

where d_{i-1,j} + 1 corresponds to inserting a character into the target log and d_{i,j-1} + 1 corresponds to deleting a character from the matched log; when x_i = y_j no modification is needed, so the cost equals that of the previous step, d_{i-1,j-1}, and otherwise it is d_{i-1,j-1} + 1; d_{i,j} is the smallest of these three values;

Third, classify the error events: after the error events have been merged in the previous step, group similar error events according to the keywords in their text, assign each group an ID, and save the result in the database;

Fourth, extract the sequences: in chronological order, extract the events within a period before a fault occurs and set them as the fault-related event sequence; a fault lead time is defined, and the current fault event is the associated fault event; a non-fault-related event sequence is a sequence of events in a time interval during which the system did not fail.

Another object of the present invention is to provide a real-time log control system for the real-time log control method, the real-time log control system including a log information processing module and a log fault analysis module.

Further, the log information processing module includes:

a log information collection unit, for collecting the log file data on each node of the distributed system; the log collection function should allow the log files to be monitored to be customized, and sends newly generated log data to the collector in real time by means of incremental checks;

a log information filtering unit, for removing redundancy from the data and filtering it;

a log information standard formatting unit, for putting the processed log information into a standard data format;

a log storage unit, for persisting the processed, standard-formatted data.

The log fault analysis module includes:

a log event sequence extraction unit;

a fault-related event clustering unit, for training a small hidden semi-Markov model in advance from the events and computing the sequence likelihood;

a fault prediction unit, which uses the hidden semi-Markov model and Bayesian classification theory to determine whether a sequence is a fault-related sequence;

a fault result judgment and output unit: when a sequence is judged to be fault-related, the system emits a fault warning stream and outputs a status fault warning.

The log event sequence extraction unit further includes:

an error event record extraction unit, which extracts the ERROR-level records according to log level, keeping the timestamp, process module and text message;

a similar error event merging unit, which applies the Levenshtein edit distance algorithm to the error event sequence and merges error events that are highly similar;

an error event classification unit, which applies the Levenshtein edit distance algorithm to the event sequence, groups similar error events into classes and assigns each class an ID;

a fault-related sequence extraction unit, which extracts, in chronological order, the events within a period before a fault and sets them as the pre-fault events.

Another object of the present invention is to provide a cloud computing system using the real-time log control method.

Current fault prediction research falls mainly into three classes of methods: fault detection models based on log frequency, fault detection models based on message frequency, and fault detection models based on state transitions.

The present invention collects log information in real time while the system is running and clusters it. By analyzing event logs with machine learning algorithms and models, it predicts faults that may occur in the system in the future, and checks for and locates system faults ahead of time during operation, which improves operation and maintenance efficiency and helps prevent emergency failures. By analyzing logged events, the invention classifies, filters and aggregates all error information, extracts it into sequences, trains fault models, calculates the probability that a sequence is a fault sequence and the probability that it is a non-fault sequence, and uses Bayesian classification theory to obtain a result and make a prediction.

The effectiveness of the method is judged mainly by three parameters: precision, recall and F-measure. Precision is the proportion of all fault predictions that are correct, recall is the proportion of all faults that are correctly predicted, and F-measure is a combined measure of precision and recall;

The prediction outcomes are shown in Table 1:

| Predicted \ Actual | System fault | System normal |
|---|---|---|
| System fault | True Positive (TP) | False Positive (FP) |
| System normal | False Negative (FN) | True Negative (TN) |

Table 1. Prediction outcomes

The prediction effectiveness parameters are shown in Table 2:

Table 2. Expressions for the effectiveness parameters
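The three effectiveness parameters can be computed directly from the counts in Table 1. The sketch below assumes the standard definitions of precision, recall and the balanced F-measure (harmonic mean), which match the verbal descriptions given above; the function name `prediction_metrics` is illustrative only.

```python
def prediction_metrics(tp, fp, fn):
    """Return (precision, recall, f_measure) from confusion-matrix counts.

    tp: faults that were predicted (True Positive)
    fp: fault predictions with no actual fault (False Positive)
    fn: faults that were not predicted (False Negative)
    """
    # Precision: share of fault predictions that were correct.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: share of actual faults that were predicted.
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F-measure: harmonic mean of precision and recall (assumed balanced).
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return precision, recall, f_measure
```

For instance, 8 true positives, 2 false positives and 8 false negatives give a precision of 0.8 and a recall of 0.5.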

The system experiments lead to the conclusion that the improved system is more precise than the version before the improvement.

Description of the drawings

Fig. 1 is a schematic structural diagram of the real-time log control system provided by an embodiment of the present invention;

In the figure: 1, log information processing module; 1-1, log information collection unit; 1-2, log information filtering unit; 1-3, log information standard formatting unit; 1-4, log storage unit; 2, log fault analysis module; 2-1, log event sequence extraction unit; 2-1-1, error event record extraction unit; 2-1-2, similar error event merging unit; 2-1-3, error event classification unit; 2-1-4, fault-related sequence extraction unit; 2-2, fault-related event clustering unit; 2-3, fault prediction unit; 2-4, fault result judgment and output unit.

Fig. 2 is a flowchart of the real-time log control method provided by an embodiment of the present invention.

Fig. 3 is an implementation flowchart of the real-time log control method provided by an embodiment of the present invention.

Fig. 4 is a schematic diagram of fault sequence extraction provided by an embodiment of the present invention.

Detailed description

To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to embodiments. It should be understood that the specific embodiments described here only explain the present invention and do not limit it.

The application principle of the present invention is described in detail below with reference to the accompanying drawings.

As shown in Fig. 1, the real-time log control system provided by the embodiment of the present invention includes a log information processing module 1 and a log fault analysis module 2.

Log information processing module 1 includes:

Log information collection unit 1-1: collects the log file data on each node of the distributed system. The log collection function should allow the log files to be monitored to be customized, and sends newly generated log data to the collector in real time by means of incremental checks.

Log information filtering unit 1-2: removes redundancy from the data and filters it.

Log information standard formatting unit 1-3: puts the processed log information into a standard data format, for example: timestamp, process ID, log level, process module, separator, message. Log levels fall into several classes: ERROR, WARNING, TRACE, INFO, DEBUG, CRITICAL and AUDIT. The earlier a level appears in this list, the higher it ranks, and a higher rank means a more important event.

Log storage unit 1-4: persists the processed, standard-formatted data so that it can be extracted and analyzed later.

Log fault analysis module 2 includes:

Log event sequence extraction unit 2-1;

Fault-related event clustering unit 2-2: trains a small hidden semi-Markov model (HSMM) in advance from the events and computes the sequence likelihood, i.e. the probability that the trained model generates the given observation sequence;

Fault prediction unit 2-3: uses the hidden semi-Markov model and Bayesian classification theory to determine whether a sequence is a fault-related sequence;

Fault result judgment and output unit 2-4: when a sequence is judged to be fault-related, the system emits a fault warning stream and outputs a status fault warning.

Log event sequence extraction unit 2-1 further includes:

Error event record extraction unit 2-1-1: extracts the ERROR-level records according to log level, keeping information such as the timestamp, process module and text message;

Similar error event merging unit 2-1-2: applies the Levenshtein edit distance algorithm to the error event sequence and merges error events that are highly similar;

Error event classification unit 2-1-3: applies the Levenshtein edit distance algorithm to the event sequence, groups similar error events into classes and assigns each class an ID;

Fault-related sequence extraction unit 2-1-4: extracts, in chronological order, the events within a period before a fault and sets them as the pre-fault events.

As shown in Fig. 2, the real-time log control method provided by the embodiment of the present invention includes the following steps:

S201: by analyzing logged events, classify, filter and aggregate all error information and extract it into sequences;

S202: train the fault models, calculate the probability that a sequence is a fault sequence and the probability that it is a non-fault sequence, and use Bayesian classification theory to obtain a result and make a prediction.

The application principle of the present invention is further described below with reference to the accompanying drawings.

Rather than matching against a large number of rules built from fault keywords, the present invention uses an improved HSMM (hidden semi-Markov model) and Bayes decision theory (Bayesian classification theory) to calculate directly the probability that an error sequence is a fault sequence, which speeds up the judgment.

As shown in Fig. 3, the specific steps of the real-time log control method provided by the embodiment of the present invention are as follows:

1. Log information processing

Step 1, log information collection

The system should be able to collect the log file data on each node of the distributed system. The log collection function should allow the log files to be monitored to be customized and, by means of incremental checks, send newly generated log data to the collector in real time.
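One way to picture the incremental check is a reader that remembers a byte offset per monitored file and forwards only what was appended since the last poll. The following is a minimal sketch under that assumption; the function name `read_new_lines` and the offset-based approach are illustrative and are not taken from the patent.

```python
import os


def read_new_lines(path, offset):
    """Return (lines, new_offset) for data appended to `path` since `offset`.

    `offset` is the byte position reached by the previous call (0 at first).
    """
    size = os.path.getsize(path)
    if size < offset:
        # The file shrank, e.g. after log rotation: start from the beginning.
        offset = 0
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    # Forward only complete lines; a trailing partial line waits for the
    # next poll, when the writer has finished it.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8", errors="replace").splitlines()
    return lines, offset + end
```

A collector agent would call this periodically for each configured file, send the returned lines to the collector, and keep the new offset for the next poll.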

Step 2, log information filtering

Two methods are used: temporal filtering and spatial filtering. When the system detects an anomaly, it keeps emitting a stream of warning messages until the failure occurs. Likewise, once the system has failed, the failure message may recur many times in the log until the problem is resolved.

Temporal filtering removes redundant events by deleting events of the same type reported at the same location within a certain period; a time threshold defines the time window used for event filtering. Spatial filtering removes redundant events from the log by deleting similar events reported by several different locations within a certain period; the data stream is then saved to a time-series database, which saves space and improves efficiency. Similarity is usually judged with Sim(D₁, D₂):

$$\mathrm{Sim}(D_1, D_2) = \cos\theta = \frac{\sum_{k=1}^{n} W_{1k} W_{2k}}{\sqrt{\left(\sum_{k=1}^{n} W_{1k}^{2}\right)\left(\sum_{k=1}^{n} W_{2k}^{2}\right)}}$$

where D₁ and D₂ denote two sequences and W_1k and W_2k denote the components of the vectors for D₁ and D₂. The similarity is the cosine of the angle between the two vectors, and the larger Sim(D₁, D₂) is, the more similar the two sequences are.
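The two filters can be sketched together as follows. This is an illustration under several assumptions that the text does not fix: messages are turned into term-frequency vectors by whitespace splitting, "same type" is taken to mean an identical message, and the names `filter_events`, `window` and `sim_threshold` are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(d1, d2):
    """Cosine of the angle between the term-frequency vectors of two messages."""
    w1, w2 = Counter(d1.split()), Counter(d2.split())
    dot = sum(w1[t] * w2[t] for t in w1)
    norm = math.sqrt(sum(v * v for v in w1.values()) *
                     sum(v * v for v in w2.values()))
    return dot / norm if norm else 0.0


def filter_events(events, window, sim_threshold):
    """Drop temporally and spatially redundant events.

    `events` is a list of (timestamp, location, message), sorted by timestamp.
    """
    kept = []
    for ts, loc, msg in events:
        redundant = False
        for kts, kloc, kmsg in reversed(kept):
            if ts - kts > window:
                break  # earlier kept events lie outside the time window
            # Temporal filtering: same event repeated at the same location.
            same_place_repeat = kloc == loc and kmsg == msg
            # Spatial filtering: similar event reported by another location.
            similar_elsewhere = (kloc != loc and
                                 cosine_similarity(kmsg, msg) >= sim_threshold)
            if same_place_repeat or similar_elsewhere:
                redundant = True
                break
        if not redundant:
            kept.append((ts, loc, msg))
    return kept
```

The first report of an event is kept and later reports inside the window are dropped, so a burst of warnings collapses to a single record before it reaches the time-series store.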

Step 3, log format standardization.

When each record is stored in the data table, SQL statements split it into fields such as timestamp, process ID, log level, process module, separator and message.

Step 4, log storage.

SQL statements persist the processed, standard-formatted data so that it can be extracted and analyzed later.
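Steps 3 and 4 can be illustrated with a small sketch. Note the swap: the text splits records with SQL statements, whereas this example splits each line with a regular expression in Python and uses SQL only for storage. The line layout, the table name `logs` and its column names are assumptions made for the example, and SQLite stands in for whatever database is actually used.

```python
import re
import sqlite3

# Assumed layout: "<date> <time> <pid> <LEVEL> <module> [<separator>] <message>"
LINE_RE = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}(?:\.\d+)?)\s+"
    r"(?P<pid>\d+)\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<module>\S+)\s+"
    r"(?P<separator>\[[^\]]*\])\s+"
    r"(?P<message>.*)"
)


def parse_line(line):
    """Split one raw log line into its fields; return None if it does not match."""
    m = LINE_RE.match(line)
    return m.groupdict() if m else None


def store(conn, records):
    """Persist parsed records in a `logs` table (created on first use)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS logs ("
        "ts TEXT, pid INTEGER, level TEXT, module TEXT, sep TEXT, message TEXT)"
    )
    conn.executemany(
        "INSERT INTO logs (ts, pid, level, module, sep, message) VALUES "
        "(:timestamp, :pid, :level, :module, :separator, :message)",
        records,
    )
    conn.commit()
```

Once the records are stored this way, a query such as `SELECT ts, message FROM logs WHERE level = 'ERROR' ORDER BY ts` retrieves the error events used in the later analysis steps.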

2. Log fault analysis:

A probabilistic causal relationship is established between fault symptoms and system states. The hidden semi-Markov model and the Bayesian network are trained using the prior probabilities of fault occurrence; during diagnosis, the posterior probabilities of the various system states given the fault symptoms are computed from the prior probabilities. This expresses the joint probability distribution of the variables intuitively and also gives the probability that each feature causes a fault.

Step 1, extract the log fault sequences.

First, extract the error event sequence: use SQL statements to extract the ERROR-level records according to log level, keeping information such as the timestamp and text message;

Second, merge similar error events: apply the Levenshtein edit distance algorithm to the event sequence from the previous step and merge error events that are highly similar;

The algorithm uses dynamic programming: the problem has optimal substructure, since the minimum edit distance contains the minimum edit distances of its subproblems:

$$d_{i,j} = \min\left(d_{i-1,j} + 1,\; d_{i,j-1} + 1,\; d_{i-1,j-1} + \delta_{ij}\right), \qquad \delta_{ij} = \begin{cases} 0, & x_i = y_j \\ 1, & x_i \neq y_j \end{cases}$$

where d_{i-1,j} + 1 corresponds to inserting a character into the target log and d_{i,j-1} + 1 corresponds to deleting a character from the matched log; when x_i = y_j no modification is needed, so the cost equals that of the previous step, d_{i-1,j-1}, and otherwise it is d_{i-1,j-1} + 1; d_{i,j} is the smallest of these three values;
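The recurrence described above translates directly into a short dynamic-programming routine. The merging step that follows it is only a sketch: normalizing the distance by the longer message length and comparing against a `threshold` are assumptions, as are the function names.

```python
def levenshtein(x, y):
    """Edit distance between strings x and y, computed row by row."""
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, xc in enumerate(x, start=1):
        cur = [i]
        for j, yc in enumerate(y, start=1):
            cur.append(min(
                prev[j] + 1,               # d[i-1][j] + 1
                cur[j - 1] + 1,            # d[i][j-1] + 1
                prev[j - 1] + (xc != yc),  # d[i-1][j-1], +1 on mismatch
            ))
        prev = cur
    return prev[-1]


def similarity(x, y):
    """Map the edit distance to [0, 1]; 1.0 means identical messages."""
    longest = max(len(x), len(y))
    return 1.0 if longest == 0 else 1.0 - levenshtein(x, y) / longest


def merge_similar(messages, threshold):
    """Assign each message to the first representative it resembles."""
    representatives, labels = [], []
    for msg in messages:
        for idx, rep in enumerate(representatives):
            if similarity(msg, rep) >= threshold:
                labels.append(idx)
                break
        else:
            # No representative is close enough: start a new group.
            representatives.append(msg)
            labels.append(len(representatives) - 1)
    return representatives, labels
```

The returned labels can serve as provisional event IDs, so that messages differing only in a device name or a request number map to the same event type.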

Fourth, extract the sequences: in chronological order, extract the events within a period before a fault occurs and set them as the fault-related event sequence; a fault lead time is defined, and the current fault event is the associated fault event. A non-fault-related event sequence is a sequence of events in a time interval during which the system did not fail, as shown in Fig. 4:
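A sketch of this extraction step is given below. It assumes one particular reading of the text: a data window of length `window` that ends `lead_time` before each fault, and non-fault sequences taken from consecutive windows that neither contain a fault nor overlap a fault window. These parameter names and the tiling strategy are assumptions made for the example.

```python
def extract_sequences(events, failure_times, window, lead_time=0.0):
    """Split an event stream into fault-related and non-fault sequences.

    events: list of (timestamp, event_id), sorted by timestamp.
    failure_times: timestamps at which faults occurred.
    """
    fault_seqs = []
    for ft in failure_times:
        # Data window [ft - lead_time - window, ft - lead_time) before the fault.
        start, end = ft - lead_time - window, ft - lead_time
        fault_seqs.append([eid for ts, eid in events if start <= ts < end])

    normal_seqs = []
    if events:
        t, last = events[0][0], events[-1][0]
        while t + window <= last:
            # Skip intervals that overlap a fault window or contain a fault.
            near_fault = any(
                ft - lead_time - window < t + window and t <= ft
                for ft in failure_times
            )
            if not near_fault:
                seq = [eid for ts, eid in events if t <= ts < t + window]
                if seq:
                    normal_seqs.append(seq)
            t += window
    return fault_seqs, normal_seqs
```

The two lists then serve as training data for the fault model and the non-fault model respectively.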

Step 2, clustering of fault-related events.

In practice, several different fault-related event sequences may lead to the same system fault, and these sequences have different characteristics, so clustering is needed.

The clustering criterion uses the likelihood value of each sequence as the metric, and a hierarchical clustering algorithm then groups the fault-related events. In the sequence likelihood,

S = [s_i] denotes a state sequence of length L, and b_si(o_i) is the probability matrix of the observation values in state s_i(k) under the initial state probability vector π = [π_i].
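The grouping step can be sketched as agglomerative clustering over a dissimilarity matrix derived from sequence log-likelihoods. Everything specific here is an assumption: `loglik[i][j]` is taken to be a precomputed log-likelihood of sequence j under a small model trained on sequence i, the dissimilarity is the magnitude of the symmetrized log-likelihood, and average linkage with a `max_distance` cut-off stands in for whichever hierarchical variant is actually used.

```python
def dissimilarity_matrix(loglik):
    """Symmetric dissimilarity from pairwise sequence log-likelihoods."""
    n = len(loglik)
    return [
        [0.0 if i == j else abs(loglik[i][j] + loglik[j][i]) / 2.0
         for j in range(n)]
        for i in range(n)
    ]


def agglomerative(dist, max_distance):
    """Average-linkage clustering; stop when the closest pair is too far apart."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Mean pairwise distance between the members of two clusters.
                d = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > max_distance:
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

Each resulting cluster collects fault-related sequences that a common model explains well, so a separate fault model can be trained per cluster.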

Step 3, train and build the prediction model.

The prediction model is the key to network fault prediction, and the features constructed directly affect its performance. Here a hidden semi-Markov model (HSMM) is combined with a Bayesian network (BayesNet) to predict faults from real-time log data.

A standard HSMM is defined by the state transition probability matrix G(t) = [g_ij(t)], the initial state probability vector π = [π_i], and the probability matrix B = [b_i(k)] of the observation values of state s_i(k).

The improvements to the HSMM are as follows. The state duration probability distribution is made continuous: the state duration is treated as a continuous variable and assumed to follow a Weibull distribution, so the state duration probability distribution f_i(l) of a state is:

$$f_i(l) = \alpha\beta(\alpha l)^{\beta-1} e^{-(\alpha l)^{\beta}}$$

where α and β are the scale parameter and the shape parameter of the Weibull distribution, respectively;
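The duration density can be evaluated with a few lines of code, which also makes the link to the exponential default visible: with β = 1 the expression reduces to α·e^(−αl). The function name and the choice to return 0 for non-positive durations are assumptions of this sketch.

```python
import math


def weibull_duration_pdf(l, alpha, beta):
    """State duration density f(l) = a*b*(a*l)**(b-1) * exp(-(a*l)**b)."""
    if l <= 0:
        # Durations are positive; treating l <= 0 as zero density also avoids
        # 0 ** negative when beta < 1.
        return 0.0
    z = alpha * l
    return alpha * beta * z ** (beta - 1) * math.exp(-(z ** beta))
```

A shape parameter β above 1 concentrates the probability mass around a typical duration, and a value below 1 favors very short stays; the memoryless exponential assumption cannot express either behavior.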

The probability distribution of the state observation values is also made continuous. It is likewise assumed to follow a Weibull distribution, with probability distribution function ξ_i(θ) of the state observation values, where α_i and β_i are the Weibull parameters of each state stage. The improved HSMM is thus characterized by these Weibull-distributed durations and observation values.

Step 4, fault prediction.

The fault model and the non-fault model are trained to obtain their parameters. The goal is to evaluate whether a given observation sequence (error sequence) O = [o₁, o₂, ..., o_l] is a fault-related sequence. The sequence likelihood is first computed under each model, and the sequence is then classified as fault-free or fault-prone using Bayesian decision theory.

Step 5, preliminary judgment of the fault result:

When the decision condition holds, the sequence is marked as a fault-related event sequence and the system issues a fault prediction. The condition involves the cost of wrongly judging a fault-related sequence as fault-unrelated, the fault probability P(F), and the logarithm of the sequence likelihood; taking logarithms prevents numerical underflow when the likelihood values are very small. In this way every sequence can be judged and a fault prediction made.
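As an illustration of such a decision, the sketch below applies the standard minimum-risk Bayes rule in log form. It is not the exact inequality of this document: it assumes that correct decisions cost nothing, that `cost_miss` is the cost of judging a fault-related sequence as fault-unrelated, that `cost_false_alarm` is the cost of the opposite error, and that the two log-likelihoods come from the trained fault and non-fault models.

```python
import math


def is_fault_related(loglik_fault, loglik_normal, p_fault,
                     cost_miss, cost_false_alarm):
    """Bayes minimum-risk decision between the fault and non-fault models.

    Predict a fault when
        log P(O | fault) - log P(O | normal)
            > log(cost_false_alarm / cost_miss) + log((1 - P(F)) / P(F)).
    """
    threshold = (math.log(cost_false_alarm / cost_miss)
                 + math.log((1.0 - p_fault) / p_fault))
    return loglik_fault - loglik_normal > threshold
```

Raising `cost_miss` lowers the threshold, so the system warns earlier at the price of more false alarms; a smaller prior P(F) raises it.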

The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A real-time log control method, characterized in that the real-time log control method analyzes log record events, classifies, filters, and aggregates the error information, extracts it into sequences, trains fault models, calculates the probability that a sequence belongs to the fault sequences and the probability that it belongs to the non-fault sequences, and uses Bayesian classification theory to obtain the result and make a prediction.

2. The real-time log control method according to claim 1, wherein the real-time log control method specifically comprises:

Step 1: Collect log file data on each node in the distributed system, and send newly generated log data to the collector in real time through incremental checks;

Step 2, remove redundant events by deleting events of the same type reported at the same location within a certain period of time, where a time threshold is set to represent the time window used for event filtering; further remove redundant events from the log by deleting similar events reported at multiple different locations within a certain period of time, and save the data stream into the time series database; similarity is judged using Sim(D₁, D₂):

$Sim(D_1, D_2) = \cos\theta = \dfrac{\sum_{k=1}^{n} W_{1k} \times W_{2k}}{\sqrt{\left(\sum_{k=1}^{n} W_{1k}^{2}\right) \times \left(\sum_{k=1}^{n} W_{2k}^{2}\right)}}$;

where D₁ and D₂ denote two sequences, W_1k and W_2k denote the k-th vector components of D₁ and D₂, and the similarity is given by the cosine of the angle between the two vectors; the greater Sim(D₁, D₂), the higher the similarity between the two;

Step 3, when each piece of data is stored in the data table, use an SQL statement to split the record into the fields timestamp, process number, record level, process module, delimiter, and record information;

Step 4, using SQL statements to persist the processed standard formatted data;

Step 5, extracting the log fault sequence;

Step 6, the sequence likelihood is computed and used as the clustering metric, and a hierarchical clustering algorithm is used to group fault-related events, where:

S = [s_i] denotes a state sequence of length L, π = [π_i] is the initial state probability vector, and b_i(k) is the probability of the observed values in state s_i;

Step 7, using the combination of hidden semi-Markov model HSMM and Bayesian network Bayes Net to make fault prediction for real-time log data;

A standard HSMM is defined by the state-to-state transition probability matrix G(t) = [g_ij(t)], the initial state probability vector π = [π_i], and the probability matrix B = [b_i(k)] of the observed values in state s_i, and is written λ = (π, G(t), B); the state-duration probability distribution is made continuous, that is, the state duration is treated as a continuous distribution and is assumed to obey a Weibull distribution, so that the state-duration probability distribution f_i(l) of a state is:

$f_i(l) = \alpha\beta(\alpha l)^{\beta-1} e^{-(\alpha l)^{\beta}}$;

In the formula: α and β are the scale parameter and shape parameter of Weibull distribution respectively;

The probability distribution of the state monitoring value is also made continuous; it is likewise assumed to obey a Weibull distribution, and the probability distribution function ξ_i(θ) of the state monitoring value is:

$\xi_i(\theta) = P(y_t = \theta \mid q_t = x_i) = \alpha_i \beta_i (\alpha_i \theta)^{\beta_i - 1} e^{-(\alpha_i \theta)^{\beta_i}}$;

Among them, α _i and β _i are the parameters of Weibull distribution in each state stage; the improved HSMM model can be described as

Step 8, the fault and non-fault models are trained, that is, their parameters are estimated; the goal is to evaluate whether a given observation sequence O = [o₁, o₂, ..., o_l] is a fault-related sequence; the sequence likelihood is computed under each model, and the sequence is then classified as fault-free or fault-related using Bayesian decision theory;

Step 9, predict the failure result:

When the decision formula holds, the sequence is marked as a fault-related event sequence and the system issues a fault prediction; in the formula, one term denotes the cost of incorrectly judging a fault-related sequence to be a fault-unrelated sequence, P(F) denotes the probability of a fault, and the sequence likelihoods are taken in logarithmic form.
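The similarity-based filtering of Step 2 in claim 2 can be sketched as follows. This is an illustrative example only: it assumes that each log message is turned into a term-frequency vector by whitespace tokenization, and the names `term_vector`, `cosine_similarity`, `filter_redundant`, `window`, and `threshold` are hypothetical.

```python
import math
from collections import Counter


def term_vector(message: str) -> Counter:
    """Build a term-frequency vector (assumed representation of W_1k, W_2k)."""
    return Counter(message.lower().split())


def cosine_similarity(w1: Counter, w2: Counter) -> float:
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(w1[t] * w2[t] for t in w1.keys() & w2.keys())
    norm1 = math.sqrt(sum(v * v for v in w1.values()))
    norm2 = math.sqrt(sum(v * v for v in w2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)


def filter_redundant(events, window, threshold):
    """Drop events that resemble an already kept event inside the time window.

    events: list of (timestamp_seconds, message), sorted by timestamp.
    window: length of the filtering time window in seconds.
    threshold: similarity at or above which an event counts as redundant.
    """
    kept = []  # entries are (timestamp, message, vector)
    for ts, msg in events:
        vec = term_vector(msg)
        redundant = any(
            ts - kept_ts <= window
            and cosine_similarity(vec, kept_vec) >= threshold
            for kept_ts, _, kept_vec in kept
        )
        if not redundant:
            kept.append((ts, msg, vec))
    return [(ts, msg) for ts, msg, _ in kept]
```

Comparing each event only against events that were kept keeps the filter simple; a production system would also evict entries older than the window so that the comparison list stays short.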

3. The real-time log control method according to claim 2, wherein extracting the log fault sequence specifically comprises:

The first step is to extract the error event sequence: use the SQL statement to extract the records of the ERROR level according to the log level, and retain the timestamp and text message information;

The second step is to merge similar error events: use the Levenshtein edit distance algorithm on the event sequences to merge error events with high similarity; the minimum edit distance is built from the minimum edit distances of its sub-problems:

$d_{[i,j]} = \min\left(d_{[i-1,j]} + 1,\; d_{[i,j-1]} + 1,\; d_{[i-1,j-1]} + c_{ij}\right)$, with $c_{ij} = 0$ if $x_i = y_j$ and $c_{ij} = 1$ otherwise;

where d_{[i-1, j]} + 1 means that a letter is inserted into the target log and d_{[i, j-1]} + 1 means that a letter is deleted from the matching log; when x_i = y_j no modification is required, so the cost equals that of the previous step d_{[i-1, j-1]}, and otherwise it is d_{[i-1, j-1]} + 1; d_{[i, j]} is the smallest of these three terms;

The third step is to classify error events: after merging error events in the previous step, classify similar error events according to the keywords in the text information of error events, assign IDs, and store them in the database;

The fourth step is to extract the sequences: in chronological order, extract the events within a period of time before a fault occurs and set them as a fault-related event sequence, where a fault lead time is defined and the current fault event is the related fault event; a non-fault-related event sequence is the event sequence in a time interval during which the system does not fail.
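The edit-distance computation used in the second step of claim 3 can be sketched as follows. This is a minimal illustration of the standard Levenshtein recurrence; the helper `normalized_similarity`, which maps the distance to a score between 0 and 1, is a hypothetical addition for deciding when two messages are "similar" and is not taken from the claims.

```python
def levenshtein(x: str, y: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn x into y, computed row by row."""
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, xi in enumerate(x, start=1):
        cur = [i]  # distance from x[:i] to the empty prefix of y
        for j, yj in enumerate(y, start=1):
            cost = 0 if xi == yj else 1
            cur.append(min(prev[j] + 1,          # d[i-1, j] + 1
                           cur[j - 1] + 1,       # d[i, j-1] + 1
                           prev[j - 1] + cost))  # d[i-1, j-1] + cost
        prev = cur
    return prev[-1]


def normalized_similarity(x: str, y: str) -> float:
    """Hypothetical helper: 1.0 for identical strings, 0.0 for fully different."""
    longest = max(len(x), len(y))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(x, y) / longest
```

Keeping only the previous row reduces memory from O(len(x)·len(y)) to O(len(y)), which helps when many long log messages have to be compared.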

4. A real-time log control system implementing the real-time log control method of claim 1, wherein the real-time log control system comprises: a log information processing module and a log failure analysis module.

5. The real-time log control system according to claim 4, wherein the log information processing module comprises:

The log information collection unit is used to collect the log file data on each node in the distributed system; the log collection function allows the log files to be monitored to be customized, and sends newly generated log data to the collector in real time through incremental checks;

The log information filtering unit is used for de-redundancy and filtering of data;

The log information standard formatting unit is used for data standard formatting of the processed log information;

The log storage unit is used for persistent storage of the processed standard formatted data.

6. The real-time log control system according to claim 4, wherein the log failure analysis module comprises:

a log event sequence extraction unit;

The fault-related event clustering unit is used to use events to train a small hidden semi-Markov model in advance to calculate the sequence likelihood value;

The fault prediction unit uses the hidden semi-Markov model and Bayesian decision theory to determine whether a sequence is a fault-related sequence;

The fault result judgment output unit: when a sequence is judged to be a fault-related sequence, the system issues a fault warning and outputs a fault warning status.

7. The real-time log control system according to claim 6, wherein the log event sequence extraction unit further comprises:

An error event record extraction unit, which extracts the records of the ERROR level according to the log level and retains the timestamp, process module, and text message information;

A similar error event merging unit, which uses the Levenshtein edit distance algorithm to merge error event sequences with high similarity;

An error event classification unit, which uses the Levenshtein edit distance algorithm on the event sequences to classify similar error events and assign IDs;

A fault-related sequence extraction unit, which extracts, in chronological order, the events within a period of time before a fault and sets them as fault precursor events.
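The extraction of fault-related sequences can be sketched as follows. This is an illustration under explicit assumptions: the extraction window of length `window` is taken to end `lead_time` before each fault, and the names `extract_fault_sequences`, `window`, and `lead_time` are hypothetical.

```python
from bisect import bisect_left


def extract_fault_sequences(events, fault_times, window, lead_time=0.0):
    """Collect the event IDs that precede each fault.

    events: list of (timestamp, event_id), sorted by timestamp.
    fault_times: timestamps at which faults occurred.
    window: length of the period from which events are extracted.
    lead_time: assumed gap between the end of that period and the fault.
    Returns one list of event IDs per fault, in chronological order.
    """
    timestamps = [ts for ts, _ in events]
    sequences = []
    for fault_ts in fault_times:
        end = fault_ts - lead_time
        start = end - window
        lo = bisect_left(timestamps, start)  # first event with ts >= start
        hi = bisect_left(timestamps, end)    # first event with ts >= end
        sequences.append([event_id for _, event_id in events[lo:hi]])
    return sequences
```

Non-fault-related sequences could be obtained the same way by passing reference times taken from intervals in which the system did not fail.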

8. A cloud computing system using the real-time log control method according to any one of claims 1-3.

9. A cloud computing server using the real-time log control method according to any one of claims 1-3.