CN108055152B

CN108055152B - Communication network information system abnormity detection method based on distributed service log

Info

Publication number: CN108055152B
Application number: CN201711309778.0A
Authority: CN
Inventors: 舒新建; 董刚松; 郭少勇; 金翼; 王世文; 安致嫄; 罗臻; 张筱筠; 李文萃; 徐思雅; 盛磊; 岳纲毅; 张宁宁
Original assignee: State Grid Corp of China SGCC; Beijing University of Posts and Telecommunications; Information and Telecommunication Branch of State Grid Henan Electric Power Co Ltd
Current assignee: State Grid Corp of China SGCC; Beijing University of Posts and Telecommunications; Information and Telecommunication Branch of State Grid Henan Electric Power Co Ltd
Priority date: 2017-12-11
Filing date: 2017-12-11
Publication date: 2020-11-06
Anticipated expiration: 2037-12-11
Also published as: CN108055152A

Abstract

The invention discloses a method for detecting anomalies in a communication network information system based on distributed service logs. By mining a service baseline in the network, the method obtains a reference case for automatic anomaly detection, so that real-time log streams can be compared automatically with the mined templates and anomalies detected. The invention provides an anomaly detection method based on a time-weighted control flow graph (TCFG) model: a control flow graph (CFG) with well-structured nodes is generated through a set of filtering rules; boundaries are then segmented by exploiting the irregular intervals between nodes that belong to different loops or threads, which further optimizes the structure; finally, the time weight of each edge is calculated, yielding the TCFG. The mined TCFG is compared with the real-time log stream to find anomalies. Evaluation shows that the method achieves good accuracy and recall: in a real environment, the accuracy is 80% and the recall 65% on average.

Description

Communication network information system abnormity detection method based on distributed service log

Technical Field

The invention relates to the technical field of communication network information anomaly detection, in particular to a distributed service log-based communication network information system anomaly detection method.

Background

With the development of networks, most network service systems must provide service around the clock (7 × 24 hours). When a system becomes abnormal, failure to detect and repair the problem in time causes huge economic loss. Developers usually find and handle anomalies through log analysis: logs record system operation information, and analyzing them is one of the most common ways to diagnose system anomalies. However, because systems are very large, their logs are distributed over many servers that are themselves numerous and widely scattered, so having developers manually locate and analyze the relevant logs takes a great deal of time. To reduce developers' workload, it is worthwhile to study an anomaly detection method based on automatic log analysis.

For distributed service anomaly detection, this patent focuses on building a time-weighted control flow graph (TCFG) model. Model building first generates a control flow graph (CFG) model, then segments boundaries to optimize the structure, and then calculates time weights using the K-means algorithm. Finally, the online log stream is compared with the mined TCFG model to perform difference diagnosis.

To understand the state of the prior art, existing papers and patents were searched, compared and analyzed, and the following technical information highly relevant to the invention was selected:

Technical scheme 1: the patent 'a communication abnormity detection device and method', application publication number CN106209496A, relates to an abnormality detection device that works in three parts. First, the communication unit receives a plurality of data frames through its communication interface. The unit must include at least one communication interface, and the at least one communication interface includes a first interface and a second interface so that anomaly detection for different parts can be supported. Second, the detection unit comprises a counting unit and a time length determining unit. The counting unit counts the data received by the communication unit from the data marked as a frame end bit to the data marked as a frame start bit. The time length determining unit determines the time length between two received frames from the count value of the counting unit. Third, the determining unit compares the detected time length with a preset time interval: if the time length lies within the interval, the communication of the communication unit is normal; otherwise it is abnormal, and the sending unit reports this to the management server.

Technical scheme 2: the patent 'a method and a device for detecting communication abnormity', application publication number CN106713012A, relates to a method and device for detecting ATM communication abnormality, carried out in four steps: first, heartbeat packets are sent within the ATM at a preset sending interval; second, a real-time communication load parameter of the heartbeat response packet corresponding to each heartbeat packet is detected, the communication load parameter being a delay rate or a packet loss rate; third, the current communication state in the ATM is determined from the communication load parameter, the communication state using a five-level grading system; fourth, the determined current communication state is output. The advantage of that invention is that abnormality can be detected even when no service data is present.

Technical scheme 3: the patent 'communication abnormality detection device, communication abnormality detection method, and program', application publication number CN105323017A, relates to a communication abnormality detection device that detects abnormality in the communication between a master device and a plurality of slave devices, carried out in three steps: first, a detection unit of the communication abnormality detection device continuously observes the communication voltage of the communication line between the master device and a slave device; second, when the communication voltage is lower than the set value, the notification unit notifies the master device of, in addition to the alarm, an ID identifying the slave device with which the communication is taking place; third, the master device displays the abnormality on its display unit as a maintenance warning for the specific location, and the display unit of the slave device specified by the ID also displays a maintenance notice.

Technical scheme 1 detects communication abnormality with three units: a communication unit, a detection unit and a determining unit. First, the communication unit receives a plurality of data frames; second, the detection unit detects the time length between adjacent data frames among those received by the communication unit; third, the determining unit compares the time length detected by the detection unit with a preset time interval, and the communication of the communication unit is normal if the time length lies within the interval and abnormal otherwise. However, this approach can detect only anomalies that show up as timing deviations, so its range of application is limited.

Technical scheme 2 performs real-time communication detection by sending heartbeat packets and checking their response packets during periods without service traffic. Heartbeat packets are sent at a preset interval, and the real-time communication load parameter of each corresponding heartbeat response packet is measured. The communication load parameter reflects the real-time communication state of the ATM. The state is reported on a graded scale with five states: normal, light congestion, moderate congestion, severe congestion and no heartbeat. The current communication state is obtained by comparing the communication load parameter with the parameter requirements of these five states, and is finally output through a communication state output module. The method is simple and effective, but its criterion takes a single form, the factors it considers are not comprehensive enough, and it is limited in the scale and conditions of the environments in which it can be used.

Technical scheme 3 detects communication between master and slave devices in a communication system. The detection unit detects the communication voltage between the master device and at least one slave device, that is, it reads the rising voltage value of the rectangular wave on the communication line between the master device and the slave device. When the communication voltage is detected to be equal to or lower than the set voltage, or is expected to fall below it, the notification unit notifies the master device of identification information identifying the communication destination in addition to the warning information. The master device includes a communication unit that informs the slave device identified by the identification information that warning information has been issued, and the slave device includes a display unit that displays this notification. The method can continuously detect or predict failures caused by a drop of the communication voltage on the line between master and slave devices, but the only failures it can detect are voltage-related, so its functionality is relatively narrow.

Disclosure of Invention

In view of the above, and in order to overcome the defects of the prior art, the present invention aims to provide an anomaly detection method for a communication network information system based on distributed service logs. By mining a service baseline in the network, the method obtains a reference case for automatic anomaly detection, so that real-time log streams can be compared automatically with the mined templates and anomalies detected.

The technical scheme for solving the problem comprises three steps: log analysis, TCFG generation and automatic anomaly detection. First, the logs are analyzed to obtain the log templates and the mapping between the original logs and the templates; when a log is mapped to a template, the log template index, the timestamp and the log template are recorded. The TCFG is then generated. Finally, the online log stream is compared with the mined TCFG model to complete the difference diagnosis. The method is characterized in that TCFG generation comprises the following three steps:

S1, CFG edge mining. In a CFG or TCFG, a node represents a log template. Each node carries a subset (its child set), a direct subset (its direct child set) and a time weight set. The subset is the set of nodes that follow the node within the time window; the direct subset is the set of direct successors of the node; the time weight set is the set of time intervals between the node and its direct child nodes, and serves as the candidate set for edge weights.
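As an illustration only, the node described in S1 could be held in a small record such as the one below. The class and field names are hypothetical; the source names only the three sets and the occurrence counts.

```python
from dataclasses import dataclass, field

@dataclass
class TemplateNode:
    """One CFG/TCFG node, i.e. one log template (all names are illustrative)."""
    template_id: int
    occurrences: int = 0                                 # times this template was seen
    children: dict = field(default_factory=dict)         # subset: child id -> count M_i
    direct_children: dict = field(default_factory=dict)  # direct subset: child id -> count Q_i
    time_weights: dict = field(default_factory=dict)     # child id -> observed time intervals

# Record one observation: template 7 directly follows template 3 after 0.42 s.
node = TemplateNode(template_id=3)
node.occurrences += 1
node.children[7] = node.children.get(7, 0) + 1
node.direct_children[7] = node.direct_children.get(7, 0) + 1
node.time_weights.setdefault(7, []).append(0.42)
```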

Step S1 has two key points. The first is the choice of the time window length t: a long window brings in noise, whereas a short window misses some correct successor nodes. The second is how to count the occurrences of child nodes: the rule adopted is that each child node is counted only once within a time window.
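The counting rule can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name, the event format and the reading of "direct child" as the log entry that immediately follows the parent are all assumptions.

```python
from collections import Counter

def count_children(events, window):
    """events: list of (timestamp, template_id), sorted by timestamp.
    Returns (occurrences, child_counts, direct_counts, intervals)."""
    occurrences = Counter()    # occurrences of each template
    child_counts = Counter()   # (parent, child) -> M, counted once per window
    direct_counts = Counter()  # (parent, child) -> Q, immediate successor only (assumption)
    intervals = {}             # (parent, child) -> intervals to the immediate successor
    for i, (t0, parent) in enumerate(events):
        occurrences[parent] += 1
        seen = set()
        for j in range(i + 1, len(events)):
            t1, child = events[j]
            if t1 - t0 > window:       # left the time window of this parent
                break
            if j == i + 1:
                direct_counts[(parent, child)] += 1
                intervals.setdefault((parent, child), []).append(t1 - t0)
            if child not in seen:      # count each child only once per window
                seen.add(child)
                child_counts[(parent, child)] += 1
    return occurrences, child_counts, direct_counts, intervals

events = [(0.0, "A"), (1.0, "B"), (1.5, "B"), (2.0, "C"), (10.0, "A"), (11.0, "B")]
occ, m, q, dt = count_children(events, window=3.0)
```

In this toy stream, template B appears twice inside the first window of A but contributes only one count to the pair (A, B) for that window.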

When the occurrences of child nodes are counted, filtering rules are used to determine the direct successor nodes of a node. Filtering rule 1 filters noise according to the number of occurrences of a child node, using formula (1).

T_n denotes a template node, and n is the number of occurrences of that log template node. For the i-th child node, M_i is its number of occurrences. A filtering threshold θ is set for child nodes; a child node whose M_i fails the condition of formula (1) against θ is filtered out and deleted from the subset.

Filtering rule 2 selects the direct successor nodes according to the number of occurrences of the direct child nodes, using formula (2).

For the i-th direct child node, Q_i is its number of occurrences. A filtering threshold η is set for the direct subset; a direct child node whose Q_i fails the condition of formula (2) against η is filtered out and deleted from the direct subset.

The direct successor nodes are then obtained using equation (3):

$$S_s = S_d \cap S_c \qquad (3)$$

where S_s, S_d and S_c denote the set of direct successor nodes, the subset and the direct subset, respectively.

For each template node, once its direct successor nodes have been calculated, the edges of the node are determined. All nodes together with their edges then constitute a CFG.
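The two filtering rules and the intersection of equation (3) might be sketched as below. Formulas (1) and (2) are not legible in this text, so the sketch assumes that a child is dropped when its count divided by the parent's occurrence count falls below the threshold; that ratio test, like the function and parameter names, is an assumption made for illustration.

```python
def direct_successors(n, child_counts, direct_counts, theta, eta):
    """n: occurrence count of the parent template node.
    child_counts: child id -> M_i; direct_counts: child id -> Q_i.
    ASSUMPTION: a child survives a rule when count / n >= threshold."""
    kept_children = {c for c, m in child_counts.items() if m / n >= theta}  # rule 1
    kept_direct = {c for c, q in direct_counts.items() if q / n >= eta}     # rule 2
    return kept_children & kept_direct  # equation (3): intersection of the two sets

succ = direct_successors(
    n=100,
    child_counts={"B": 90, "C": 40, "X": 2},
    direct_counts={"B": 85, "X": 1, "Y": 30},
    theta=0.3,
    eta=0.2,
)
```

Here only B survives both rules: C and Y each pass one rule but are absent from the other set, and X is rare enough to be treated as noise by both.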

S2, transition flow boundary segmentation. Whether the time interval between adjacent nodes is stable is determined by calculating its variance, using equations (4) and (5).

When the variance of the time intervals between two adjacent nodes is greater than a threshold β, the edge between the two nodes is deleted. To filter the noise introduced by a long time window in step S1, the value of the threshold β is set large.
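Equations (4) and (5) are not legible in this text. Assuming they are the usual mean and variance of the m time intervals t_1, …, t_m observed on an edge (the symbols μ and σ² are chosen here for illustration), the stability test can be written as:

```latex
\mu = \frac{1}{m} \sum_{j=1}^{m} t_j \qquad (4)

\sigma^2 = \frac{1}{m} \sum_{j=1}^{m} \left( t_j - \mu \right)^2 \qquad (5)

\text{delete the edge if } \sigma^2 > \beta
```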

The CFG is then checked for loops: the variance of every edge forming a loop is calculated, and the edge with the largest variance is deleted.
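A minimal sketch of step S2 follows, assuming the population variance is the intended measure and using a depth-first search to find loops. The source does not say how loops are located, so the search and all function names are illustrative choices.

```python
from statistics import pvariance

def prune_edges(intervals, beta):
    """intervals: dict (src, dst) -> list of time intervals.
    Drops every edge whose interval variance exceeds beta."""
    return {e: ts for e, ts in intervals.items() if pvariance(ts) <= beta}

def find_cycle(edges):
    """Return the edges of one directed cycle, or None (depth-first search)."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
    state = {}  # node -> 1 while on the DFS stack, 2 when finished

    def dfs(node, path):
        state[node] = 1
        for nxt in graph.get(node, []):
            if state.get(nxt) == 1:  # back edge closes a cycle
                nodes = path[path.index(nxt):] + [nxt]
                return list(zip(nodes, nodes[1:]))
            if nxt not in state:
                found = dfs(nxt, path + [nxt])
                if found:
                    return found
        state[node] = 2
        return None

    for start in list(graph):
        if start not in state:
            found = dfs(start, [start])
            if found:
                return found
    return None

def break_cycles(intervals):
    """Repeatedly delete the highest-variance edge of any remaining cycle."""
    kept = dict(intervals)
    while (cycle := find_cycle(kept)) is not None:
        worst = max(cycle, key=lambda e: pvariance(kept[e]))
        del kept[worst]
    return kept

intervals = {
    ("A", "B"): [1.0, 1.1, 0.9],
    ("B", "C"): [2.0, 2.0, 2.0],
    ("C", "A"): [1.0, 50.0, 100.0],  # unstable edge that closes a loop
    ("B", "D"): [5.0, 500.0],        # variance far above beta
}
tcfg_edges = break_cycles(prune_edges(intervals, beta=10000.0))
```

With these toy values the edge (B, D) is removed by the variance threshold, and (C, A) is removed because it is the least stable edge of the loop A, B, C.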

S3, calculation of the time weight of each edge in the TCFG. The time weight of an edge represents the execution time between two log template nodes and can be used to detect anomalies. A K-means-based time weight algorithm is adopted: the input is the set of time intervals D = {t_1, t_2, …, t_m} with the number of clusters K = 3, and the output is the time weight.

The time interval list of an edge is highly concentrated: most elements fall in one region, while the remaining elements are scattered elsewhere. The list is therefore divided into 3 groups using the K-means algorithm; the group containing the most elements is selected, and the largest element of that group is taken as the time weight.

The K-means algorithm is based on minimizing the sum of squared errors; the cost function (6) is:

$$E = \sum_{i=1}^{k} \sum_{x \in c_i} \left\| x - \bar{x}_i \right\|^2 \qquad (6)$$

where k is the number of clusters, c_i is the set of elements in the i-th cluster, x is any element of c_i (that is, a time interval), \bar{x}_i is the mean of cluster c_i, and E is the variance.
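The time weight rule of S3 can be illustrated with a small one-dimensional K-means written in plain Python. The initial centres (smallest, middle and largest sorted values) and the function name are assumptions of this sketch; the source fixes only K = 3 and the rule of taking the largest element of the most populous group.

```python
def time_weight(intervals, k=3, iterations=100):
    """1-D K-means; returns the largest element of the most populous cluster."""
    data = sorted(intervals)
    # Deterministic initial centres spread over the sorted data (illustrative choice).
    centres = [data[(len(data) - 1) * i // (k - 1)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda i: abs(x - centres[i]))
            clusters[nearest].append(x)
        # An empty cluster keeps its previous centre.
        new_centres = [sum(c) / len(c) if c else centres[i]
                       for i, c in enumerate(clusters)]
        if new_centres == centres:  # assignments have stabilised
            break
        centres = new_centres
    biggest = max(clusters, key=len)
    return max(biggest)

weight = time_weight([1.0, 1.1, 0.9, 1.2, 1.0, 5.0, 5.2, 30.0])
```

For this toy list the dense group is {0.9, 1.0, 1.0, 1.1, 1.2}, so the returned weight is 1.2; the scattered values 5.0, 5.2 and 30.0 end up in the two smaller clusters and do not influence it.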

Owing to the above technical scheme, the invention has the following advantages over the prior art:

Building on the discovery of a network service baseline, the invention studies automatic anomaly detection and provides a TCFG-based automatic anomaly detection method. The original logs are first analyzed to generate templates with timestamps. A TCFG (time-weighted control flow graph), which records the state transitions of the system while it runs a service, is then generated by analyzing the logs of each service. Anomalies in the real-time log stream are then diagnosed by comparison with the TCFG. Evaluation shows that the TCFG-based automatic anomaly detection method achieves good accuracy and recall: in a real environment, the accuracy is 80% and the recall 65% on average.
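The source does not spell out the comparison rule used for online diagnosis. One plausible reading, shown here purely as an assumption, flags a transition that has no edge in the mined TCFG, and flags a known transition whose elapsed time exceeds the edge's time weight. The function name and data layout are hypothetical.

```python
def diagnose(stream, tcfg):
    """stream: list of (timestamp, template_id) for one service, in time order.
    tcfg: dict (src, dst) -> time weight of the mined edge.
    Returns a list of (timestamp, reason) for suspected anomalies."""
    anomalies = []
    for (t0, src), (t1, dst) in zip(stream, stream[1:]):
        weight = tcfg.get((src, dst))
        if weight is None:
            # ASSUMPTION: a transition absent from the TCFG is anomalous.
            anomalies.append((t1, f"unexpected transition {src}->{dst}"))
        elif t1 - t0 > weight:
            # ASSUMPTION: exceeding the time weight is anomalous.
            anomalies.append((t1, f"slow transition {src}->{dst}"))
    return anomalies

tcfg = {("A", "B"): 1.2, ("B", "C"): 2.0}
stream = [(0.0, "A"), (1.0, "B"), (6.0, "C"), (7.0, "A")]
report = diagnose(stream, tcfg)
```

In this example the step from B to C takes 5.0 time units against a weight of 2.0, and the step from C to A has no edge in the model, so both are reported.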

Drawings

Fig. 1 is a flowchart of an anomaly detection method for a communication network information system based on a distributed service log according to the present invention.

FIG. 2 is a TCFG mining diagram generated by a Hadoop log of the anomaly detection method for a communication network information system based on distributed service logs.

Fig. 3 is a TCFG mining diagram generated by the Zookeeper log of the distributed service log-based communication network information system anomaly detection method of the present invention.

Fig. 4 is a diagram illustrating the accuracy and recall rate of TCFG mining according to the method for detecting an anomaly in a communication network information system based on a distributed service log.

Fig. 5 is a diagram illustrating the accuracy and recall rate of anomaly detection in the anomaly detection method for a communication network information system based on distributed service logs according to the present invention.

Detailed Description

The foregoing and other technical content, features and effects of the present invention will become apparent from the following detailed description of the embodiments with reference to the accompanying drawings, Figs. 1 to 5. The structural details mentioned in the following embodiments all refer to the accompanying drawings.

Embodiment one. The method for detecting anomalies in a communication network information system based on distributed service logs comprises three steps: log analysis, TCFG generation and automatic anomaly detection. First, the logs are analyzed using an existing method: in preprocessing, each log line is split into words on whitespace and the numbers in it are removed; the original logs are then divided into groups by a clustering method; finally, a log template is extracted from each group according to the similarity of its logs, and the mapping between templates and logs is produced. This yields the log templates and the mapping between the original logs and the templates; when a log is mapped to a template, the log template index, the timestamp and the log template are recorded. The TCFG is then generated, and finally the online log stream is compared with the mined TCFG model to complete the difference diagnosis. The method is characterized in that TCFG generation comprises the following three steps:

Step S1 has two key points. The first is the choice of the time window length t: a long window brings in noise, whereas a short window misses some correct successor nodes. The second is how to count the occurrences of child nodes: the rule adopted is that each child node is counted only once within a time window.

A child node that fails filtering rule 1 is filtered out and deleted from the subset.

For the i-th direct child node, Q_i is its number of occurrences. A filtering threshold η is set for the direct subset; a direct child node whose Q_i fails the condition of formula (2) against η is filtered out and deleted from the direct subset.

The direct successor nodes are then obtained using equation (3):

$$S_s = S_d \cap S_c \qquad (3)$$

S2, transition flow boundary segmentation. Whether the time interval between adjacent nodes is stable is determined by calculating its variance, using formulas (4) and (5).

The K-means algorithm is based on minimizing the sum of squared errors; the cost function is:

$$E = \sum_{i=1}^{k} \sum_{x \in c_i} \left\| x - \bar{x}_i \right\|^2$$

where k is the number of clusters, c_i is the set of elements in the i-th cluster, x is any element of c_i (that is, a time interval), \bar{x}_i is the mean of cluster c_i, and E is the variance.
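The preprocessing named at the start of this embodiment (splitting each log line on whitespace and removing the numbers) could look like the following minimal sketch. Dropping every token that contains a digit is one possible reading of "removing the numbers", and the sample log line is invented for illustration.

```python
import re

def preprocess(line):
    """Split a raw log line on whitespace and drop tokens that contain digits.
    ASSUMPTION: 'removing the numbers' is read as discarding such tokens."""
    return [tok for tok in line.split() if not re.search(r"\d", tok)]

# Hypothetical log line, not taken from the evaluated data sets.
tokens = preprocess(
    "2017-12-11 10:02:31 INFO Received block blk_42 of size 67108864 from 10.0.0.7"
)
```

The remaining tokens (INFO, Received, block, of, size, from) are the stable words that a clustering step would then group into a log template.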

In practice, TCFG mining and anomaly diagnosis are evaluated using a laboratory-environment log data set and a real-world log data set; 70% of the data is used for model development and 30% for evaluation testing. The data sets are as follows:

Hadoop log set: the laboratory environment runs Hadoop version 2.0 with two micro-services; about 1.6 GB of original logs covering ten days is selected from one micro-service in the system as the data set;

Zookeeper log set: a real-world Zookeeper system outputs about 20 GB of logs each day; about 8 GB of original logs covering eight hours is selected from a micro-service in the system as the data set;

TCFG mining and anomaly diagnosis are evaluated by accuracy and recall, which are defined as follows.

True positive (TP) means that a correct case is judged correct, false positive (FP) means that a wrong case is judged correct, and false negative (FN) means that a correct case is judged wrong.
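The defining formulas do not appear in this text. With the counts above, the standard definitions, which are presumably the ones intended, are:

```latex
\text{accuracy (precision)} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}
```

For example, 8 true positives with 2 false positives and 4 false negatives give an accuracy of 8/10 = 80% and a recall of 8/12 ≈ 67%.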

Specific parameter settings are shown in table 1. The depth is the depth of the log search tree and the similarity threshold is the similarity threshold of the log template.

TABLE 1 parameter settings

Figs. 2 and 3 show the TCFGs mined from the Hadoop and Zookeeper logs, which contain branching and linear sequences. TCFG mining and anomaly diagnosis are then evaluated using the standard accuracy and recall metrics. For this purpose, a reference TCFG structure is first built manually from the source code, and some anomalous logs are randomly added to the test set;

The algorithm used for comparison is a log-based anomaly diagnosis method built on a hybrid graph model (HGM). In that method, in order to segment boundaries, the interval between two tasks is set to be much longer than the execution time of each step within a task; HGM is therefore adopted in the experiment to highlight the advantage of the present method;

As can be seen from Fig. 4, the accuracy of both algorithms is high, while the recall is lower than the accuracy. This is because the segmentation strategy cannot segment all boundaries: to filter all noise, the threshold in the boundary segmentation strategy is set large, whereas if the threshold is too small, normal edges are filtered as well. The recall of the algorithm of the present invention is 80% on Hadoop, higher than the 75% of the HGM algorithm, and 75% on Zookeeper, higher than the 65% of the HGM algorithm, so the boundary segmentation strategy is superior to that of the HGM algorithm. The superiority of the strategy also shows in anomaly diagnosis: as shown in Fig. 5, the algorithm of this patent reaches 70% recall on Hadoop, higher than the 65% of the HGM algorithm, and 65% recall on Zookeeper, higher than the 50% of the HGM algorithm. It can be concluded that the K-means-based time weight algorithm also contributes.

Although the invention has been described in further detail with reference to specific embodiments, the invention is not limited to those embodiments. For those skilled in the art to which the invention pertains and in related fields, extensions, changes of operating method and substitutions of data made on the basis of the technical scheme of the invention shall fall within the scope of protection of the invention.

Claims

1. A method for detecting anomalies in a communication network information system based on distributed service logs, comprising the steps of: analyzing the logs to obtain the log templates and the mapping between the original logs and the templates; when a log is mapped to a template, recording the log template index, the timestamp and the log template; generating a TCFG time control flow graph; and comparing the online log stream with the mined TCFG time control flow graph model to complete the difference diagnosis; characterized in that generating the TCFG time control flow graph comprises the following steps:

S1, CFG control flow graph edge mining: in the CFG control flow graph or the TCFG time control flow graph, a node represents a log template; each node carries a subset, a direct subset and a time weight set; the subset is the set of nodes that follow the node within the time window; the direct subset is the set of direct successors of the node; the time weight set is the set of time intervals between the node and its direct child nodes, and is the candidate set for edge weights;

when the occurrences of child nodes are counted, filtering rules are used to determine the direct successor nodes of a node; filtering rule 1 filters noise according to the number of occurrences of a child node, using formula (1);

for the i-th child node, M_i is its number of occurrences; a filtering threshold θ is set for child nodes; a child node whose M_i fails the condition of formula (1) against θ is filtered out and deleted from the subset;

a direct child node that fails the condition of formula (2) is filtered out and deleted from the direct subset;

the direct successor nodes are then obtained using equation (3):

$$S_s = S_d \cap S_c \qquad (3)$$

for each template node, once its direct successor nodes have been calculated, the edges of the node are determined; all nodes together with their edges then constitute a CFG control flow graph;

the CFG control flow graph is then checked for loops: the variance of every edge forming a loop is calculated, and the edge with the largest variance is deleted;

S3, edge time weight calculation: the time weight of each edge in the TCFG time control flow graph represents the execution time between two log template nodes and can be used to detect anomalies; a K-means-based time weight algorithm is adopted, the input being the set of time intervals D = {t_1, t_2, …, t_m} with the number of clusters K = 3, and the output being the time weight;

in the K-means cost function, the mean of each cluster c_i is used, and E is the variance.