CN110347827A

CN110347827A - Event extraction method for heterogeneous text operation and maintenance data

Info

Publication number: CN110347827A
Application number: CN201910561157.4A
Authority: CN
Inventors: 徐建; 唐晓春; 傅媛媛; 蔡志成
Original assignee: Nanjing Tech University
Current assignee: Nanjing Tech University
Priority date: 2019-06-26
Filing date: 2019-06-26
Publication date: 2019-10-18
Anticipated expiration: 2039-06-26
Also published as: CN110347827B

Abstract

The invention provides an event extraction method for heterogeneous text operation and maintenance data, comprising the following steps: step 1, defining generalized-type regular expressions; step 2, type detection based on the regular expressions; step 3, clustering of the text operation and maintenance data based on multi-dimensional similarity; step 4, event type generation and type labeling of the text operation and maintenance data. Taking as its research object the text operation and maintenance data, such as log files and work tickets, generated when a complex IT system runs, the method is adaptable to heterogeneous text operation and maintenance data and achieves higher accuracy.

Description

Event extraction method for heterogeneous text operation and maintenance data

Technical Field

The invention relates to an event mining technology, in particular to an event extraction method for heterogeneous text operation and maintenance data.

Background

Event mining is crucial for system failure prediction. However, no generally accepted logging standard exists, so quickly analyzing log data from heterogeneous systems together with other operation and maintenance data, such as work tickets, is a very challenging problem.

Currently known log pattern discovery methods fall into two main categories: 1) matching methods based on regular expressions; 2) pattern recognition methods based on clustering. Many companies have developed log analysis tools, such as Splunk, Loggly and LogEntries, and there are open source software packages such as ElasticSearch, Graylog and OSSIM; most of them use regular expressions to match log data. Analyzing log data with regular expressions can usually mine the log patterns completely, but it requires a great deal of prior knowledge and manual intervention, cannot learn from historical log data, and is not suitable for large numbers of heterogeneous logs. Moreover, a given set of regular expressions can only be used for a specific system, so the approach is inflexible and cannot be extended. In addition, regular expressions are complex to write and prone to conflicts, which makes log analysis difficult; in particular, overly generalized regular expression rules reduce the efficiency of processing log data. Therefore, logs are generally preprocessed with regular expressions to mark common types, and clustering or other pattern recognition algorithms are then used for further analysis and mining, which can significantly improve the precision and efficiency of log analysis while adding only a small amount of prior knowledge.

Cloya et al (cloya, guo lou. Machine learning based log parsing system design and implementation [J]. Computer Applications, 2018, 38(02): 352-. The LogSig algorithm is a log parsing method based on "signatures", where a signature is the most representative phrase structure of an event type. The algorithm groups all log data into k clusters and finds a log signature in each cluster such that all logs in the cluster match this signature as closely as possible. Since log text is typically short, a log can be classified accurately once a signature appears in it.

Zhuge et al (Zhuge C, Vaarandi R. Efficient Event Log Mining with LogClusterC [C]// Big Data Security on Cloud (BigDataSecurity), IEEE International Conference on High Performance and Smart Computing (HPSC), and IEEE International Conference on Intelligent Data and Security (IDS), 2017 IEEE 3rd International Conference on. IEEE, 2017: 261-. LogCluster is essentially a frequent-word-based clustering algorithm, i.e., logs that share the same frequent words are clustered together. The method exploits the highly skewed distribution of words in logs, a property that many other log mining clustering algorithms also use.

Makanju et al carried out a series of studies on log data analysis. An iterative clustering algorithm for logs, IPLoM, is proposed in the literature (Makanju A, Zincir-Heywood A N, Milios E E, et al. Spatio-temporal decomposition, clustering and identification for alert detection in system logs [C]// Proceedings of the 27th Annual ACM Symposium on Applied Computing. ACM, 2012: 621-: 1) logs of the same length are grouped together; 2) each cluster is divided by the word with the optimal information gain; 3) the word with the current best information gain is used for further division; 4) the final clustering result is generated by majority vote. Experiments show that IPLoM outperforms other log clustering algorithms, but it tends to produce small cluster fragments without statistical significance, and the clustering quality is hard to control: the final result depends on the first step, so if the first step clusters poorly, the final result is unlikely to be satisfactory. Furthermore, IPLoM assumes that logs of the same length have the same format, which makes it unsuitable for large amounts of heterogeneous log data.

Wurzenberger et al (Wurzenberger M, Skopik F, Landauer M, et al. Incremental clustering for semi-supervised anomaly detection applied on log data [C]// Proceedings of the 12th International Conference on Availability, Reliability and Security. ACM, 2017: 31-36.) propose a semi-supervised incremental clustering algorithm that clusters fast-growing log data online, avoiding recalculation each time a new log appears. Liu et al (Liu J, Li K, Li Y, et al. Attack Pattern Mining Algorithm Based on Fuzzy Clustering and Sequence Pattern from Security Log [C]// International Conference on Intelligent Information Hiding and Multimedia Signal Processing. Springer, Cham, 2018: 44-52.) studied an attack pattern mining algorithm based on improved fuzzy clustering and sequence pattern mining. The method uses fuzzy clustering to mine the similarity between security logs and sequence patterns to discover the logical relationships between attack steps; experimental results show that the algorithm can mine attack patterns effectively. Xu et al (Xu C, Chen S, Cheng J. Network user interest pattern mining based on entropy clustering algorithm [C]// Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC), 2015 International Conference on. IEEE, 2015: 200-204) propose an algorithm for clustering web logs without user-defined parameters; its time complexity is O(n^3), where n is the number of logs, which is too high for the algorithm to scale to large data sets.

Xia Ning et al (X. Ning and G. Jiang, "HLAer: A system for heterogeneous log analysis," in Proceedings of the SDM Workshop on Heterogeneous Learning, 2014) studied HLAer, an unsupervised framework for automatically parsing heterogeneous log data. It is robust to heterogeneous logs but incurs a large memory overhead at runtime, and is therefore not scalable either. The common problem with the above algorithms and tools is that they cannot be extended to heterogeneous operation and maintenance data sets.

Disclosure of Invention

The invention aims to provide an event extraction method for heterogeneous text operation and maintenance data.

The technical scheme for realizing the purpose of the invention is as follows: an event extraction method for heterogeneous text operation and maintenance data comprises the following steps:

step 1, defining generalized-type regular expressions: a group of regular expressions is predefined to describe the dates, times, IP addresses and assignment expressions that appear in the text operation and maintenance data, and a generalized representation type is associated with each regular expression;

step 2, type detection based on the regular expressions: the given text operation and maintenance data are preprocessed with the predefined regular expressions, the type of each substring is detected so that dates, times, IP addresses and assignment expressions are identified, and each specific variable value is replaced with the generalized representation type of its substring;

step 3, clustering the text operation and maintenance data based on multi-dimensional similarity: a similarity measure for text operation and maintenance data is defined by integrating three factors, namely grammar, structure and semantics, and the data are partitioned using a one-pass idea and a density-based clustering algorithm to form text operation and maintenance data clusters;

step 4, event type generation and type labeling of the text operation and maintenance data: the event type represented by each cluster is generated by merging the operation and maintenance data in the cluster pair by pair, and each piece of text operation and maintenance data in the cluster is associated with the event type of that cluster.

Further, the specific steps of step 1 are as follows:

step 1.1, a generalized type set T = {DATE, TIME, IP, Exp=, Exp:, Exp[]} is defined, where DATE denotes date information, TIME denotes time information, IP denotes internet address information, Exp= denotes an assignment expression using the symbol "=", Exp: denotes an assignment expression using the symbol ":", and Exp[] denotes an assignment expression using the symbol "[]";

step 1.2, a group of regular expressions is associated with each generalized type t ∈ T to describe the different forms in which it may appear in the text operation and maintenance data.

Further, the specific steps of step 2 are as follows:

step 2.1, each piece of operation and maintenance data d is split into a string of words, using spaces or symbols as delimiters, where d ∈ D and D is the set of operation and maintenance data;

step 2.2, the defined regular expression set E is applied to each substring s of the operation and maintenance data; if a substring s is an instance of some predefined regular expression e, the type of s is generalized successfully and step 2.2.1 is executed; otherwise step 2.2.2 is executed; where s ∈ d and e ∈ E;

step 2.2.1, the substring s is replaced with the generalized representation type t corresponding to the regular expression e;

step 2.2.2, the operation and maintenance data d is produced by a new operation and maintenance data template, and the regular expression library and its set of generalized representation types are updated;

step 2.3, D = D - {d}; if |D| ≠ 0, type detection is not yet finished, so jump to step 2.1.

Further, the specific steps of step 3 are as follows:

step 3.1, for any two pieces of text operation and maintenance data d₁, d₂ ∈ D, let d₁＝p₁p₂...p_n and d₂＝q₁q₂...q_m, where p₁p₂...p_n and q₁q₂...q_m are the terms or words of d₁ and d₂ respectively, and n ≤ m;

step 3.2, define the grammar similarity measure sim₁(d₁,d₂),

where t(p_i) and t(q_i) denote the regular expression types of the i-th term or word of the operation and maintenance data d₁ and d₂ respectively;

step 3.3, define the structural similarity measure sim₂(d₁,d₂)

sim₂(d₁,d₂)＝2|lcs(d₁,d₂)|-|d₂|

where the function lcs() returns the longest common substring of the strings d₁ and d₂;

step 3.4, define the semantic similarity measure sim₃(d₁,d₂),

where the function if(w) denotes the word frequency of the word w, and sim_w(w,d₂) denotes the maximum word similarity between the word w and the words of the sentence d₂,

sim_w(w,d₁)＝max{sim_w(w,p_i) | i＝1,...,n}

sim_w(w,d₂)＝max{sim_w(w,q_j) | j＝1,...,m}；

step 3.5, combine the grammar, structure and semantic similarity measures to produce the overall similarity measure sim(d₁,d₂),

where w_i denotes the weight of each similarity measure;

step 3.6, given the operation and maintenance data D, apply a clustering algorithm based on the one-pass idea to partition the text operation and maintenance data into text operation and maintenance data clusters.

Further, the specific process of step 3.6 is:

step 3.6.1, define the parameter d_max as the maximum distance between a piece of operation and maintenance data and its cluster center, so that the maximum distance between any two pieces of operation and maintenance data in the same cluster is 2×d_max; let k be the number of clusters, initialized to k＝0, and denote the cluster set by C＝{c₁,c₂,...,c_k}, where c_k represents a cluster center;

step 3.6.2, process the operation and maintenance data d in D one by one:

step 3.6.2.1, if k＝0, then k += 1, assign d to cluster c₁ and take d as the center of c₁;

step 3.6.2.2, compute the similarity measure between d and each cluster center, {sim(d,c_i) | i＝1,...,k}; if there is a cluster c_i satisfying min(sim(d,c_i)) ≤ d_max, assign d to cluster c_i; otherwise create a new cluster c_{k+1}, assign d to c_{k+1}, take d as the center of c_{k+1}, and set k += 1;

step 3.6.3, D = D - {d}; if |D| ≠ 0, the clustering process is not yet complete, so jump to step 3.6.2;

step 3.6.4, the cluster set C is formed.

Further, the specific steps of step 4 are as follows:

step 4.1, generate an event type for any cluster c_i, where c_i∈C, c_i＝{d₁,d₂,...,d_g}, g＝|c_i|:

step 4.2, take any two pieces of operation and maintenance data d_x, d_y ∈ c_i, 1 ≤ x, y ≤ g, and set d'_i＝null;

step 4.3, align the operation and maintenance data d_x and d_y to obtain a pair d'_x, d'_y of equal length;

step 4.4, merge d'_x and d'_y to obtain d'_i

d'_i＝strcat(d'_i, f(d'_x(i),d'_y(i)) | i＝1,...,l)

where l＝|d'_x|, the function strcat() is the string concatenation function,

and Type(*) denotes the generalized type;

step 4.5, c_i＝c_i-{d_x,d_y}, c_i＝c_i∪{d'_i}; if |c_i| > 1, jump to step 4.2;

step 4.6, the resulting d'_i is the event type of cluster c_i;

step 4.7, for any cluster c_i＝{d₁,d₂,...,d_g}, g＝|c_i|, label each piece of operation and maintenance data in the cluster with the event type d'_i.

Compared with the prior art, the invention has the following advantages: (1) taking as its research object the text operation and maintenance data, such as log files and work tickets, generated when a complex IT system runs, the invention provides an event extraction method for heterogeneous text operation and maintenance data and labels each piece of text operation and maintenance data with a specific event type; (2) performing type detection with regular expressions improves adaptability to heterogeneous text operation and maintenance data; (3) the multi-dimensional similarity measure improves the accuracy of event extraction, and the semantic similarity measure in particular improves measurement accuracy in heterogeneous scenarios; (4) applying the one-pass clustering idea improves the efficiency of event extraction and makes the method suitable for real-time processing scenarios.

The present invention is described in further detail below with reference to the accompanying drawings.

Drawings

Fig. 1 is a flowchart of an event extraction method for heterogeneous text operation and maintenance data according to the present invention.

Fig. 2 is a schematic diagram of heterogeneous text operation and maintenance data.

Detailed Description

In the invention, a regular expression set consists of several regular expressions. Each regular expression is applied to a substring s of the operation and maintenance data to determine whether s satisfies it. A substring is a basic concept for character strings: a part of a given string that preserves the order of its words or letters. For example, for the character string abdfgd, adf, ag, etc. are substrings, whereas ga is not, because it does not preserve the order of the letters.

In the present invention, an instance (example) refers to a specific character string that satisfies a regular expression; for example, if the regular expression representing the year is defined as "\d{4}", then "2018" is an instance.
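The notion of an instance can be illustrated with a minimal Python sketch. The helper name `is_instance` is hypothetical, and treating "satisfies" as a full match of the whole string is an assumption made here for illustration.

```python
import re

# Regular expression for a four-digit year, as in the example above.
YEAR = re.compile(r"\d{4}")

def is_instance(pattern, s):
    """Return True if the whole string s satisfies the pattern.

    Assumption: an instance must match the pattern in full, not partially.
    """
    return pattern.fullmatch(s) is not None

print(is_instance(YEAR, "2018"))  # True
print(is_instance(YEAR, "18"))    # False
```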

With reference to fig. 1, an event extraction method for heterogeneous text operation and maintenance data includes the following steps:

step 1, defining generalized-type regular expressions; the process is as follows:

step 1.1, defining a generalized type set T = {DATE, TIME, IP, Exp=, Exp:, Exp[]} that covers dates, times, IP addresses, assignment expressions and other dimensions, where DATE denotes date information, TIME denotes time information, IP denotes internet address information, Exp= denotes an assignment expression using the symbol "=", Exp: denotes an assignment expression using the symbol ":", and Exp[] denotes an assignment expression using the symbol "[]";

step 1.2, associating with each generalized type t ∈ T a group of regular expressions that describe the different forms in which it may appear in the text operation and maintenance data; for example, an instance of the type DATE may be written as "2019-05-28", "05-28-2019", "2019.5.28", and so on; the corresponding regular expressions are defined, and the set of all regular expressions is denoted

E＝{((Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d{1,2}\,?\s+\d{4})|\d{4}\-[0-1]\d\-[0-3]\d};
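As a rough illustration of step 1, the association between generalized types and groups of regular expressions might be held in a dictionary. In the sketch below only the DATE patterns follow the set E given above; the TIME and IP patterns, and the names `TYPE_PATTERNS` and `detect_type`, are assumptions introduced for illustration.

```python
import re

# Hypothetical regular expression library: each generalized type is
# associated with a group of patterns. Only the DATE patterns follow the
# set E given above; the TIME and IP patterns are illustrative assumptions.
MONTH = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
TYPE_PATTERNS = {
    "DATE": [
        rf"{MONTH}\s+\d{{1,2}},?\s+\d{{4}}",
        r"\d{4}-[0-1]\d-[0-3]\d",
    ],
    "TIME": [r"\d{1,2}:\d{2}:\d{2}(?:\s?[AP]M)?"],
    "IP": [r"(?:\d{1,3}\.){3}\d{1,3}"],
}
COMPILED = {t: [re.compile(p) for p in ps] for t, ps in TYPE_PATTERNS.items()}

def detect_type(s):
    """Return the generalized type whose pattern s fully matches, or None."""
    for type_name, patterns in COMPILED.items():
        if any(p.fullmatch(s) for p in patterns):
            return type_name
    return None
```

Keeping a list of patterns per type mirrors the idea that one type, such as DATE, can appear in several written forms; new forms can be added to the list without touching the lookup logic.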

Step 2, type detection based on the regular expressions, with reference to fig. 2; the process is as follows:

step 2.1, preprocessing each piece of operation and maintenance data d ∈ D by splitting it into tokens, using spaces or symbols as delimiters;

step 2.2, applying the regular expression set E defined in step 1 to each substring s ∈ d of the operation and maintenance data; if a substring s is an instance of some predefined regular expression e ∈ E, the type of s is generalized successfully and step 2.2.1 is executed; otherwise step 2.2.2 is executed;

step 2.2.1, replacing the substring s with the generalized representation type t corresponding to the regular expression e; for example, 'Feb 26,2016' is replaced with DATE and '4:05:26 PM' with TIME;

step 2.2.2, the operation and maintenance data d is produced by a new operation and maintenance data template, and the regular expression library and its set of generalized representation types are updated with the help of a domain expert;

step 2.3, D = D - {d}; if |D| ≠ 0, type detection is not yet complete, so jump to step 2.1;
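The type detection loop of step 2 could be sketched roughly as follows. This is a simplified illustration under stated assumptions: records are tokenized on whitespace only, the patterns in `PATTERNS` are illustrative rather than the full library, and step 2.2.2 (updating the library with a domain expert) is not modeled, so unmatched tokens are simply kept unchanged.

```python
import re

# Illustrative patterns (assumptions, not the full regular expression
# library), tried in order; each maps to a generalized representation type.
PATTERNS = [
    ("DATE", re.compile(r"\d{4}-[0-1]\d-[0-3]\d")),
    ("TIME", re.compile(r"\d{1,2}:\d{2}:\d{2}")),
    ("IP", re.compile(r"(?:\d{1,3}\.){3}\d{1,3}")),
    ("Exp=", re.compile(r"\w+=\S+")),
]

def generalize(record):
    """Steps 2.1 to 2.2.1: tokenize on whitespace, replace typed tokens."""
    out = []
    for token in record.split():
        for type_name, pattern in PATTERNS:
            if pattern.fullmatch(token):
                out.append(type_name)  # replace the value with its type
                break
        else:
            out.append(token)  # no pattern matched: keep the token as is
    return " ".join(out)

def detect_types(records):
    """Step 2.3: process every record in D until none remain."""
    return [generalize(d) for d in records]

print(generalize("2019-05-28 10:15:02 login from 10.0.0.7 user=root"))
# DATE TIME login from IP Exp=
```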

step 3, clustering the text operation and maintenance data based on multi-dimensional similarity, wherein the process is as follows:

step 3.1, calculating the similarity of any two pieces of text operation and maintenance data. Any two pieces of text operation and maintenance data d₁, d₂ ∈ D are expressed as d₁＝p₁p₂...p_n and d₂＝q₁q₂...q_m, where p₁p₂...p_n and q₁q₂...q_m are the terms or words of d₁ and d₂ respectively and, without loss of generality, n ≤ m.

step 3.1.1, defining the grammar similarity measure sim₁(d₁,d₂), where α is a user-defined parameter, 0 ≤ α ≤ 1, and t(p_i) and t(q_i) denote the regular expression types of the i-th term or word of the operation and maintenance data d₁ and d₂ respectively.

step 3.1.2, defining the structural similarity measure sim₂(d₁,d₂)＝2|lcs(d₁,d₂)|-|d₂|, where the function lcs() returns the longest common substring of the strings d₁ and d₂;

step 3.1.3, defining the semantic similarity measure sim₃(d₁,d₂)

sim_w(w,d₁)＝max{sim_w(w,p_i) | i＝1,...,n}

sim_w(w,d₂)＝max{sim_w(w,q_j) | j＝1,...,m}；

step 3.1.4, combining the grammar, structure and semantic similarity measures to produce the overall similarity measure sim(d₁,d₂), where w_i denotes the weight of each similarity measure;
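The building blocks named in step 3.1 can be sketched in Python. The sketch covers only the parts stated in the text: the lcs() function and the word-to-sentence maximum sim_w(w,d). Everything else is an assumption: `word_sim` is a stand-in character-level similarity (the text does not say how word similarity is computed), lcs() is taken over tokens as a contiguous run, and `combine` assumes the overall measure is a weighted sum of the three measures.

```python
from difflib import SequenceMatcher

def lcs_length(a, b):
    """Length of the longest common contiguous run of tokens in a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def word_sim(w, v):
    """Stand-in word similarity (character-level ratio); an assumption."""
    return SequenceMatcher(None, w, v).ratio()

def sim_w(w, d):
    """sim_w(w, d) = max{ sim_w(w, q_j) | j = 1..m } over the words of d."""
    return max(word_sim(w, q) for q in d)

def combine(sims, weights):
    """Assumed weighted sum of the individual similarity measures."""
    return sum(wt * s for wt, s in zip(weights, sims))
```

For example, `lcs_length("a b c d".split(), "x b c y".split())` evaluates to 2, since the longest shared run of tokens is `b c`.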

step 3.2, given the operation and maintenance data D, applying a clustering algorithm based on the one-pass idea to partition the text operation and maintenance data into text operation and maintenance data clusters;

step 3.2.1, initializing parameters. Define the parameter d_max as the maximum distance between a piece of operation and maintenance data and its cluster center, so that the maximum distance between any two pieces of operation and maintenance data in the same cluster is 2×d_max; let k be the number of clusters, with k＝0, and denote the cluster set by C＝{c₁,c₂,...,c_k}, where c_k represents a cluster center;

step 3.2.2, processing the operation and maintenance data d in D one by one:

step 3.2.2.1, if k＝0, then k += 1, assign d to cluster c₁ and take d as the center of c₁;

step 3.2.2.2, compute the similarity measure between d and each cluster center, {sim(d,c_i) | i＝1,...,k}; if there is a cluster c_i satisfying min(sim(d,c_i)) ≤ d_max, assign d to cluster c_i; otherwise create a new cluster c_{k+1}, assign d to c_{k+1}, take d as the center of c_{k+1}, and set k += 1;

step 3.2.3, D = D - {d}; if |D| ≠ 0, the clustering process is not yet complete, so jump to step 3.2.2;

step 3.2.4, the cluster set C is formed.
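The one-pass clustering of step 3.2 can be sketched as a single loop over the records. In this sketch the measure compared against d_max is treated as a distance (smaller means closer), following the condition min(sim(d,c_i)) ≤ d_max in step 3.2.2.2. The function `token_distance` is a hypothetical stand-in, not the multi-dimensional measure of step 3.1.

```python
def one_pass_cluster(records, distance, d_max):
    """Assign each record to the nearest cluster center within d_max,
    or open a new cluster with the record as its center."""
    centers = []   # c_1 ... c_k (cluster centers)
    clusters = []  # members of each cluster
    for d in records:
        if centers:
            dists = [distance(d, c) for c in centers]
            i = min(range(len(centers)), key=dists.__getitem__)
            if dists[i] <= d_max:
                clusters[i].append(d)
                continue
        # k == 0 or no center is close enough: create cluster c_{k+1}
        centers.append(d)
        clusters.append([d])
    return centers, clusters

def token_distance(a, b):
    """Hypothetical distance: 1 minus the Jaccard overlap of token sets."""
    sa, sb = set(a.split()), set(b.split())
    if not sa | sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)
```

Because each record is compared only with the current cluster centers and never revisited, the data set is scanned once, which is what makes the approach attractive for streaming or real-time processing.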

Step 4, generating event types and labeling operation and maintenance data types, wherein the process is as follows:

step 4.1, generating an event type for any cluster c_i∈C, c_i＝{d₁,d₂,...,d_g}, g＝|c_i|;

step 4.1.1, take any two pieces of operation and maintenance data d_x, d_y ∈ c_i, 1 ≤ x, y ≤ g, and set d'_i＝null;

step 4.1.1.1, apply the Smith-Waterman algorithm to align the operation and maintenance data d_x and d_y, obtaining a pair d'_x, d'_y of equal length, with l＝|d'_x|;

step 4.1.1.2, merge d'_x and d'_y to obtain d'_i,

d'_i＝strcat(d'_i, f(d'_x(i),d'_y(i)) | i＝1,...,l)

where the function strcat() is the string concatenation function

and Type(*) denotes the generalized type.

step 4.1.1.3, c_i＝c_i-{d_x,d_y}, c_i＝c_i∪{d'_i}; if |c_i| > 1, jump to step 4.1.1;

step 4.1.1.4, the resulting d'_i is the event type of cluster c_i;

step 4.2, for any cluster c_i＝{d₁,d₂,...,d_g}, g＝|c_i|, label each piece of operation and maintenance data in the cluster with the event type d'_i.
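The pairwise merging of step 4 can be sketched as follows. Two substitutions are made here and should be read as assumptions: Python's `difflib.SequenceMatcher` is used for alignment in place of the Smith-Waterman algorithm, and, because the definition of the merge function f is not reproduced in the text, a hypothetical rule is used in which identical tokens are kept and each differing run collapses into a single wildcard `*`.

```python
from difflib import SequenceMatcher

WILDCARD = "*"  # hypothetical placeholder for positions that differ

def merge_pair(dx, dy):
    """Align two token lists and merge them segment by segment."""
    merged = []
    matcher = SequenceMatcher(None, dx, dy, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            merged.extend(dx[i1:i2])   # identical tokens are kept
        elif not merged or merged[-1] != WILDCARD:
            merged.append(WILDCARD)    # collapse a differing run
    return merged

def event_type(cluster):
    """Merge the records of a cluster pairwise until one template remains."""
    pending = [d.split() for d in cluster]
    while len(pending) > 1:
        dx, dy = pending.pop(), pending.pop()
        pending.append(merge_pair(dx, dy))
    return " ".join(pending[0])

cluster = [
    "DATE TIME login from IP user=alice",
    "DATE TIME login from IP user=bob",
    "DATE TIME login from IP user=carol",
]
print(event_type(cluster))  # DATE TIME login from IP *
```

Every record in the cluster would then be labeled with the returned template, which corresponds to step 4.2.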

Claims

1. An event extraction method for heterogeneous text operation and maintenance data is characterized by comprising the following steps:

2. The method according to claim 1, wherein the specific steps of step 1 are as follows:

3. The method according to claim 1, wherein the specific steps of step 2 are as follows:

step 2.3, D = D - {d}; if |D| ≠ 0, type detection is not yet finished, so jump to step 2.1.

4. The method according to claim 1, wherein the specific steps of step 3 are as follows:

step 3.1, for any two pieces of text operation and maintenance data d₁, d₂ ∈ D, let d₁＝p₁p₂...p_n and d₂＝q₁q₂...q_m, where p₁p₂...p_n and q₁q₂...q_m are the terms or words of d₁ and d₂ respectively, and n ≤ m;

step 3.2, define the grammar similarity measure sim₁(d₁,d₂),

where t(p_i) and t(q_i) denote the regular expression types of the i-th term or word of the operation and maintenance data d₁ and d₂ respectively;

step 3.3, define the structural similarity measure sim₂(d₁,d₂)

sim₂(d₁,d₂)＝2|lcs(d₁,d₂)|-|d₂|

where the function lcs() returns the longest common substring of the strings d₁ and d₂;

step 3.4, define the semantic similarity measure sim₃(d₁,d₂)

sim_w(w,d₁)＝max{sim_w(w,p_i) | i＝1,...,n}

sim_w(w,d₂)＝max{sim_w(w,q_j) | j＝1,...,m}；

where w_i denotes the weight of each similarity measure,

5. The method according to claim 4, wherein the specific process of step 3.6 is as follows:

step 3.6.1, define the parameter d_max as the maximum distance between a piece of operation and maintenance data and its cluster center, so that the maximum distance between any two pieces of operation and maintenance data in the same cluster is 2×d_max; let k be the number of clusters, initialized to k＝0, and denote the cluster set by C＝{c₁,c₂,...,c_k}, where c_k represents a cluster center;

step 3.6.4, the cluster set C is formed.

6. The method according to claim 5, wherein the specific steps of step 4 are as follows:

step 4.4, merge d'_x and d'_y to obtain d'_i

d'_i＝strcat(d'_i, f(d'_x(i),d'_y(i)) | i＝1,...,l)

where l＝|d'_x|, the function strcat() is the string concatenation function,

and Type(*) denotes the generalized type;

step 4.6, the resulting d'_i is the event type of cluster c_i;