CN110347827A - Event extraction method for heterogeneous text operation and maintenance data - Google Patents

Event extraction method for heterogeneous text operation and maintenance data

Info

Publication number
CN110347827A
Authority
CN
China
Prior art keywords
maintenance data
cluster
type
sim
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910561157.4A
Other languages
Chinese (zh)
Other versions
CN110347827B (en)
Inventor
徐建
唐晓春
傅媛媛
蔡志成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Tech University
Original Assignee
Nanjing Tech University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Tech University
Priority to CN201910561157.4A
Publication of CN110347827A
Application granted
Publication of CN110347827B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04 INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00 Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50 Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Abstract

The invention provides an event extraction method for heterogeneous text operation and maintenance data, comprising the following steps: step 1, defining generalized-type regular expressions; step 2, type detection based on the regular expressions; step 3, clustering the text operation and maintenance data based on multi-dimensional similarity; step 4, generating event types and labeling the types of the text operation and maintenance data. The method takes as its research object the text operation and maintenance data, such as log files and work tickets, generated when a complex IT system runs. It adapts to heterogeneous text operation and maintenance data and achieves higher accuracy.

Description

Event extraction method for heterogeneous text operation and maintenance data
Technical Field
The invention relates to an event mining technology, in particular to an event extraction method for heterogeneous text operation and maintenance data.
Background
Event mining is crucial for system failure prediction. However, no generally accepted logging standard exists, so quickly analyzing log data from heterogeneous systems, together with other operation and maintenance data such as work tickets, is a very challenging problem.
Known log pattern discovery methods fall into two main categories: 1) matching methods based on regular expressions; 2) pattern recognition methods based on clustering. Many companies have developed log analysis tools, such as Splunk, Loggly and LogEntries, and there are open-source software packages such as ElasticSearch, Graylog and OSSIM; most of them match log data with regular expressions. Regular-expression analysis can usually mine log patterns completely, but it requires a great deal of prior knowledge and manual intervention, cannot learn from historical log data, and is unsuitable for large volumes of heterogeneous logs. Moreover, a given set of regular expressions works only for a specific system, so the approach is inflexible and hard to extend. Regular expressions are also complex to write and prone to conflicts, which makes log analysis difficult; in particular, overly general regular expression rules reduce the efficiency of log processing. Logs are therefore usually preprocessed with regular expressions to mark common types, and clustering or pattern recognition algorithms are then applied for further analysis and mining, which markedly improves the precision and efficiency of log analysis while adding only a small amount of prior knowledge.
Cloya et al. (Cloya, Guo Lou. Machine learning based log parsing system design and implementation [J]. Computer Applications, 2018, 38(02): 352-. The LogSig algorithm is a signature-based log parsing method, in which the most representative phrase structure of an event type is called its "signature". The algorithm groups all log data into k clusters and finds a log signature in each cluster such that all logs in the cluster match that signature as closely as possible. Since log text is typically short, a log can be classified accurately once a signature appears in it.
Zhuge et al. (Zhuge C, Vaarandi R. Efficient Event Log Mining with LogClusterC [C]// 2017 IEEE 3rd International Conference on Big Data Security on Cloud (BigDataSecurity), IEEE International Conference on High Performance and Smart Computing (HPSC), and IEEE International Conference on Intelligent Data and Security (IDS). IEEE, 2017: 261-. LogCluster is essentially a clustering algorithm based on frequent words: logs containing the same frequent words are clustered together. It exploits the highly skewed distribution of words in logs, a property also used by many other log-mining clustering algorithms.
Makanju et al. carried out a series of studies on log data analysis. An iterative clustering algorithm for logs, IPLoM, is proposed in the literature (Makanju A, Zincir-Heywood A N, Milios E E, et al. Spatio-temporal decomposition, classification and clustering for alert detection in system logs [C]// Proceedings of the 27th Annual ACM Symposium on Applied Computing. ACM, 2012: 621-: 1) logs of the same length are grouped together; 2) each cluster is split on the word with the best information gain; 3) the word with the currently best information gain is used for further splitting; 4) the final clustering result is produced by majority vote. Experiments show that IPLoM outperforms other log clustering algorithms, but it tends to produce small cluster fragments without statistical significance, and the clustering quality is hard to control: the final result depends on the first step, so a poor first-step clustering leads to an unsatisfactory final clustering. Moreover, IPLoM assumes that logs of the same length share the same format, which makes it unsuitable for large volumes of heterogeneous log data.
Wurzenberger et al. (Wurzenberger M, Skopik F, Landauer M, et al. Incremental clustering for semi-supervised anomaly detection applied on log data [C]// Proceedings of the 12th International Conference on Availability, Reliability and Security. ACM, 2017: 31-36.) propose a semi-supervised incremental clustering algorithm that clusters fast-growing log data online and avoids recomputation each time a new log appears. Liu et al. (Liu J, Li K, Li Y, et al. Attack Pattern Mining Algorithm Based on Fuzzy Clustering and Sequence Pattern from Security Log [C]// International Conference on Intelligent Information Hiding and Multimedia Signal Processing. Springer, Cham, 2018: 44-52.) studied an attack pattern mining algorithm based on improved fuzzy clustering and sequence pattern mining. It uses fuzzy clustering to mine the similarity between security logs and sequence patterns to discover the logical relationships between attack steps; experimental results show that the algorithm mines attack patterns effectively. Xu et al. (Xu C, Chen S, Cheng J. Network user interest pattern mining based on entropy clustering algorithm [C]// Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC), 2015 International Conference on. IEEE, 2015: 200-204) propose an algorithm for clustering web logs without user-defined parameters. Its time complexity is $O(n^3)$, where n is the number of logs, which is too high for the algorithm to scale to large data sets.
Ning et al. (X. Ning and G. Jiang, "HLAer: A system for heterogeneous log analysis," in Proceedings of the SDM Workshop on Heterogeneous Learning, 2014) studied HLAer, an unsupervised framework for automatically parsing heterogeneous log data. It is robust to heterogeneous logs but incurs a large memory overhead at runtime, so it does not scale either. The common problem of the above algorithms and tools is that they cannot be extended to heterogeneous operation and maintenance data sets.
Disclosure of Invention
The invention aims to provide an event extraction method for heterogeneous text operation and maintenance data.
The technical scheme for realizing the purpose of the invention is as follows: an event extraction method for heterogeneous text operation and maintenance data comprises the following steps:
Step 1, defining generalized-type regular expressions: a group of regular expressions is predefined to describe the dates, times, IP addresses and assignment expressions appearing in the text operation and maintenance data, and a generalized representation type is associated with each regular expression;
Step 2, type detection based on the regular expressions: the given text operation and maintenance data is preprocessed with the predefined regular expressions, the type of each substring is then detected, dates, times, IP addresses and assignment expressions are identified, and the specific variable values are replaced with the generalized representation types of the substrings;
Step 3, clustering the text operation and maintenance data based on multi-dimensional similarity: a similarity measure for text operation and maintenance data is defined by combining three factors, namely grammar, structure and semantics, and the data is partitioned into text operation and maintenance data clusters using the one-pass idea and a density-based clustering algorithm;
Step 4, event type generation and text operation and maintenance data type labeling: the event type represented by each cluster is generated by merging the operation and maintenance data in the cluster one by one, and each piece of text operation and maintenance data in the cluster is associated with the event type of that cluster.
Further, the specific steps of step 1 are as follows:
Step 1.1, define the generalized type set T = {DATE, TIME, IP, Exp=, Exp:, Exp[]}, where DATE denotes date information, TIME denotes time information, IP denotes Internet address information, Exp= denotes an assignment expression using the symbol "=", Exp: denotes an assignment expression using the symbol ":", and Exp[] denotes an assignment expression using the symbol "[]";
Step 1.2, associate with each generalized type $t \in T$ a group of regular expressions describing the different forms that may appear in the text operation and maintenance data.
Further, the specific steps of step 2 are as follows:
Step 2.1, split each piece of operation and maintenance data $d \in D$ into a string of words using delimiters such as spaces or symbols, where $D$ is the set of operation and maintenance data;
Step 2.2, apply the defined regular expression set $E$ to each substring $s \in d$; if $s$ is an instance of some regular expression $e \in E$, the type of $s$ is generalized successfully and step 2.2.1 is executed; otherwise step 2.2.2 is executed;
Step 2.2.1, replace the substring $s$ with the generalized representation type $t$ corresponding to the regular expression $e$;
Step 2.2.2, the operation and maintenance data $d$ was generated by a new operation and maintenance data template, so the regular expression library and its set of generalized representation types are updated;
Step 2.3, $D = D - \{d\}$; if $|D| \neq 0$, type detection is not finished, so jump to step 2.1.
Further, the specific steps of step 3 are as follows:
Step 3.1, for any two pieces of text operation and maintenance data $d_1, d_2 \in D$, let $d_1 = p_1 p_2 \ldots p_n$ and $d_2 = q_1 q_2 \ldots q_m$, where $p_1 p_2 \ldots p_n$ and $q_1 q_2 \ldots q_m$ are the items (words) of $d_1$ and $d_2$ respectively, and $n \le m$;
Step 3.2, define the grammar similarity measure $\mathrm{sim}_1(d_1, d_2)$ (the formula is not reproduced in this text), where $t(p_i)$ and $t(q_i)$ denote the regular expression type of the $i$-th item or $i$-th word of $d_1$ and $d_2$ respectively;
Step 3.3, define the structural similarity measure $\mathrm{sim}_2(d_1, d_2)$:
$\mathrm{sim}_2(d_1, d_2) = 2\,|\mathrm{lcs}(d_1, d_2)| - |d_2|$
where the function $\mathrm{lcs}()$ returns the longest common substring of the strings $d_1$ and $d_2$;
Step 3.4, define the semantic similarity measure $\mathrm{sim}_3(d_1, d_2)$ (the formula is not reproduced in this text),
where the function if(w) denotes the word frequency of the word $w$, and $\mathrm{sim}_w(w, d_2)$ denotes the maximum word similarity between the word $w$ and the words of the sentence $d_2$:
$\mathrm{sim}_w(w, d_1) = \max\{\mathrm{sim}_w(w, p_i) \mid i = 1, \ldots, n\}$
$\mathrm{sim}_w(w, d_2) = \max\{\mathrm{sim}_w(w, q_j) \mid j = 1, \ldots, m\}$;
Step 3.5, combine the grammar, structure and semantic similarity measures into the overall similarity measure $\mathrm{sim}(d_1, d_2)$ (the formula is not reproduced in this text),
where $w_i$ denotes the weight of each similarity measure;
Step 3.6, given the operation and maintenance data $D$, partition the text operation and maintenance data by applying a clustering algorithm based on the one-pass idea, forming text operation and maintenance data clusters.
Further, the specific process of step 3.6 is:
Step 3.6.1, define the parameter $d_{max}$, the maximum distance between a piece of operation and maintenance data and its cluster center, so that the maximum distance between any two pieces of operation and maintenance data in the same cluster is $2 \times d_{max}$; let the number of clusters be $k$, initialized to $k = 0$, and denote the cluster set by $C = \{c_1, c_2, \ldots, c_k\}$, where $c_k$ denotes a cluster center;
Step 3.6.2, process each piece of operation and maintenance data $d \in D$ one by one:
Step 3.6.2.1, if $k = 0$, set $k \leftarrow k + 1$, assign $d$ to cluster $c_1$ and take $d$ as the center of $c_1$;
Step 3.6.2.2, compute the similarity measure $\{\mathrm{sim}(d, c_i) \mid i = 1, \ldots, k\}$ between $d$ and each cluster center; if there is a cluster $c_i$ satisfying $\min(\mathrm{sim}(d, c_i)) \le d_{max}$, assign $d$ to $c_i$; otherwise create a new cluster $c_{k+1}$, assign $d$ to $c_{k+1}$, take $d$ as the center of $c_{k+1}$, and set $k \leftarrow k + 1$;
Step 3.6.3, $D = D - \{d\}$; if $|D| \neq 0$, the clustering process is not finished, so jump to step 3.6.2;
Step 3.6.4, the cluster set $C$ is formed.
Further, the specific steps of step 4 are as follows:
Step 4.1, generate an event type for any cluster $c_i \in C$, where $c_i = \{d_1, d_2, \ldots, d_g\}$ and $g = |c_i|$:
Step 4.2, take any two pieces of operation and maintenance data $d_x, d_y \in c_i$, $1 \le x, y \le g$, and set $d'_i = \mathrm{null}$;
Step 4.3, align $d_x$ and $d_y$ to obtain a pair of equal-length operation and maintenance data $d'_x$, $d'_y$;
Step 4.4, merge $d'_x$ and $d'_y$ to obtain $d'_i$:
$d'_i = \mathrm{strcat}(d'_i, f(d'_x(i), d'_y(i)) \mid i = 1, \ldots, l)$
where $l = |d'_x|$, the function strcat() is the string concatenation function, the definition of $f$ is not reproduced in this text,
and Type(*) denotes the generalized type;
Step 4.5, $c_i = c_i - \{d_x, d_y\}$, $c_i = c_i \cup \{d'_i\}$; if $|c_i| > 1$, jump to step 4.2;
Step 4.6, the resulting $d'_i$ is the event type of cluster $c_i$;
Step 4.7, for any cluster $c_i = \{d_1, d_2, \ldots, d_g\}$, $g = |c_i|$, label each piece of operation and maintenance data in the cluster with the event type $d'_i$.
Compared with the prior art, the invention has the following advantages: (1) taking as its research object the text operation and maintenance data, such as log files and work tickets, generated when a complex IT system runs, it provides an event extraction method for heterogeneous text operation and maintenance data that labels each piece of text operation and maintenance data with a specific event type; (2) type detection by regular expressions improves adaptability to heterogeneous text operation and maintenance data; (3) the multi-dimensional similarity measure improves the accuracy of event extraction, and the semantic similarity measure in particular increases measurement accuracy in heterogeneous scenarios; (4) the one-pass clustering idea improves the efficiency of event extraction and suits real-time processing scenarios.
The present invention is described in further detail below with reference to the attached drawing figures.
Drawings
Fig. 1 is a flowchart of an event extraction method for heterogeneous text operation and maintenance data according to the present invention.
Fig. 2 is a schematic diagram of heterogeneous text operation and maintenance data.
Detailed Description
In the invention, a regular expression set consists of several regular expressions. Each regular expression is applied to a substring s of the operation and maintenance data to determine whether s satisfies it. A substring is a basic concept for character strings: it is a part of a given string that preserves the order of its words or letters. For example, for the character string abdfgd, adf, ag, etc. are substrings, and gd is not a substring.
In the invention, an instance is a specific character string that satisfies a regular expression. For example, if the regular expression representing a year is defined as "\d{4}", then "2018" is an instance.
With reference to fig. 1, an event extraction method for heterogeneous text operation and maintenance data includes the following steps:
Step 1, define the generalized-type regular expressions, as follows:
Step 1.1, define the generalized type set T = {DATE, TIME, IP, Exp=, Exp:, Exp[]}, which may be extended with other dimensions, where DATE denotes date information, TIME denotes time information, IP denotes Internet address information, Exp= denotes an assignment expression using the symbol "=", Exp: denotes an assignment expression using the symbol ":", and Exp[] denotes an assignment expression using the symbol "[]";
Step 1.2, associate with each generalized type $t \in T$ a group of regular expressions describing the different forms that may appear in the text operation and maintenance data. For example, an instance of the type DATE may be written "2019-05-28", "05-28-2019", "2019.5.28", and so on. The corresponding regular expressions are defined, and the set of all regular expressions is
E={((Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d{1,2}\,?\s+\d{4})|\d{4}\-[0-1]\d\-[0-3]\d};
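As an illustration of step 1.2, the date alternatives printed in E can be compiled and checked against candidate strings. This is a minimal sketch: the names `DATE_PATTERN` and `is_date` are hypothetical, and treating "instance of a regular expression" as a full-string match is an assumption.

```python
import re

# The two date alternatives from the set E above, written as a Python raw string.
DATE_PATTERN = re.compile(
    r"((Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d{1,2}\,?\s+\d{4})"
    r"|\d{4}\-[0-1]\d\-[0-3]\d"
)


def is_date(s: str) -> bool:
    """Return True if the whole string s is an instance of the DATE pattern."""
    return DATE_PATTERN.fullmatch(s) is not None


print(is_date("2019-05-28"))    # ISO-style form
print(is_date("Feb 26, 2016"))  # month-name form
print(is_date("05-28-2019"))    # not covered by the two alternatives printed in E
```

The expression as printed covers only the month-name and year-month-day forms, so forms such as "05-28-2019" or "2019.5.28" would need further expressions in the group associated with DATE.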
Step 2, with reference to fig. 2, perform type detection based on the regular expressions, as follows:
Step 2.1, preprocess each piece of operation and maintenance data $d \in D$, splitting it into tokens at spaces or symbols;
Step 2.2, apply the regular expression set $E$ defined in step 1 to each substring $s \in d$; if $s$ is an instance of some regular expression $e \in E$, the type of $s$ is generalized successfully and step 2.2.1 is executed; otherwise step 2.2.2 is executed;
Step 2.2.1, replace the substring $s$ with the generalized representation type $t$ corresponding to the regular expression $e$; for example, 'Feb 26,2016' is replaced with DATE and '4:05:26 PM' with TIME;
Step 2.2.2, the operation and maintenance data $d$ was generated by a new operation and maintenance data template, so the regular expression library and its set of generalized representation types are updated with the help of a domain expert;
Step 2.3, $D = D - \{d\}$; if $|D| \neq 0$, type detection is not finished, so jump to step 2.1;
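The type detection loop of step 2 can be sketched as follows. The regular expression library here is a small hypothetical one (three patterns chosen for illustration, not the patent's library), tokens are split on whitespace only, and the expert-driven library update of step 2.2.2 is left out; unmatched tokens are simply kept.

```python
import re

# Hypothetical regular-expression library: (generalized type, pattern) pairs.
REGEX_LIBRARY = [
    ("DATE", re.compile(r"\d{4}-[0-1]\d-[0-3]\d")),
    ("TIME", re.compile(r"\d{1,2}:\d{2}:\d{2}")),
    ("IP", re.compile(r"(\d{1,3}\.){3}\d{1,3}")),
]


def generalize(record: str) -> str:
    """Replace every token that is an instance of a library pattern by its type."""
    out = []
    for token in record.split():
        for type_name, pattern in REGEX_LIBRARY:
            if pattern.fullmatch(token):
                out.append(type_name)  # step 2.2.1: substitute the generalized type
                break
        else:
            out.append(token)  # no pattern matched: keep the token unchanged
    return " ".join(out)


print(generalize("2019-05-28 16:05:26 connection from 10.0.0.7 refused"))
# prints: DATE TIME connection from IP refused
```

Multi-token values such as 'Feb 26,2016' would need matching on the whole record rather than on single tokens; the sketch keeps to single tokens for brevity.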
Step 3, cluster the text operation and maintenance data based on multi-dimensional similarity, as follows:
Step 3.1, calculate the similarity of any two pieces of text operation and maintenance data. Any two pieces of text operation and maintenance data $d_1, d_2 \in D$ are expressed as $d_1 = p_1 p_2 \ldots p_n$ and $d_2 = q_1 q_2 \ldots q_m$, where $p_1 p_2 \ldots p_n$ and $q_1 q_2 \ldots q_m$ are the items (words) of $d_1$ and $d_2$ respectively, and, without loss of generality, $n \le m$.
Step 3.1.1, define the grammar similarity measure $\mathrm{sim}_1(d_1, d_2)$ (the formula is not reproduced in this text), where $\alpha$ is a user-defined parameter with $0 \le \alpha \le 1$, and $t(p_i)$, $t(q_i)$ denote the regular expression type of the $i$-th item or $i$-th word of $d_1$ and $d_2$ respectively.
Step 3.1.2, define the structural similarity measure $\mathrm{sim}_2(d_1, d_2) = 2\,|\mathrm{lcs}(d_1, d_2)| - |d_2|$, where the function $\mathrm{lcs}()$ returns the longest common substring of the strings $d_1$ and $d_2$;
Step 3.1.3, define the semantic similarity measure $\mathrm{sim}_3(d_1, d_2)$ (the formula is not reproduced in this text),
where the function if(w) denotes the word frequency of the word $w$, and $\mathrm{sim}_w(w, d_2)$ denotes the maximum word similarity between the word $w$ and the words of the sentence $d_2$:
$\mathrm{sim}_w(w, d_1) = \max\{\mathrm{sim}_w(w, p_i) \mid i = 1, \ldots, n\}$
$\mathrm{sim}_w(w, d_2) = \max\{\mathrm{sim}_w(w, q_j) \mid j = 1, \ldots, m\}$;
Step 3.1.4, combine the grammar, structure and semantic similarity measures into the overall similarity measure $\mathrm{sim}(d_1, d_2)$ (the formula is not reproduced in this text), where $w_i$ denotes the weight of each similarity measure;
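Two building blocks of step 3.1 are stated explicitly in the text: the longest common substring used by the structural measure, and the maximum word similarity $\mathrm{sim}_w(w, d)$. The sketch below covers only those, treating a record as a list of tokens. The word-level similarity `word_sim` is a stand-in based on `difflib.SequenceMatcher`, and the weighted sum in `combined` is an assumed form of the overall measure; neither is taken from the patent, whose formulas for $\mathrm{sim}_1$, $\mathrm{sim}_3$ and $\mathrm{sim}$ are not available in this text.

```python
from difflib import SequenceMatcher


def lcs_length(a, b):
    """Length of the longest common contiguous run of tokens in a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                cur[j] = prev[j - 1] + 1  # extend the run ending at the previous tokens
                best = max(best, cur[j])
        prev = cur
    return best


def word_sim(u, v):
    """Stand-in word similarity in [0, 1] (assumption, not from the patent)."""
    return SequenceMatcher(None, u, v).ratio()


def sim_w(w, d):
    """Maximum similarity between word w and any word of the token list d."""
    return max(word_sim(w, q) for q in d)


def combined(sims, weights):
    """Weighted sum of individual similarity values (assumed form)."""
    return sum(wi * si for wi, si in zip(weights, sims))


d1 = "DATE TIME login failed for user".split()
d2 = "DATE TIME login ok for user".split()
print(lcs_length(d1, d2))   # longest shared run is "DATE TIME login"
print(sim_w("failed", d2))  # best match of "failed" among the words of d2
```

Working on token lists rather than raw characters is a design choice of this sketch; the text speaks of the longest common substring of the strings $d_1$ and $d_2$ without fixing the granularity.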
Step 3.2, given the operation and maintenance data $D$, partition the text operation and maintenance data by applying a clustering algorithm based on the one-pass idea, forming text operation and maintenance data clusters;
Step 3.2.1, initialize the parameters. Define the parameter $d_{max}$, the maximum distance between a piece of operation and maintenance data and its cluster center, so that the maximum distance between any two pieces of operation and maintenance data in the same cluster is $2 \times d_{max}$; let the number of clusters be $k$, with $k = 0$, and denote the cluster set by $C = \{c_1, c_2, \ldots, c_k\}$, where $c_k$ denotes a cluster center;
Step 3.2.2, process each piece of operation and maintenance data $d \in D$ one by one:
Step 3.2.2.1, if $k = 0$, set $k \leftarrow k + 1$, assign $d$ to cluster $c_1$ and take $d$ as the center of $c_1$;
Step 3.2.2.2, compute the similarity measure $\{\mathrm{sim}(d, c_i) \mid i = 1, \ldots, k\}$ between $d$ and each cluster center; if there is a cluster $c_i$ satisfying $\min(\mathrm{sim}(d, c_i)) \le d_{max}$, assign $d$ to $c_i$; otherwise create a new cluster $c_{k+1}$, assign $d$ to $c_{k+1}$, take $d$ as the center of $c_{k+1}$, and set $k \leftarrow k + 1$;
Step 3.2.3, $D = D - \{d\}$; if $|D| \neq 0$, the clustering process is not finished, so jump to step 3.2.2;
Step 3.2.4, the cluster set $C$ is formed.
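The one-pass procedure of step 3.2 can be sketched as below. The text compares a quantity written sim(d, c_i) against the maximum distance d_max, so this sketch treats the compared quantity as a distance (smaller means closer); that reading is an assumption. The function `jaccard_distance` is a hypothetical stand-in for the patent's combined measure, and the first record assigned to a cluster stays its center, as in steps 3.2.2.1 and 3.2.2.2.

```python
def jaccard_distance(a: str, b: str) -> float:
    """Stand-in distance (assumption): 1 - Jaccard similarity of the token sets."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


def one_pass_cluster(records, d_max, distance=jaccard_distance):
    """Assign each record, in a single pass, to the nearest center within d_max."""
    centers = []   # centers[i] is the record that opened cluster i
    clusters = []  # clusters[i] is the list of records assigned to cluster i
    for d in records:
        if centers:
            dists = [distance(d, c) for c in centers]
            i = min(range(len(centers)), key=dists.__getitem__)
            if dists[i] <= d_max:
                clusters[i].append(d)
                continue
        # No cluster yet, or no center close enough: d opens a new cluster.
        centers.append(d)
        clusters.append([d])
    return clusters


records = [
    "DATE TIME login failed for alice",
    "DATE TIME login failed for bob",
    "disk IP usage high",
    "DATE TIME login failed for carol",
]
for cluster in one_pass_cluster(records, d_max=0.5):
    print(cluster)
```

Each record is compared only with the current centers, so the work per record grows with the number of clusters rather than with the number of records seen so far, which is what makes the single pass attractive for streaming input.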
Step 4, generate event types and label the operation and maintenance data types, as follows:
Step 4.1, generate an event type for any cluster $c_i \in C$, where $c_i = \{d_1, d_2, \ldots, d_g\}$ and $g = |c_i|$;
Step 4.1.1, take any two pieces of operation and maintenance data $d_x, d_y \in c_i$, $1 \le x, y \le g$, and set $d'_i = \mathrm{null}$;
Step 4.1.1.1, apply the Smith-Waterman algorithm to align $d_x$ and $d_y$, obtaining a pair of equal-length operation and maintenance data $d'_x$, $d'_y$, with $l = |d'_x|$;
Step 4.1.1.2, merge $d'_x$ and $d'_y$ to obtain $d'_i$:
$d'_i = \mathrm{strcat}(d'_i, f(d'_x(i), d'_y(i)) \mid i = 1, \ldots, l)$
where the function strcat() is the string concatenation function, the definition of $f$ is not reproduced in this text,
and Type(*) denotes the generalized type.
Step 4.1.1.3, $c_i = c_i - \{d_x, d_y\}$, $c_i = c_i \cup \{d'_i\}$; if $|c_i| > 1$, jump to step 4.1.1;
Step 4.1.1.4, the resulting $d'_i$ is the event type of cluster $c_i$;
Step 4.2, for any cluster $c_i = \{d_1, d_2, \ldots, d_g\}$, $g = |c_i|$, label each piece of operation and maintenance data in the cluster with the event type $d'_i$.
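The merge loop of step 4 can be sketched as follows, under two loudly stated assumptions. First, the alignment uses `difflib.SequenceMatcher` from the Python standard library instead of the Smith-Waterman algorithm named in the text. Second, because the definition of the merge function f is not available here, the sketch assumes that aligned equal tokens are kept and every differing stretch becomes a single wildcard `*`. The names `merge_pair`, `event_type` and `WILDCARD` are hypothetical.

```python
from difflib import SequenceMatcher
from functools import reduce

WILDCARD = "*"  # assumed placeholder for positions where the records differ


def merge_pair(x, y):
    """Merge two token lists: keep aligned equal tokens, wildcard the rest."""
    merged = []
    matcher = SequenceMatcher(None, x, y, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            merged.extend(x[i1:i2])
        elif not merged or merged[-1] != WILDCARD:
            merged.append(WILDCARD)  # collapse each differing stretch to one wildcard
    return merged


def event_type(cluster):
    """Fold all records of a cluster into one template, merging pair by pair."""
    token_lists = [record.split() for record in cluster]
    return " ".join(reduce(merge_pair, token_lists))


cluster = [
    "DATE TIME login failed for alice",
    "DATE TIME login failed for bob",
    "DATE TIME login failed for carol",
]
template = event_type(cluster)
labels = {record: template for record in cluster}  # step 4.2: label every record
print(template)  # prints: DATE TIME login failed for *
```

Folding with `reduce` merges the running template with the next record until one template remains, which mirrors replacing two members of the cluster by their merge until the cluster size is 1.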

Claims (6)

1. An event extraction method for heterogeneous text operation and maintenance data is characterized by comprising the following steps:
Step 1, defining generalized-type regular expressions: a group of regular expressions is predefined to describe the dates, times, IP addresses and assignment expressions appearing in the text operation and maintenance data, and a generalized representation type is associated with each regular expression;
Step 2, type detection based on the regular expressions: the given text operation and maintenance data is preprocessed with the predefined regular expressions, the type of each substring is then detected, dates, times, IP addresses and assignment expressions are identified, and the specific variable values are replaced with the generalized representation types of the substrings;
Step 3, clustering the text operation and maintenance data based on multi-dimensional similarity: a similarity measure for text operation and maintenance data is defined by combining three factors, namely grammar, structure and semantics, and the data is partitioned into text operation and maintenance data clusters using the one-pass idea and a density-based clustering algorithm;
Step 4, event type generation and text operation and maintenance data type labeling: the event type represented by each cluster is generated by merging the operation and maintenance data in the cluster one by one, and each piece of text operation and maintenance data in the cluster is associated with the event type of that cluster.
2. The method according to claim 1, wherein the specific steps of step 1 are as follows:
Step 1.1, define the generalized type set T = {DATE, TIME, IP, Exp=, Exp:, Exp[]}, where DATE denotes date information, TIME denotes time information, IP denotes Internet address information, Exp= denotes an assignment expression using the symbol "=", Exp: denotes an assignment expression using the symbol ":", and Exp[] denotes an assignment expression using the symbol "[]";
Step 1.2, associate with each generalized type $t \in T$ a group of regular expressions describing the different forms that may appear in the text operation and maintenance data.
3. The method according to claim 1, wherein the specific steps of step 2 are as follows:
Step 2.1, split each piece of operation and maintenance data $d \in D$ into a string of words using delimiters such as spaces or symbols, where $D$ is the set of operation and maintenance data;
Step 2.2, apply the defined regular expression set $E$ to each substring $s \in d$; if $s$ is an instance of some regular expression $e \in E$, the type of $s$ is generalized successfully and step 2.2.1 is executed; otherwise step 2.2.2 is executed;
Step 2.2.1, replace the substring $s$ with the generalized representation type $t$ corresponding to the regular expression $e$;
Step 2.2.2, the operation and maintenance data $d$ was generated by a new operation and maintenance data template, so the regular expression library and its set of generalized representation types are updated;
Step 2.3, $D = D - \{d\}$; if $|D| \neq 0$, type detection is not finished, so jump to step 2.1.
4. The method according to claim 1, wherein the specific steps of step 3 are as follows:
Step 3.1, for any two pieces of text operation and maintenance data $d_1, d_2 \in D$, let $d_1 = p_1 p_2 \ldots p_n$ and $d_2 = q_1 q_2 \ldots q_m$, where $p_1 p_2 \ldots p_n$ and $q_1 q_2 \ldots q_m$ are the items (words) of $d_1$ and $d_2$ respectively, and $n \le m$;
Step 3.2, define the grammar similarity measure $\mathrm{sim}_1(d_1, d_2)$ (the formula is not reproduced in this text),
where $t(p_i)$ and $t(q_i)$ denote the regular expression type of the $i$-th item or $i$-th word of $d_1$ and $d_2$ respectively;
Step 3.3, define the structural similarity measure $\mathrm{sim}_2(d_1, d_2)$:
$\mathrm{sim}_2(d_1, d_2) = 2\,|\mathrm{lcs}(d_1, d_2)| - |d_2|$
where the function $\mathrm{lcs}()$ returns the longest common substring of the strings $d_1$ and $d_2$;
Step 3.4, define the semantic similarity measure $\mathrm{sim}_3(d_1, d_2)$ (the formula is not reproduced in this text),
where the function if(w) denotes the word frequency of the word $w$, and $\mathrm{sim}_w(w, d_2)$ denotes the maximum word similarity between the word $w$ and the words of the sentence $d_2$:
$\mathrm{sim}_w(w, d_1) = \max\{\mathrm{sim}_w(w, p_i) \mid i = 1, \ldots, n\}$
$\mathrm{sim}_w(w, d_2) = \max\{\mathrm{sim}_w(w, q_j) \mid j = 1, \ldots, m\}$;
Step 3.5, combine the grammar, structure and semantic similarity measures into the overall similarity measure $\mathrm{sim}(d_1, d_2)$ (the formula is not reproduced in this text),
where $w_i$ denotes the weight of each similarity measure;
Step 3.6, given the operation and maintenance data $D$, partition the text operation and maintenance data by applying a clustering algorithm based on the one-pass idea, forming text operation and maintenance data clusters.
5. The method according to claim 4, wherein the specific process of step 3.6 is as follows:
step 3.6.1, defining a parameter dmax that represents the maximum distance between a piece of operation and maintenance data and its cluster center, so that the maximum distance between any two pieces of operation and maintenance data in the same cluster is 2×dmax; denoting the number of clusters as k, initialized to k=0, and denoting the cluster set as C={c1,c2,...,ck}, wherein ck represents a cluster center;
step 3.6.2, processing each piece of operation and maintenance data d in D one by one:
step 3.6.2.1, if k=0, then k+=1, d is assigned to cluster c1, and d is taken as the center of cluster c1;
step 3.6.2.2, calculating the similarity metric {sim(d,ci)|i=1,...,k} between d and each cluster center; if there is a cluster ci satisfying min(sim(d,ci))≤dmax, then d is assigned to cluster ci; otherwise, a new cluster ck+1 is created, d is assigned to cluster ck+1 and taken as the center of cluster ck+1, and k+=1;
step 3.6.3, D=D-{d}; if |D|≠0, the clustering process is not yet complete, and the method jumps to step 3.6.2;
step 3.6.4, forming the cluster set C.
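The one-pass procedure of steps 3.6.1 to 3.6.4 can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the function name `one_pass_cluster` and the `dist` callback are hypothetical, and the measure compared against dmax is treated as a distance (smaller means closer), which is one reading of the claim's comparison min(sim(d,ci))≤dmax.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def one_pass_cluster(
    data: Sequence[T],
    dist: Callable[[T, T], float],
    d_max: float,
) -> Tuple[List[T], List[List[T]]]:
    """Single pass over the data; each item is compared with cluster centers only."""
    centers: List[T] = []          # centers[i] is the first item placed in cluster i
    clusters: List[List[T]] = []   # clusters[i] holds the members of cluster i
    for d in data:
        if centers:
            # Find the nearest existing center (step 3.6.2.2).
            i = min(range(len(centers)), key=lambda j: dist(d, centers[j]))
            if dist(d, centers[i]) <= d_max:
                clusters[i].append(d)
                continue
        # No cluster yet (step 3.6.2.1) or none close enough: open a new one.
        centers.append(d)
        clusters.append([d])
    return centers, clusters
```

Because each item is compared only with the current centers and centers never move, the result depends on the order in which the data arrive, which is the usual trade-off of a one-pass scheme.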
6. The method according to claim 5, wherein the specific steps of step 4 are as follows:
step 4.1, generating an event type for an arbitrary cluster ci, wherein ci∈C, ci={d1,d2,...,dg}, g=|ci|:
step 4.2, for any two pieces of operation and maintenance data dx,dy∈ci in cluster ci, where 1≤x,y≤g, initializing d'i=null;
step 4.3, aligning the operation and maintenance data dx and dy to obtain a pair of equal-length operation and maintenance data d'x and d'y;
step 4.4, merging d'x and d'y to obtain d'i:
d'i=strcat(d'i,f(d'x(i),d'y(i))|i=1,...,l)
wherein l=|d'x|, the function strcat() is a string concatenation function,
and Type(*) indicates the generalized type;
step 4.5, ci=ci-{dx,dy}, ci=ci∪{d'i}; if |ci|>1, jumping to step 4.2;
step 4.6, the resulting d'i is the event type of cluster ci;
step 4.7, for an arbitrary cluster ci, ci={d1,d2,...,dg}, g=|ci|, each piece of operation and maintenance data in the cluster is labeled with the event type d'i.
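A minimal Python sketch of steps 4.2 to 4.6 follows. Several parts are assumptions, because the claim text does not define them here: alignment is done with `difflib.SequenceMatcher` opcodes over whitespace-separated tokens, the merge function f keeps a token when both sides agree and otherwise emits a single wildcard `*` as a stand-in for the generalized Type(*), and the names `align`, `merge` and `event_type` are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Sequence, Tuple

GAP = None       # hypothetical padding marker inserted by the alignment
WILDCARD = "*"   # hypothetical stand-in for the generalized Type(*)

Tokens = List[Optional[str]]


def align(x: Tokens, y: Tokens) -> Tuple[Tokens, Tokens]:
    """Pad x and y with GAP so that both have equal length (step 4.3)."""
    ax: Tokens = []
    ay: Tokens = []
    matcher = SequenceMatcher(None, x, y, autojunk=False)
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        xs, ys = x[i1:i2], y[j1:j2]
        n = max(len(xs), len(ys))
        ax.extend(xs + [GAP] * (n - len(xs)))
        ay.extend(ys + [GAP] * (n - len(ys)))
    return ax, ay


def merge(ax: Tokens, ay: Tokens) -> Tokens:
    """Assumed f: keep matching tokens, generalize differing ones (step 4.4)."""
    return [a if a == b else WILDCARD for a, b in zip(ax, ay)]


def event_type(cluster: Sequence[str]) -> str:
    """Fold a non-empty cluster pairwise until one template remains (steps 4.5, 4.6)."""
    items: List[Tokens] = [list(line.split()) for line in cluster]
    while len(items) > 1:
        dx, dy = items.pop(), items.pop()
        items.append(merge(*align(dx, dy)))
    return " ".join(tok for tok in items[0] if tok is not None)
```

For example, three lines that differ only in an IP address collapse to one template with a wildcard in that position, and every line in the cluster would then be labeled with that template as in step 4.7.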
CN201910561157.4A 2019-06-26 2019-06-26 Event Extraction Method for Heterogeneous Text Operation and Maintenance Data Active CN110347827B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910561157.4A CN110347827B (en) 2019-06-26 2019-06-26 Event Extraction Method for Heterogeneous Text Operation and Maintenance Data

Publications (2)

Publication Number Publication Date
CN110347827A CN110347827A (en) 2019-10-18
CN110347827B CN110347827B (en) 2023-08-22

Family

ID=68183197

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910561157.4A Active CN110347827B (en) 2019-06-26 2019-06-26 Event Extraction Method for Heterogeneous Text Operation and Maintenance Data

Country Status (1)

Country Link
CN (1) CN110347827B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104239436A (en) * 2014-08-27 2014-12-24 南京邮电大学 Network hot event detection method based on text classification and clustering analysis
CN106339293A (en) * 2016-08-20 2017-01-18 南京理工大学 Signature-based log event extracting method
CN108536792A (en) * 2018-03-30 2018-09-14 东华大学 A kind of file classification method of the text representation strategy based on more words
CN109343990A (en) * 2018-09-25 2019-02-15 江苏润和软件股份有限公司 A kind of cloud computing system method for detecting abnormality based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
衷宜: "Xen 虚拟化平台下基于系统调用分析的语义重构方法", 《南京理工大学学报》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111143312A (en) * 2019-12-24 2020-05-12 广东电科院能源技术有限责任公司 Format analysis method, device, equipment and storage medium for power logs
CN113742116A (en) * 2020-11-27 2021-12-03 北京沃东天骏信息技术有限公司 Abnormity positioning method, abnormity positioning device, abnormity positioning equipment and storage medium
CN117033464A (en) * 2023-08-11 2023-11-10 上海鼎茂信息技术有限公司 Log parallel analysis algorithm based on clustering and application
CN117033464B (en) * 2023-08-11 2024-04-02 上海鼎茂信息技术有限公司 Log parallel analysis algorithm based on clustering and application

Also Published As

Publication number Publication date
CN110347827B (en) 2023-08-22

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant