CN110347827A - Event extraction method for heterogeneous text operation and maintenance data - Google Patents
- Publication number
- CN110347827A (application number CN201910561157.4A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y04—INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
- Y04S—SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
- Y04S10/00—Systems supporting electrical power generation, transmission or distribution
- Y04S10/50—Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications
Abstract
The present invention provides an event extraction method for heterogeneous text operation and maintenance data, comprising the following steps: step 1, defining generalized-type regular expressions; step 2, type detection based on regular expressions; step 3, clustering of text operation and maintenance data based on multi-dimensional similarity; step 4, event type generation and text operation and maintenance data type labeling. The method takes as its research object the text operation and maintenance data, such as log files and work tickets, generated when a complex IT system runs, and offers adaptability to heterogeneous text operation and maintenance data together with higher accuracy.
Description
Technical Field
The invention relates to an event mining technology, in particular to an event extraction method for heterogeneous text operation and maintenance data.
Background
Event mining is crucial for system failure prediction. However, no generally accepted logging standard exists, so quickly analyzing log data from heterogeneous systems, together with other operation and maintenance data such as work tickets, is a very challenging problem.
Currently known log pattern discovery methods fall into two main categories: 1) matching methods based on regular expressions; 2) pattern recognition methods based on clustering. Many companies have developed log analysis tools, such as Splunk, Loggly and LogEntries, and there are open source software packages such as ElasticSearch, Graylog and OSSIM; most of these use regular expressions to match log data. Parsing log data with regular expressions can usually mine log patterns completely, but it requires a great deal of prior knowledge and manual intervention, cannot learn from historical log data, and is not suitable for large volumes of heterogeneous logs. Moreover, a given set of regular expressions only works for a specific system, so the approach is inflexible and does not extend to other systems. Regular expressions are also complex to write and prone to conflicts, which makes log analysis difficult; in particular, overly general regular expression rules reduce the efficiency of processing log data. Therefore, logs are generally preprocessed with regular expressions to label common types, and clustering or pattern recognition algorithms are then used for further analysis and mining, which can significantly improve the precision and efficiency of log analysis while adding only a small amount of prior knowledge.
Cloya et al (cloya, guo lou. Machine learning based log parsing system design and implementation [J]. Computer Application, 2018, 38(02): 352-. The LogSig algorithm is a log parsing method based on "signatures", where a signature is the most representative phrase structure of an event type. The algorithm groups all log data into k clusters and finds a log signature in each cluster so that all logs in a cluster match this signature as closely as possible. Since log text is typically short, a log can be classified accurately once a signature appears in it.
Zhuge et al (Zhuge C, Vaarandi R. Efficient Event Log Mining with LogClusterC [C]// Big Data Security on Cloud (BigDataSecurity), IEEE International Conference on High Performance and Smart Computing (HPSC), and IEEE International Conference on Intelligent Data and Security (IDS), 2017 IEEE 3rd International Conference on. IEEE, 2017: 261-. LogCluster is essentially a clustering algorithm based on frequent words, i.e., logs sharing the same frequent words are clustered together. The method exploits the highly skewed distribution of words in logs, a property that many log mining clustering algorithms also use.
Makanju et al. carried out a series of studies on log data analysis. An iterative clustering algorithm for logs, IPLoM, is proposed in the literature (Makanju A, Zincir-Heywood A N, Milios E E, et al. Spatio-temporal decomposition, clustering and identification for alert detection in system logs [C]// Proceedings of the 27th Annual ACM Symposium on Applied Computing. ACM, 2012: 621-: 1) logs of the same length are aggregated together; 2) each cluster is divided by the word with the best information gain; 3) the word with the current best information gain is taken for further division; 4) a final clustering result is generated by majority vote. Experiments show that IPLoM outperforms other log clustering algorithms, but it tends to produce small cluster fragments with no statistical significance, and its clustering quality is hard to control: because the final result depends on the first step, a poor first step makes a satisfactory final result unlikely. In addition, IPLoM assumes that logs of the same length share the same format, which makes it unsuitable for large volumes of heterogeneous log data.
Wurzenberger et al. (Wurzenberger M, Skopik F, Landauer M, et al. Incremental clustering for semi-supervised anomaly detection applied on log data [C]// Proceedings of the 12th International Conference on Availability, Reliability and Security. ACM, 2017: 31-36.) propose a semi-supervised incremental clustering algorithm that clusters fast-growing log data online, avoiding recomputation each time a new log appears. Liu et al. (Liu J, Li K, Li Y, et al. Attack Pattern Mining Algorithm Based on Fuzzy Clustering and Sequence Pattern from Security Log [C]// International Conference on Intelligent Information Hiding and Multimedia Signal Processing. Springer, Cham, 2018: 44-52.) studied an attack pattern mining algorithm based on improved fuzzy clustering and sequence pattern mining. The method combines fuzzy clustering, which mines the similarity between security logs, with sequence patterns, which reveal the logical relationships between attack steps; experimental results show that the algorithm can mine attack patterns effectively. Xu et al. (Xu C, Chen S, Cheng J. Network user interest pattern mining based on entropy clustering algorithm [C]// Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC), 2015 International Conference on. IEEE, 2015: 200-204) propose an algorithm for clustering web logs that needs no user-defined parameters; its time complexity is $O(n^3)$, where n is the number of logs, which is too high to scale to large data sets.
Xia Ning et al. (X. Ning and G. Jiang, "HLAer: A system for heterogeneous log analysis," in Proceedings of the SDM Workshop on Heterogeneous Learning, 2014) studied HLAer, an unsupervised framework for automatically parsing heterogeneous log data; it is robust to heterogeneous logs but requires a large amount of memory at run time and is therefore not scalable either. The common problem of the above algorithms and tools is that they cannot be extended to heterogeneous operation and maintenance data sets.
Disclosure of Invention
The invention aims to provide an event extraction method for heterogeneous text operation and maintenance data.
The technical scheme for realizing the purpose of the invention is as follows: an event extraction method for heterogeneous text operation and maintenance data comprises the following steps:
step 1, defining a generalization type regular expression: a group of regular expressions are predefined by using dates, time, IP addresses and assignment expressions to describe the dates, the time, the IP addresses and the assignment expressions appearing in the text operation and maintenance data, and a generalization representation type is associated with each regular expression;
step 2, detecting based on the type of the regular expression: preprocessing the given text operation and maintenance data by adopting a predefined regular expression, then detecting the type of each substring, identifying the date, time, IP address and assignment expression, and replacing specific variable values with generalized expression types of the substrings;
step 3, clustering the text operation and maintenance data based on multi-dimensional similarity: combining the three factors of syntax, structure and semantics of the text operation and maintenance data to define a similarity measure, and partitioning the text operation and maintenance data with a one-pass, density-based clustering algorithm to form text operation and maintenance data clusters;
step 4, event type generation and text operation and maintenance data type labeling: generating the event type represented by each cluster by merging the operation and maintenance data in the cluster pair by pair, and associating each piece of text operation and maintenance data in the cluster with the event type of that cluster.
Further, the specific steps of step 1 are as follows:
step 1.1, defining a generalized type set $T=\{\text{DATE},\text{TIME},\text{IP},\text{Exp=},\text{Exp:},\text{Exp[]}\}$, where DATE denotes date information, TIME denotes time information, IP denotes internet address information, Exp= denotes an assignment expression using the symbol "=", Exp: denotes an assignment expression using the symbol ":", and Exp[] denotes an assignment expression using the symbol "[]";
step 1.2, associating with each generalized type $t\in T$ a group of regular expressions describing the different forms that may appear in the text operation and maintenance data.
Further, the specific steps of step 2 are as follows:
step 2.1, splitting each piece of operation and maintenance data $d$ into a string of words using delimiters such as spaces or symbols, where $d\in D$ and $D$ is the set of operation and maintenance data;
step 2.2, applying the defined regular expression set $E$ to each substring $s$ of the operation and maintenance data; if a substring $s$ is an instance of some predefined regular expression $e$, the type generalization of $s$ succeeds and step 2.2.1 is executed; otherwise step 2.2.2 is executed; where $s\in d$ and $e\in E$;
step 2.2.1, replacing the substring $s$ with the generalized representation type $t$ corresponding to the regular expression $e$;
step 2.2.2, the operation and maintenance data $d$ is generated by a new operation and maintenance data template, and the regular expression library and its set of generalized representation types are updated;
step 2.3, $D=D-\{d\}$; if $|D|\neq 0$, type detection is not yet finished, and the method jumps to step 2.1.
Further, the specific steps of step 3 are as follows:
step 3.1, for any two pieces of text operation and maintenance data $d_1,d_2\in D$, let $d_1=p_1p_2\ldots p_n$ and $d_2=q_1q_2\ldots q_m$, where $p_1p_2\ldots p_n$ and $q_1q_2\ldots q_m$ are the items (words) of $d_1$ and $d_2$ respectively, and $n\le m$;
step 3.2, defining the syntactic similarity measure $sim_1(d_1,d_2)$, where $t(p_i)$ and $t(q_i)$ denote the regular expression type of the $i$-th item or $i$-th word of $d_1$ and $d_2$ respectively;
step 3.3, defining the structural similarity measure
$sim_2(d_1,d_2)=2|lcs(d_1,d_2)|-|d_2|$
where the function $lcs()$ returns the longest common substring of the strings $d_1$ and $d_2$;
step 3.4, defining the semantic similarity measure $sim_3(d_1,d_2)$, where the function $if(w)$ denotes the word frequency of the word $w$, and $sim_w(w,d_2)$ denotes the maximum word similarity between the word $w$ and the words of the sentence $d_2$:
$sim_w(w,d_1)=\max\{sim_w(w,p_i)\mid i=1,\ldots,n\}$
$sim_w(w,d_2)=\max\{sim_w(w,q_j)\mid j=1,\ldots,m\}$;
step 3.5, combining the syntactic, structural and semantic similarity measures to produce the overall similarity measure $sim(d_1,d_2)$, where $w_i$ denotes the weight of each similarity measure;
step 3.6, given the operation and maintenance data $D$, partitioning the text operation and maintenance data with a clustering algorithm based on the one-pass idea to form text operation and maintenance data clusters.
Further, the specific process of step 3.6 is:
step 3.6.1, defining a parameter $d_{max}$ as the maximum distance between a piece of operation and maintenance data and its cluster center, so that the maximum distance between any two pieces of operation and maintenance data in the same cluster is $2\times d_{max}$; denoting the number of clusters by $k$, initialized to $k=0$, and the cluster set by $C=\{c_1,c_2,\ldots,c_k\}$, where $c_k$ denotes a cluster center;
step 3.6.2, processing the operation and maintenance data $d\in D$ one by one:
step 3.6.2.1, if $k=0$, then k += 1, $d$ is assigned to cluster $c_1$, and $d$ is taken as the center of cluster $c_1$;
step 3.6.2.2, computing the similarity measures $\{sim(d,c_i)\mid i=1,\ldots,k\}$ between $d$ and each cluster center; if there exists a cluster $c_i$ satisfying $\min(sim(d,c_i))\le d_{max}$, $d$ is assigned to cluster $c_i$; otherwise a new cluster $c_{k+1}$ is created, $d$ is assigned to cluster $c_{k+1}$ and taken as the center of cluster $c_{k+1}$, and k += 1;
step 3.6.3, $D=D-\{d\}$; if $|D|\neq 0$, the clustering process is not yet complete, and the method jumps to step 3.6.2;
step 3.6.4, forming the cluster set $C$.
Further, the specific steps of step 4 are as follows:
step 4.1, generating an event type for any cluster $c_i$, where $c_i\in C$, $c_i=\{d_1,d_2,\ldots,d_g\}$, $g=|c_i|$:
step 4.2, for any two pieces of operation and maintenance data $d_x,d_y\in c_i$, $1\le x,y\le g$, setting $d'_i=null$;
step 4.3, aligning the operation and maintenance data $d_x$ and $d_y$ to obtain a pair of equal-length operation and maintenance data $d'_x$, $d'_y$;
step 4.4, merging $d'_x$ and $d'_y$ to obtain $d'_i$:
$d'_i=strcat(d'_i,f(d'_x(i),d'_y(i))\mid i=1,\ldots,l)$
where $l=|d'_x|$, the function $strcat()$ is a string concatenation function, and $Type(*)$ denotes the generalized type;
step 4.5, $c_i=c_i-\{d_x,d_y\}$, $c_i=c_i\cup\{d'_i\}$; if $|c_i|>1$, jumping to step 4.2;
step 4.6, the resulting $d'_i$ is the event type of cluster $c_i$;
step 4.7, for any cluster $c_i=\{d_1,d_2,\ldots,d_g\}$, $g=|c_i|$, labeling each piece of operation and maintenance data in the cluster with the event type $d'_i$.
Compared with the prior art, the invention has the advantages that: (1) the invention provides an event extraction method for heterogeneous text operation and maintenance data by taking text operation and maintenance data such as log files, work tickets and the like generated when a complex IT system operates as research objects, and a specific event type is marked for each text operation and maintenance data; (2) the mode of realizing type detection by adopting the regular expression can improve the adaptability of processing heterogeneous text operation and maintenance data; (3) the method has the advantages that multi-dimensional similarity measurement is designed, the accuracy of event extraction can be improved, and particularly, the semantic similarity measurement can increase the measurement accuracy in a heterogeneous scene; (4) the one-pass clustering idea is applied, the event extraction efficiency can be improved, and the method is suitable for real-time processing scenes.
The present invention is described in further detail below with reference to the attached drawing figures.
Drawings
Fig. 1 is a flowchart of an event extraction method for heterogeneous text operation and maintenance data according to the present invention.
Fig. 2 is a schematic diagram of heterogeneous text operation and maintenance data.
Detailed Description
In the invention, a regular expression set consists of a plurality of regular expressions. Each regular expression is applied to a substring s of the operation and maintenance data to judge whether s satisfies it. A substring is a basic concept of character strings: it denotes a part of a given string that preserves the order of its words or letters. For example, for the character string abdfgd, adf, ag, etc. are substrings, and gd is not a substring.
In the present invention, an instance is a specific character string that satisfies a regular expression; for example, if the regular expression representing the year is defined as "\d{4}", then "2018" is an instance.
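The notion of an instance can be illustrated with a minimal Python sketch. The names `YEAR` and `is_instance` are hypothetical and chosen only for this illustration; the sketch assumes that "satisfying" a regular expression means the whole string matches it.

```python
import re

# Year pattern from the text: exactly four digits.
YEAR = re.compile(r"\d{4}")

def is_instance(pattern, s):
    """Return True when the entire string s matches the pattern."""
    return pattern.fullmatch(s) is not None

print(is_instance(YEAR, "2018"))
print(is_instance(YEAR, "20180"))
```

Using a full match rather than a search keeps longer digit runs such as "20180" from being treated as years.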
With reference to fig. 1, an event extraction method for heterogeneous text operation and maintenance data includes the following steps:
step 1, defining a generalization type regular expression, and the process is as follows:
step 1.1, defining a generalized type set $T=\{\text{DATE},\text{TIME},\text{IP},\text{Exp=},\text{Exp:},\text{Exp[]}\}$, which may be extended with other dimensions, where DATE denotes date information, TIME denotes time information, IP denotes internet address information, Exp= denotes an assignment expression using the symbol "=", Exp: denotes an assignment expression using the symbol ":", and Exp[] denotes an assignment expression using the symbol "[]";
step 1.2, associating with each generalized type $t\in T$ a group of regular expressions describing the different forms that may appear in the text operation and maintenance data; for example, an instance of the type DATE may be written as "2019-05-28", "05-28-2019", "2019.5.28", and so on; the set of all regular expressions is denoted $E$, for example
E={((Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d{1,2}\,?\s+\d{4})|\d{4}\-[0-1]\d\-[0-3]\d};
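One way to picture step 1 is as a lookup table from generalized types to patterns. The following is a minimal sketch, not the patent's actual library: the DATE patterns follow the expression above (with the whitespace after the comma made optional, which is an assumption), while the TIME and IP patterns and the names `TYPE_PATTERNS` and `detect_type` are hypothetical.

```python
import re

# Hypothetical regex library: each generalized type maps to one or more patterns.
TYPE_PATTERNS = {
    "DATE": [
        r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d{1,2},?\s*\d{4}",
        r"\d{4}-[0-1]\d-[0-3]\d",
    ],
    "TIME": [r"\d{1,2}:\d{2}:\d{2}(?:\s?[AP]M)?"],  # assumed form
    "IP": [r"(?:\d{1,3}\.){3}\d{1,3}"],             # assumed form, no range check
}

COMPILED = {t: [re.compile(p) for p in ps] for t, ps in TYPE_PATTERNS.items()}

def detect_type(s):
    """Return the first generalized type whose pattern matches all of s, else None."""
    for type_name, patterns in COMPILED.items():
        if any(p.fullmatch(s) for p in patterns):
            return type_name
    return None

print(detect_type("2019-05-28"))
print(detect_type("192.168.0.1"))
```

Keeping the patterns in a dictionary makes it straightforward to add a new type or a new surface form for an existing type.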
Step 2, with reference to fig. 2, based on the type detection of the regular expression, the process is as follows:
step 2.1, preprocessing each piece of operation and maintenance data $d\in D$, splitting it into tokens using spaces or symbols;
step 2.2, applying the regular expression set $E$ defined in step 1 to each substring $s\in d$ of the operation and maintenance data; if a substring $s$ is an instance of some predefined regular expression $e\in E$, the type generalization of $s$ succeeds and step 2.2.1 is executed; otherwise step 2.2.2 is executed;
step 2.2.1, replacing the substring $s$ with the generalized representation type $t$ corresponding to the regular expression $e$; for example, 'Feb 26,2016' is replaced with DATE and '4:05:26 PM' with TIME;
step 2.2.2, the operation and maintenance data $d$ is generated by a new operation and maintenance data template, and the regular expression library and its set of generalized representation types are updated with the help of a domain expert;
step 2.3, $D=D-\{d\}$; if $|D|\neq 0$, type detection is not yet finished, and the method jumps to step 2.1;
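The type detection loop of step 2 might be sketched as follows. This is a simplified illustration under stated assumptions: the patterns in `RULES` are hypothetical, substitution is done on the whole record before splitting (so that multi-word values such as dates are caught), and the expert-driven update of step 2.2.2 is not modeled.

```python
import re

# Hypothetical (type, pattern) pairs, applied in order; not the patent's actual library.
RULES = [
    ("DATE", re.compile(r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d{1,2},?\s*\d{4}\b")),
    ("DATE", re.compile(r"\b\d{4}-[0-1]\d-[0-3]\d\b")),
    ("TIME", re.compile(r"\b\d{1,2}:\d{2}:\d{2}(?:\s?[AP]M\b)?")),
    ("IP", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
]

def generalize(record):
    """Replace concrete values with their generalized type, then split into tokens."""
    for type_name, pattern in RULES:
        record = pattern.sub(type_name, record)
    return record.split()

def generalize_all(records):
    """Process every record in D once, mirroring the loop of steps 2.1 to 2.3."""
    return [generalize(d) for d in records]

print(generalize("Feb 26, 2016 4:05:26 PM connection from 10.0.0.7 refused"))
```

After this pass, records that differ only in their dates, times or addresses become identical token sequences, which is what the later clustering step relies on.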
step 3, clustering the text operation and maintenance data based on multi-dimensional similarity, wherein the process is as follows:
step 3.1, calculating the similarity of any two pieces of text operation and maintenance data. Any two pieces of text operation and maintenance data $d_1,d_2\in D$ are expressed as $d_1=p_1p_2\ldots p_n$ and $d_2=q_1q_2\ldots q_m$, where $p_1p_2\ldots p_n$ and $q_1q_2\ldots q_m$ are the items (words) of $d_1$ and $d_2$ respectively and, without loss of generality, $n\le m$.
step 3.1.1, defining the syntactic similarity measure $sim_1(d_1,d_2)$, where $\alpha$ is a user-defined parameter with $0\le\alpha\le 1$, and $t(p_i)$ and $t(q_i)$ denote the regular expression type of the $i$-th item or $i$-th word of $d_1$ and $d_2$ respectively.
step 3.1.2, defining the structural similarity measure $sim_2(d_1,d_2)=2|lcs(d_1,d_2)|-|d_2|$, where the function $lcs()$ returns the longest common substring of the strings $d_1$ and $d_2$;
step 3.1.3, defining the semantic similarity measure $sim_3(d_1,d_2)$, where the function $if(w)$ denotes the word frequency of the word $w$, and $sim_w(w,d_2)$ denotes the maximum word similarity between the word $w$ and the words of the sentence $d_2$:
$sim_w(w,d_1)=\max\{sim_w(w,p_i)\mid i=1,\ldots,n\}$
$sim_w(w,d_2)=\max\{sim_w(w,q_j)\mid j=1,\ldots,m\}$;
step 3.1.4, combining the syntactic, structural and semantic similarity measures to produce the overall similarity measure $sim(d_1,d_2)$, where $w_i$ denotes the weight of each similarity measure;
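Two of the quantities above can be sketched directly from the text: the structural measure as printed, and the maximum word similarity $sim_w$. The sketch below makes several assumptions: records are token lists, the longest common substring is taken over tokens, and `word_sim` (a character-level ratio from Python's `difflib`) is only a stand-in, since the text does not say which word-level similarity is used. The syntactic and combined measures are omitted because their formulas are not reproduced in this text.

```python
from difflib import SequenceMatcher

def lcs_length(a, b):
    """Length of the longest common contiguous run of tokens in a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def sim2(d1, d2):
    """Structural measure as printed in the text: 2|lcs(d1, d2)| - |d2|."""
    return 2 * lcs_length(d1, d2) - len(d2)

def word_sim(u, v):
    """Assumed word-level similarity in [0, 1]; a placeholder, not the patent's choice."""
    return SequenceMatcher(None, u, v).ratio()

def sim_w(w, d):
    """Maximum similarity between word w and any word of record d."""
    return max(word_sim(w, q) for q in d)

d1 = ["DATE", "TIME", "login", "failed"]
d2 = ["DATE", "TIME", "login", "failed", "for", "root"]
print(sim2(d1, d2), sim_w("login", d2))
```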
step 3.2, giving operation and maintenance data D, and finishing text operation and maintenance data division by applying a clustering algorithm based on a one-pass thought to form a text operation and maintenance data cluster;
step 3.2.1, initializing parameters. Defining a parameter $d_{max}$ as the maximum distance between a piece of operation and maintenance data and its cluster center, so that the maximum distance between any two pieces of operation and maintenance data in the same cluster is $2\times d_{max}$; denoting the number of clusters by $k$, with $k=0$, and the cluster set by $C=\{c_1,c_2,\ldots,c_k\}$, where $c_k$ denotes a cluster center;
step 3.2.2, processing the operation and maintenance data $d\in D$ one by one:
step 3.2.2.1, if $k=0$, then k += 1, $d$ is assigned to cluster $c_1$, and $d$ is taken as the center of cluster $c_1$;
step 3.2.2.2, computing the similarity measures $\{sim(d,c_i)\mid i=1,\ldots,k\}$ between $d$ and each cluster center; if there exists a cluster $c_i$ satisfying $\min(sim(d,c_i))\le d_{max}$, $d$ is assigned to cluster $c_i$; otherwise a new cluster $c_{k+1}$ is created, $d$ is assigned to cluster $c_{k+1}$ and taken as the center of cluster $c_{k+1}$, and k += 1;
step 3.2.3, $D=D-\{d\}$; if $|D|\neq 0$, the clustering process is not yet complete, and the method jumps to step 3.2.2;
step 3.2.4, forming the cluster set $C$.
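The one-pass procedure of step 3.2 can be sketched as a single loop over the records. The sketch assumes that the measure compared against $d_{max}$ behaves as a distance (smaller means closer), as the threshold test in the text suggests; `jaccard_distance` is a hypothetical stand-in for the combined measure, used only to make the example runnable.

```python
def one_pass_cluster(records, distance, d_max):
    """Assign each record to the nearest center within d_max, else open a new cluster."""
    centers = []   # c_1..c_k: the first record assigned to each cluster
    clusters = []  # members of each cluster, in arrival order
    for d in records:
        if centers:
            dists = [distance(d, c) for c in centers]
            i = min(range(len(dists)), key=dists.__getitem__)
            if dists[i] <= d_max:
                clusters[i].append(d)
                continue
        centers.append(d)
        clusters.append([d])
    return centers, clusters

def jaccard_distance(a, b):
    """Hypothetical distance over token sets, standing in for the patent's measure."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return 1.0 - len(sa & sb) / len(union) if union else 0.0

records = [
    ("DATE", "login", "failed"),
    ("DATE", "login", "failed", "root"),
    ("disk", "full"),
]
centers, clusters = one_pass_cluster(records, jaccard_distance, d_max=0.5)
print(len(centers), clusters)
```

Each record is compared only with the current cluster centers, so the data is read once and new records can be handled as they arrive.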
Step 4, generating event types and labeling operation and maintenance data types, wherein the process is as follows:
step 4.1, generating an event type for any cluster $c_i\in C$, $c_i=\{d_1,d_2,\ldots,d_g\}$, $g=|c_i|$;
step 4.1.1, for any two pieces of operation and maintenance data $d_x,d_y\in c_i$, $1\le x,y\le g$, setting $d'_i=null$;
step 4.1.1.1, applying the Smith-Waterman algorithm to align the operation and maintenance data $d_x$ and $d_y$, obtaining a pair of equal-length operation and maintenance data $d'_x$, $d'_y$ with $l=|d'_x|$;
step 4.1.1.2, merging $d'_x$ and $d'_y$ to obtain $d'_i$:
$d'_i=strcat(d'_i,f(d'_x(i),d'_y(i))\mid i=1,\ldots,l)$
where the function $strcat()$ is a string concatenation function and $Type(*)$ denotes the generalized type.
step 4.1.1.3, $c_i=c_i-\{d_x,d_y\}$, $c_i=c_i\cup\{d'_i\}$; if $|c_i|>1$, jumping to step 4.1.1;
step 4.1.1.4, the resulting $d'_i$ is the event type of cluster $c_i$;
step 4.2, for any cluster $c_i=\{d_1,d_2,\ldots,d_g\}$, $g=|c_i|$, labeling each piece of operation and maintenance data in the cluster with the event type $d'_i$.
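The pairwise merging of step 4 might look like the sketch below. Two substitutions are made and should be read as assumptions: alignment uses Python's `difflib.SequenceMatcher` instead of the Smith-Waterman algorithm named in the text, and the merge function `f`, whose definition is not reproduced here, is replaced by a simple rule that keeps agreeing tokens and writes a hypothetical wildcard `*` elsewhere.

```python
from difflib import SequenceMatcher
from functools import reduce

GAP = None      # padding inserted by the alignment
WILDCARD = "*"  # hypothetical marker for positions where records disagree

def align(x, y):
    """Pad token lists x and y with GAP so that both have equal length."""
    ax, ay = [], []
    matcher = SequenceMatcher(None, x, y, autojunk=False)
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        n = max(i2 - i1, j2 - j1)
        ax += list(x[i1:i2]) + [GAP] * (n - (i2 - i1))
        ay += list(y[j1:j2]) + [GAP] * (n - (j2 - j1))
    return ax, ay

def merge(x, y):
    """Keep tokens on which the aligned records agree; use WILDCARD elsewhere."""
    ax, ay = align(x, y)
    return [a if a == b else WILDCARD for a, b in zip(ax, ay)]

def event_type(cluster):
    """Fold the cluster pair by pair until one template remains."""
    return reduce(merge, cluster)

cluster = [
    ["DATE", "TIME", "login", "failed", "for", "root"],
    ["DATE", "TIME", "login", "failed", "for", "admin"],
    ["DATE", "TIME", "login", "failed", "for", "guest"],
]
print(event_type(cluster))
```

Every record in the cluster would then be labeled with the resulting template, as described in step 4.2.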
Claims (6)
1. An event extraction method for heterogeneous text operation and maintenance data is characterized by comprising the following steps:
step 1, defining a generalization type regular expression: a group of regular expressions are predefined by using dates, time, IP addresses and assignment expressions to describe the dates, the time, the IP addresses and the assignment expressions appearing in the text operation and maintenance data, and a generalization representation type is associated with each regular expression;
step 2, detecting based on the type of the regular expression: preprocessing the given text operation and maintenance data by adopting a predefined regular expression, then detecting the type of each substring, identifying the date, time, IP address and assignment expression, and replacing specific variable values with generalized expression types of the substrings;
step 3, clustering the text operation and maintenance data based on multi-dimensional similarity: combining the three factors of syntax, structure and semantics of the text operation and maintenance data to define a similarity measure, and partitioning the text operation and maintenance data with a one-pass, density-based clustering algorithm to form text operation and maintenance data clusters;
step 4, event type generation and text operation and maintenance data type labeling: generating the event type represented by each cluster by merging the operation and maintenance data in the cluster pair by pair, and associating each piece of text operation and maintenance data in the cluster with the event type of that cluster.
2. The method according to claim 1, wherein the specific steps of step 1 are as follows:
step 1.1, defining a generalized type set $T=\{\text{DATE},\text{TIME},\text{IP},\text{Exp=},\text{Exp:},\text{Exp[]}\}$, where DATE denotes date information, TIME denotes time information, IP denotes internet address information, Exp= denotes an assignment expression using the symbol "=", Exp: denotes an assignment expression using the symbol ":", and Exp[] denotes an assignment expression using the symbol "[]";
step 1.2, associating with each generalized type $t\in T$ a group of regular expressions describing the different forms that may appear in the text operation and maintenance data.
3. The method according to claim 1, wherein the specific steps of step 2 are as follows:
step 2.1, splitting each piece of operation and maintenance data $d$ into a string of words using delimiters such as spaces or symbols, where $d\in D$ and $D$ is the set of operation and maintenance data;
step 2.2, applying the defined regular expression set $E$ to each substring $s$ of the operation and maintenance data; if a substring $s$ is an instance of some predefined regular expression $e$, the type generalization of $s$ succeeds and step 2.2.1 is executed; otherwise step 2.2.2 is executed; wherein $s\in d$ and $e\in E$;
step 2.2.1, replacing the substring $s$ with the generalized representation type $t$ corresponding to the regular expression $e$;
step 2.2.2, the operation and maintenance data $d$ is generated by a new operation and maintenance data template, and the regular expression library and its set of generalized representation types are updated;
step 2.3, $D=D-\{d\}$; if $|D|\neq 0$, type detection is not yet finished, and the method jumps to step 2.1.
4. The method according to claim 1, wherein the specific steps of step 3 are as follows:
step 3.1, for any two pieces of text operation and maintenance data $d_1,d_2\in D$, let $d_1=p_1p_2\ldots p_n$ and $d_2=q_1q_2\ldots q_m$, wherein $p_1p_2\ldots p_n$ and $q_1q_2\ldots q_m$ are the items (words) of $d_1$ and $d_2$ respectively, and $n\le m$;
step 3.2, defining the syntactic similarity measure $sim_1(d_1,d_2)$, wherein $t(p_i)$ and $t(q_i)$ denote the regular expression type of the $i$-th item or $i$-th word of $d_1$ and $d_2$ respectively;
step 3.3, defining the structural similarity measure
$sim_2(d_1,d_2)=2|lcs(d_1,d_2)|-|d_2|$
wherein the function $lcs()$ returns the longest common substring of the strings $d_1$ and $d_2$;
step 3.4, defining the semantic similarity measure $sim_3(d_1,d_2)$, wherein the function $if(w)$ denotes the word frequency of the word $w$, and $sim_w(w,d_2)$ denotes the maximum word similarity between the word $w$ and the words of the sentence $d_2$:
$sim_w(w,d_1)=\max\{sim_w(w,p_i)\mid i=1,\ldots,n\}$
$sim_w(w,d_2)=\max\{sim_w(w,q_j)\mid j=1,\ldots,m\}$;
step 3.5, combining the syntactic, structural and semantic similarity measures to produce the overall similarity measure $sim(d_1,d_2)$, wherein $w_i$ denotes the weight of each similarity measure;
step 3.6, given the operation and maintenance data $D$, partitioning the text operation and maintenance data with a clustering algorithm based on the one-pass idea to form text operation and maintenance data clusters.
5. The method according to claim 4, wherein the specific process of step 3.6 is as follows:
step 3.6.1, defining a parameter $d_{max}$ as the maximum distance between a piece of operation and maintenance data and its cluster center, so that the maximum distance between any two pieces of operation and maintenance data in the same cluster is $2 \times d_{max}$; denoting the number of clusters as $k$, initialized to $k = 0$, and the cluster set as $C = \{c_1, c_2, \ldots, c_k\}$, where $c_k$ denotes a cluster center;
step 3.6.2, processing each piece of operation and maintenance data $d \in D$ one by one:
step 3.6.2.1, if $k = 0$, setting $k = k + 1$, assigning $d$ to cluster $c_1$, and taking $d$ as the center of cluster $c_1$;
step 3.6.2.2, calculating the similarity measures $\{\mathrm{sim}(d, c_i) \mid i = 1, \ldots, k\}$ between $d$ and each cluster center; if there exists a cluster $c_i$ satisfying $\min(\mathrm{sim}(d, c_i)) \le d_{max}$, assigning $d$ to cluster $c_i$; otherwise creating a new cluster $c_{k+1}$, assigning $d$ to cluster $c_{k+1}$, taking $d$ as the center of cluster $c_{k+1}$, and setting $k = k + 1$;
step 3.6.3, $D = D - \{d\}$; if $|D| \neq 0$, the clustering process is not yet complete, so jumping back to step 3.6.2;
step 3.6.4, forming the cluster set $C$.
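The one-pass procedure of steps 3.6.1 to 3.6.4 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the name `one_pass_cluster` is hypothetical, and because the claim compares the measure against a maximum distance $d_{max}$, the sketch takes a caller-supplied distance function `dist` in which smaller values mean closer records.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def one_pass_cluster(
    data: Sequence[T], dist: Callable[[T, T], float], d_max: float
) -> Tuple[List[T], List[List[T]]]:
    """Assign each record to the nearest existing center, or open a new cluster."""
    centers: List[T] = []          # centers[i] is the center of clusters[i]
    clusters: List[List[T]] = []
    for d in data:                 # each record is visited exactly once
        if centers:
            distances = [dist(d, c) for c in centers]
            nearest = min(range(len(centers)), key=distances.__getitem__)
            if distances[nearest] <= d_max:
                clusters[nearest].append(d)
                continue
        # No cluster yet (k = 0) or no center within d_max: d founds a new cluster.
        centers.append(d)
        clusters.append([d])
    return centers, clusters
```

Because the first record assigned to a cluster stays its center, the result depends on the order in which the records arrive, which is characteristic of one-pass clustering.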
6. The method according to claim 5, wherein the specific steps of step 4 are as follows:
step 4.1, generating an event type for any cluster $c_i$, where $c_i \in C$, $c_i = \{d_1, d_2, \ldots, d_g\}$, $g = |c_i|$;
step 4.2, taking any two pieces of operation and maintenance data $d_x, d_y \in c_i$ from cluster $c_i$, with $1 \le x, y \le g$, and initializing $d'_i = \mathrm{null}$;
step 4.3, aligning the operation and maintenance data $d_x$ and $d_y$ to obtain a pair of equal length, $d'_x$ and $d'_y$;
step 4.4, merging $d'_x$ and $d'_y$ to obtain $d'_i$:
$d'_i = \mathrm{strcat}(d'_i, f(d'_x(i), d'_y(i)) \mid i = 1, \ldots, l)$
where $l = |d'_x|$, the function strcat() is a string concatenation function,
and Type(*) denotes a generalized type;
step 4.5, $c_i = c_i - \{d_x, d_y\}$, $c_i = c_i \cup \{d'_i\}$; if $|c_i| > 1$, jumping back to step 4.2;
step 4.6, the resulting $d'_i$ is the event type of cluster $c_i$;
step 4.7, for any cluster $c_i$, $c_i = \{d_1, d_2, \ldots, d_g\}$, $g = |c_i|$, labeling each piece of operation and maintenance data in the cluster with the event type $d'_i$.
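The pairwise merging of steps 4.2 to 4.6 can be sketched as follows. This is a hedged sketch, not the patented method: the claim text does not reproduce the alignment procedure or the definition of $f$, so the sketch assumes records are already split into words, aligns them by padding the shorter one, and replaces every position where the two records differ with a hypothetical placeholder token `<*>` standing in for the generalized type Type(*).

```python
from functools import reduce
from itertools import zip_longest
from typing import List, Sequence

WILDCARD = "<*>"  # hypothetical stand-in for the generalized type Type(*)


def merge_pair(x: Sequence[str], y: Sequence[str]) -> List[str]:
    """Merge two word sequences position by position.

    ASSUMPTION: alignment pads the shorter sequence; equal words are kept
    and differing (or padded) positions are generalized to WILDCARD.
    """
    return [a if a == b else WILDCARD for a, b in zip_longest(x, y)]


def event_type(cluster: Sequence[Sequence[str]]) -> List[str]:
    """Fold the cluster pairwise until a single template remains."""
    if not cluster:
        raise ValueError("cluster must contain at least one record")
    return list(reduce(merge_pair, cluster))


logs = [
    "connection from 10.0.0.1 closed".split(),
    "connection from 10.0.0.2 closed".split(),
    "connection from 10.0.0.3 closed".split(),
]
template = event_type(logs)
```

Folding with `reduce` has the same effect as repeatedly removing two records from the cluster and putting the merged record back until only one remains.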
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910561157.4A CN110347827B (en) | 2019-06-26 | 2019-06-26 | Event Extraction Method for Heterogeneous Text Operation and Maintenance Data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910561157.4A CN110347827B (en) | 2019-06-26 | 2019-06-26 | Event Extraction Method for Heterogeneous Text Operation and Maintenance Data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110347827A (en) | 2019-10-18 |
CN110347827B (en) | 2023-08-22 |
Family
ID=68183197
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN201910561157.4A | Active | CN110347827B (en) | 2019-06-26 | 2019-06-26 | Event Extraction Method for Heterogeneous Text Operation and Maintenance Data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110347827B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111143312A (en) * | 2019-12-24 | 2020-05-12 | 广东电科院能源技术有限责任公司 | Format analysis method, device, equipment and storage medium for power logs |
CN113742116A (en) * | 2020-11-27 | 2021-12-03 | 北京沃东天骏信息技术有限公司 | Abnormity positioning method, abnormity positioning device, abnormity positioning equipment and storage medium |
CN117033464A (en) * | 2023-08-11 | 2023-11-10 | 上海鼎茂信息技术有限公司 | Log parallel analysis algorithm based on clustering and application |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104239436A (en) * | 2014-08-27 | 2014-12-24 | 南京邮电大学 | Network hot event detection method based on text classification and clustering analysis |
CN106339293A (en) * | 2016-08-20 | 2017-01-18 | 南京理工大学 | Signature-based log event extracting method |
CN108536792A (en) * | 2018-03-30 | 2018-09-14 | 东华大学 | A kind of file classification method of the text representation strategy based on more words |
CN109343990A (en) * | 2018-09-25 | 2019-02-15 | 江苏润和软件股份有限公司 | A kind of cloud computing system method for detecting abnormality based on deep learning |
Non-Patent Citations (1)
Title |
---|
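Zhong Yi: "Semantic reconstruction method based on system call analysis on the Xen virtualization platform", Journal of Nanjing University of Science and Technology * |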
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111143312A (en) * | 2019-12-24 | 2020-05-12 | 广东电科院能源技术有限责任公司 | Format analysis method, device, equipment and storage medium for power logs |
CN113742116A (en) * | 2020-11-27 | 2021-12-03 | 北京沃东天骏信息技术有限公司 | Abnormity positioning method, abnormity positioning device, abnormity positioning equipment and storage medium |
CN117033464A (en) * | 2023-08-11 | 2023-11-10 | 上海鼎茂信息技术有限公司 | Log parallel analysis algorithm based on clustering and application |
CN117033464B (en) * | 2023-08-11 | 2024-04-02 | 上海鼎茂信息技术有限公司 | Log parallel analysis algorithm based on clustering and application |
Also Published As
Publication number | Publication date |
---|---|
CN110347827B (en) | 2023-08-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20230334254A1 (en) | Fact checking | |
EP3401802A1 (en) | Webpage training method and device, and search intention identification method and device | |
CN106383877B (en) | Social media online short text clustering and topic detection method | |
CN110175158B (en) | Log template extraction method and system based on vectorization | |
CN108304442B (en) | Text information processing method and device and storage medium | |
CN110795919A (en) | Method, device, equipment and medium for extracting table in PDF document | |
CN110929145B (en) | Public opinion analysis method, public opinion analysis device, computer device and storage medium | |
CN110347827A (en) | Event Distillation method towards isomery text operation/maintenance data | |
CN111666415A (en) | Topic clustering method and device, electronic equipment and storage medium | |
Ikeda et al. | Semi-Supervised Learning for Blog Classification. | |
CN112883730B (en) | Similar text matching method and device, electronic equipment and storage medium | |
CN112836509A (en) | Expert system knowledge base construction method and system | |
CN113011889A (en) | Account abnormity identification method, system, device, equipment and medium | |
Fang et al. | Improving the quality of crowdsourced image labeling via label similarity | |
CN114238573A (en) | Information pushing method and device based on text countermeasure sample | |
US10467276B2 (en) | Systems and methods for merging electronic data collections | |
CN109857892B (en) | Semi-supervised cross-modal Hash retrieval method based on class label transfer | |
Jain et al. | Database-agnostic workload management | |
CN113723542A (en) | Log clustering processing method and system | |
CN115210705A (en) | Vector embedding model for relational tables with invalid or equivalent values | |
Yang et al. | IF-MCA: Importance factor-based multiple correspondence analysis for multimedia data analytics | |
CN110264311B (en) | Business promotion information accurate recommendation method and system based on deep learning | |
US11886467B2 (en) | Method, apparatus, and computer-readable medium for efficiently classifying a data object of unknown type | |
CN115098679A (en) | Method, device, equipment and medium for detecting abnormality of text classification labeling sample | |
CN114996360A (en) | Data analysis method, system, readable storage medium and computer equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||