CN114398898B

CN114398898B - Method for generating KPI curve and marking wave band characteristics based on log event relation

Info

Publication number: CN114398898B
Application number: CN202210292597.6A
Authority: CN
Inventors: 戴曦; 尹立超; 徐旭朝
Original assignee: Three Gorges Zhikong Technology Co ltd
Current assignee: Three Gorges Zhikong Technology Co ltd
Priority date: 2022-03-24
Filing date: 2022-03-24
Publication date: 2022-06-24
Anticipated expiration: 2042-03-24
Also published as: CN114398898A

Abstract

The invention discloses a method for generating a KPI curve based on a log event relation and marking waveband features, which comprises the steps of firstly generating a log KPI curve according to the relation of events in a log, then dividing the KPI curve into a plurality of wavebands with equal lengths, clustering the wavebands into a plurality of clusters according to the non-time dimension of the wavebands, extracting the fundamental wave of each cluster, comparing the similarity of each waveband data of each cluster and the fundamental wave, finding out the grouping boundary line of each cluster, grouping each waveband data of each cluster, extracting the total time length of continuous similar wavebands in each cluster, and taking the maximum value of the total time length as the width of a sliding window. The window is used for segmenting the KPI curve, so that the wave bands in each segmented window are easy to cluster and classify, the whole KPI curve is favorably and rapidly divided into wave band chains consisting of different types of wave bands, then the KPI curve of an individual monitoring index is subjected to periodic detection and type detection marking, the individual KPI curve is segmented by using the window, and the wave bands in the fundamental wave KPI curve are used for grouping and labeling.

Description

Method for generating KPI curve and marking wave band characteristics based on log event relation

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to a method for generating a KPI curve and marking wave band characteristics based on log event relation.

Background

Anomaly detection (also known as outlier detection) is the process of finding objects whose behavior differs from what is expected; such objects are called anomalies or outliers. Anomaly detection methods generally include statistics-based models, distance-based models, linear-transformation models, nonlinear-transformation models, machine-learning models, and the like.

KPIs (key performance indicators) are monitoring metrics (e.g., delay and throughput in a network) for objects such as services and systems. They are stored as a sequence ordered by time of occurrence, i.e., what is generally called a time series. Anomaly detection on a time series checks, through analysis of historical data, whether the current data deviate significantly from the normal condition. KPI anomaly detection is very important: by monitoring KPI data in real time, anomalies are discovered and handled promptly, which ensures the normal operation of the application.

Methods for performing real-time anomaly detection by setting a threshold value for KPI data are quite common, but methods for performing real-time anomaly detection for system logs have not been reported publicly.

To pursue effectiveness, traditional machine learning mostly adopts supervised learning. In practice, however, anomaly labels are difficult to obtain in batches, and model accuracy improves only with massive labeled samples, so a large number of business experts are needed to label KPI curves manually, often with repeated adjustment and correction, which is time-consuming and labor-intensive. Moreover, millions or tens of millions of KPIs may need to be monitored at the same time, so in actual anomaly detection practice no single algorithm meets all of these requirements at once. Unsupervised techniques such as clustering are mainly used for feature discovery and data exploration; because labels are lacking, the results must be interpreted by a data scientist before they can be mapped to business patterns, and they cannot be acted on directly. Weakly supervised approaches introduce unsupervised and supervised methods in stages and improve accuracy through iterative recursion; they are too academic to deploy easily, and, in order to fuse the specific methods, they need vector representations to unify the representation across methods, which makes the results hard for application personnel to understand.

The larger the data volume and the more complex the business scenario, the more complex the methods that must be introduced and the greater and more varied the required investment of cost and manpower. This cycle directly limits the adoption of machine learning across industries and concentrates it in higher-revenue industries. Traditional industries therefore tend to take a passive, defensive stance: they wait for the overall level of the field to rise and then migrate methods from other business scenarios. Specifically, if a method proves particularly effective in another industry, they borrow it on a trial basis, observe the effect, and adopt it if feasible. Industrial applications are one such passively defensive scenario.

Disclosure of Invention

The first purpose of the invention is to provide a method for generating KPI curves and marking wave band characteristics based on log event relations, which processes text logs generated by monitoring indexes in an industrial control system, combines highly correlated events into a same group, and generates log KPI curves periodically correlated with the KPI curves of the monitored indexes.

The technical scheme of the invention is as follows: a method for generating KPI curves based on log event relations comprises the following steps:

Step F1: setting a training sentence subset consisting of training sentences; obtaining the fault logs produced, based on monitoring indexes, by the industrial control equipment in the same industrial control system; forming sentence pairs to be processed from each corpus entry in the fault logs and each training sentence; calculating the similarity of each pair; and deleting the corpus entries whose similarity is lower than threshold one;

Step F2: performing word segmentation on the corpus remaining after step F1 to generate a word segmentation queue consisting of a plurality of feature words, and tagging the parts of speech of the feature words to obtain a part-of-speech queue of the corpus;

Step F3: if the part-of-speech queue contains special feature words corresponding to special parts of speech, obtaining the boundaries and categories of named entities from those special feature words by using a named entity recognition model, and updating the parts of speech of the special feature words in the part-of-speech queue to the boundaries and categories of the named entities to obtain an updated part-of-speech queue, wherein the special parts of speech include numerals and time words;

Step F4: classifying the remaining corpus according to the tagging of step F3; counting the occurrence frequency of each category of part-of-speech queue and sorting in descending order; selecting the part-of-speech queues ranked above threshold two; counting the occurrence frequency of each verb and each noun in each category of part-of-speech queue and sorting in descending order; screening, according to a ranking threshold, the two top-ranked sets of part-of-speech queues by verb frequency and by noun frequency respectively; and extracting the corpus corresponding to the intersection of the two sets to construct a real training set;

Step F5: screening, from the corpus of the real training set, the word segmentation queues whose part-of-speech tag combination is [n, v, n], where n denotes a noun and v denotes a verb, and extracting the first and second segmented words whose part of speech is noun or proper noun as event 1 and event 2 respectively to form an event tuple;

Step F6: based on an existing fault event relationship table, using the Snowball algorithm to find the event association rules of the event tuples, and finding the associated event groups among the event tuples according to those rules, thereby generating a log key event relationship table;

Step F7: repeating step F6 based on the log key event relationship table until convergence;

Step F8: using each event relation generated in step F7 as a log key event label to mark the fault logs, using the number of occurrences per minute of each log key event label as a monitoring index to establish each log KPI curve, and smoothing each log KPI curve with a Gaussian kernel.
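The curve construction in step F8 can be illustrated with a minimal sketch. It assumes that label occurrences are already available as integer minute indices, and it uses a truncated Gaussian kernel that is renormalized at the edges; the names `per_minute_counts` and `gaussian_smooth`, the `sigma` and `radius` parameters, and the sample data are all hypothetical and not taken from the patent.

```python
import math
from collections import Counter

def per_minute_counts(event_minutes, start, end):
    """Count label occurrences for each minute in [start, end)."""
    counts = Counter(event_minutes)
    return [counts.get(m, 0) for m in range(start, end)]

def gaussian_smooth(series, sigma=1.0, radius=3):
    """Smooth a series with a truncated Gaussian kernel (assumed form)."""
    offsets = range(-radius, radius + 1)
    weights = [math.exp(-(o * o) / (2 * sigma * sigma)) for o in offsets]
    smoothed = []
    for i in range(len(series)):
        acc = 0.0
        norm = 0.0
        for o, w in zip(offsets, weights):
            j = i + o
            if 0 <= j < len(series):  # renormalize near the edges
                acc += w * series[j]
                norm += w
        smoothed.append(acc / norm)
    return smoothed

# Hypothetical minute indices at which one log key event label occurred.
minutes = [0, 0, 1, 4, 4, 4, 5, 9]
curve = per_minute_counts(minutes, 0, 10)          # raw log KPI curve
smooth_curve = gaussian_smooth(curve, sigma=1.0, radius=2)
```

Each log key event label would get its own curve in this way; the kernel width is a tuning choice that the text does not specify.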

Advantageously, the same industrial control system is composed of industrial control devices that have a direct or indirect material supply, electric energy transfer, thermal energy transfer, mechanical energy transfer, magnetic field transfer, energy conversion or signal control relationship. The industrial control devices in the same system produce fault logs based on monitoring indexes, and because the monitoring indexes are correlated, the fault logs are correlated as well. Each log record of a monitoring index differs partially in its text, so direct clustering would require a large amount of manual indexing and screening. Log texts that describe the behavior or state of devices have similar sentence structures and similar part-of-speech queue features; steps F1-F4 screen out the texts with similar part-of-speech queues and eliminate log texts that do not record device behavior or state. Pairs of nouns in a text often have a specific logical association, and according to this association highly related event relations can be merged into the same group to generate a log KPI curve that is periodically correlated with the KPI curve of the monitored index.

Further, the calculating of the similarity in step F1 includes the steps of: respectively segmenting the sentences in the sentence pairs based on a pre-constructed corpus, wherein the pre-constructed corpus comprises an industry corpus and a common corpus;

and converting each characteristic word of the sentence after word segmentation into a word vector, respectively calculating the similarity of each sentence pair by using cosine similarity, and deleting the corpus if the similarity is lower than a threshold value one.
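A minimal sketch of this similarity filter follows. Two points are assumptions not stated in the text: a sentence vector is taken as the average of its word vectors, and a corpus entry is kept when its best similarity against any training sentence reaches the threshold. The tiny embedding table and all function names are hypothetical.

```python
import math

def sentence_vector(tokens, embeddings):
    """Average the word vectors of the tokens found in the embedding table."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def filter_corpus(corpus, training, embeddings, threshold):
    """Keep corpus entries similar enough to at least one training sentence."""
    train_vecs = [sentence_vector(t, embeddings) for t in training]
    train_vecs = [v for v in train_vecs if v is not None]
    kept = []
    for tokens in corpus:
        vec = sentence_vector(tokens, embeddings)
        if vec is None:
            continue
        best = max((cosine(vec, tv) for tv in train_vecs), default=0.0)
        if best >= threshold:
            kept.append(tokens)
    return kept

# Hypothetical 2-dimensional word vectors, for illustration only.
embeddings = {"pump": [1.0, 0.0], "valve": [0.9, 0.1],
              "start": [0.0, 1.0], "stop": [0.1, 0.9], "hello": [-1.0, 0.0]}
training = [["pump", "start"]]
corpus = [["valve", "stop"], ["hello"]]
kept = filter_corpus(corpus, training, embeddings, threshold=0.9)
```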

Further, step F8 is followed by:

Step Z01: extracting a spectrum intensity graph of each log KPI curve by using the Fourier transform;

Step Z02: extracting the point with the highest amplitude and calculating the corresponding period, namely the period to be checked;

Step Z03: setting an assumed period, namely the expected period; checking the correlation strength of the period to be checked if and only if its length is within 95-105% of the expected period; determining the period to be checked as a qualifying period if the spectrum intensity is sufficient; and marking the filtered log KPI curves according to their periodicity, the mark being called the log KPI curve period label.

The period check marks each waveform as periodic or non-periodic. A periodic mark indicates that regular, repeated events exist; in terms of business knowledge, such information usually corresponds to things like state checks or rotating parts. A non-periodic mark, by contrast, indicates event-driven activity. Both are business labels used in other steps and do not affect other operations. Similarity between periodic KPIs may arise for many reasons without any business relation, whereas similarity between non-periodic KPIs is more likely to reflect a direct or indirect relation.
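The period check can be sketched as follows, using a plain discrete Fourier transform written out in pure Python. The text only says the spectrum intensity must be "sufficient", so the `min_amplitude` cut-off is a hypothetical parameter, as are the function names and the synthetic sine series.

```python
import cmath
import math

def dominant_period(series):
    """Return (period, amplitude) of the strongest non-DC DFT component."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    best_k, best_amp = 0, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        amp = abs(coeff) / n
        if amp > best_amp:
            best_k, best_amp = k, amp
    period = n / best_k if best_k else float("inf")
    return period, best_amp

def period_label(series, expected_period, min_amplitude):
    """Label a curve periodic only if the dominant period lies within
    95-105% of the expected period and its amplitude is large enough."""
    period, amp = dominant_period(series)
    in_range = 0.95 * expected_period <= period <= 1.05 * expected_period
    return "periodic" if in_range and amp >= min_amplitude else "non-periodic"

# Synthetic curve with a 12-minute period, sampled once per minute.
series = [math.sin(2 * math.pi * t / 12) for t in range(120)]
label = period_label(series, expected_period=12, min_amplitude=0.3)
```

A production version would normally use an FFT routine; the direct sum is used here only to keep the sketch dependency-free.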

Further, step Z03 is followed by:

Step Z04: calculating the pairwise similarity of the log KPI curves by using the NCC algorithm and filling the values into a similarity matrix that is symmetric about its diagonal, wherein the row and column indices of the matrix are the numbers of the log KPI curves and the number of rows and columns equals the number of log KPI curves;

Step Z05: using a spectral clustering algorithm to output different clusters according to the similarity matrix, and marking the different clusters with different log KPI curve labels, called KPI curve service labels.

Advantageously, the KPI curves are clustered and classified according to overall similarity of the KPI curves to form clusters with similar waveforms.
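The similarity matrix of step Z04 can be sketched as below. The text does not define NCC in detail; this sketch assumes zero-lag normalized cross-correlation of mean-centered, equal-length curves, which is one common reading. The function names and sample curves are hypothetical.

```python
import math

def ncc(a, b):
    """Zero-lag normalized cross-correlation of two equal-length series."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    da = [x - ma for x in a]
    db = [y - mb for y in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    return sum(x * y for x, y in zip(da, db)) / denom if denom else 0.0

def similarity_matrix(curves):
    """Symmetric matrix whose (i, j) entry is the NCC of curves i and j."""
    n = len(curves)
    sim = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sim[i][j] = sim[j][i] = ncc(curves[i], curves[j])
    return sim

# Hypothetical log KPI curves: two rising, one falling.
curves = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]]
sim = similarity_matrix(curves)
```

A matrix of this shape could then be handed to any spectral clustering implementation that accepts a precomputed affinity, after mapping negative correlations to non-negative affinities if the implementation requires it.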

Further, step F6 includes:

Step C1: using the existing fault event relationship table to match, among the event tuples, the queues that contain events from the table, and generating templates. A template is a five-tuple: <left>, event 1 type, <middle>, event 2 type, <right>, where len is a length that can be set arbitrarily, <left> is the vector representation of the len words to the left of event 1, <middle> is the vector representation of the words between event 1 and event 2, and <right> is the vector representation of the len words to the right of event 2;

Step C2: clustering the generated templates, grouping templates whose similarity is greater than threshold three into one class, generating a new template for each class by averaging, and adding the new template to the rule base that stores the templates. Following step C1, a template can be written as

$$P = \langle \vec{l},\ E_1,\ \vec{m},\ E_2,\ \vec{r} \rangle,$$

where $E_1$ and $E_2$ denote the event 1 type and the event 2 type of template $P$, $\vec{l}$ is the vector representation of the three words to the left of $E_1$, $\vec{m}$ is the vector representation of the words between $E_1$ and $E_2$, and $\vec{r}$ is the vector representation of the three words to the right of $E_2$. For the similarity calculation between templates, take template 1,

$$P_1 = \langle \vec{l}_1,\ E_1,\ \vec{m}_1,\ E_2,\ \vec{r}_1 \rangle,$$

and template 2,

$$P_2 = \langle \vec{l}_2,\ E'_1,\ \vec{m}_2,\ E'_2,\ \vec{r}_2 \rangle.$$

If the condition $E_1 = E'_1 \wedge E_2 = E'_2$ is satisfied, i.e. the event 1 type $E_1$ of template $P_1$ is the same as the event 1 type $E'_1$ of template $P_2$, and the event 2 type $E_2$ of template $P_1$ is the same as the event 2 type $E'_2$ of template $P_2$, then the similarity between $P_1$ and $P_2$ can be calculated as

$$\mathrm{Match}(P_1, P_2) = \mu_1\, \vec{l}_1 \cdot \vec{l}_2 + \mu_2\, \vec{m}_1 \cdot \vec{m}_2 + \mu_3\, \vec{r}_1 \cdot \vec{r}_2,$$

where $\mu_1$, $\mu_2$, $\mu_3$ are weights. Because the middle vector $\vec{m}$ greatly influences the calculated similarity between templates, one can set $\mu_2 > \mu_1 > \mu_3$. If the condition is not satisfied, the similarity between template $P_1$ and template $P_2$ is 0;

Step C3: calculating, one by one, the similarity between the event tuple templates obtained in step C1 and the templates in the rule base; discarding a template if the similarity is smaller than threshold three; and adding the events in the templates whose similarity is greater than threshold three to the log key event relationship table, which replaces the fault event relationship table.
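The template similarity used in steps C2 and C3 can be sketched as follows. It assumes, as in the Snowball algorithm, that the similarity is a weighted sum of the dot products of the left, middle and right context vectors when both event types agree, and 0 otherwise. The weight values are hypothetical and chosen only to satisfy the ordering "middle > left > right"; the function names and sample templates are likewise illustrative.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

def match(p1, p2, mu=(0.3, 0.5, 0.2)):
    """Weighted similarity between two five-tuple templates
    (left, event1_type, middle, event2_type, right)."""
    left1, e1a, mid1, e2a, right1 = p1
    left2, e1b, mid2, e2b, right2 = p2
    if e1a != e1b or e2a != e2b:
        return 0.0  # event types differ: similarity is defined as 0
    mu1, mu2, mu3 = mu  # assumed weights with mu2 > mu1 > mu3
    return (mu1 * dot(left1, left2)
            + mu2 * dot(mid1, mid2)
            + mu3 * dot(right1, right2))

# Hypothetical templates with 2-dimensional unit context vectors.
p1 = ([1.0, 0.0], "pump", [0.0, 1.0], "valve", [1.0, 0.0])
p2 = ([1.0, 0.0], "pump", [0.0, 1.0], "valve", [0.0, 1.0])
p3 = ([1.0, 0.0], "pump", [0.0, 1.0], "sensor", [1.0, 0.0])
score_same_types = match(p1, p2)
score_diff_types = match(p1, p3)
```

With unit-length context vectors and weights summing to 1, the score stays in [-1, 1], which makes a fixed threshold three easier to interpret.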

The invention also aims to provide a method for marking waveband characteristics by the KPI curve, which comprises the steps of dividing the KPI curve into a plurality of wavebands with equal length, clustering into a plurality of clusters according to the non-time dimension of the wavebands, extracting the fundamental wave of each cluster, comparing the similarity between each waveband data of each cluster and the fundamental wave, finding out the grouping boundary line of each cluster, grouping each waveband data of each cluster, extracting the total time length of continuous similar wavebands in each cluster, and taking the maximum value of the total time length as the width of a sliding window. The window is used for partitioning the log KPI curve, so that the wave bands in each partitioned window are easy to cluster and classify, the whole log KPI curve is favorably and rapidly divided into wave band chains consisting of different types of wave bands, then the log KPI curve of an individual monitoring index is subjected to periodic detection and type detection marking, the individual log KPI curve is partitioned by using the window, and the wave bands in the log KPI curve are subjected to grouping and labeling by using fundamental waves.

The method for marking the wave band characteristics of the KPI curve obtained by the method comprises the following steps:

step A1, merging data points of all minutes in all log KPI curves, dividing the data points into a plurality of band segments with time width of s minutes, clustering the band segments into a plurality of clusters according to non-time dimensions of the band segments, extracting fundamental waves of all the clusters, comparing similarity between band data of all the clusters and the fundamental waves, finding out grouping boundary lines of all the clusters, and grouping the band data of all the clusters;

step A2, extracting the timestamps of the log KPI curve data set segments assigned to each group to obtain a timestamp list for each group;

step A3, differencing the timestamp list of each group item by item, namely subtracting the starting timestamp of the current item from the starting timestamp of the next item in the list, to obtain an event trigger interval list;

step A4, combining the event trigger intervals of each cluster into a time interval KPI set, and calculating the similarity between the time interval KPI sets of each cluster according to NCC;

step A5, expanding the similarity of the time interval KPI sets among the clusters obtained in the step A4 into a similarity matrix;

step A6, sorting the similarities of the time interval KPI sets among the clusters by value, fitting the sorted values to a smooth line, and obtaining the boundary of the similarity of the time interval KPI sets among the clusters by the knee point method;

step A7, marking adjacent clusters whose value in the similarity matrix is larger than the inflection point as the same similar group, and counting the number of clusters in each similar group;

step A8, taking the total time interval of the similar group containing the most clusters as the width of the sliding window;

step A9, dividing each log KPI curve, according to the sliding window obtained in step A8, into several log KPI curve window segments whose time width is the total time interval, and dividing each window segment, by the dividing method of step A1, into i log KPI curve data set segments with a time width of 1 minute, each segment being a band;

comparing the similarity of each fundamental wave obtained in step A1 with each band in each window of each log KPI curve one by one, sorting the similarities from large to small, finding the grouping boundary lines from the sorting, and grouping the bands to form a label chain of fundamental wave labels, thereby obtaining the pattern waveforms of the different KPIs, called KPI curve code pattern rearrangement tables;

and step A10, aligning the different KPI curve code pattern rearrangement tables along a common time dimension to obtain a KPI curve code pattern rearrangement association table.
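Two of the steps above lend themselves to a short sketch: the event trigger intervals of step A3 and the knee point of step A6. The knee is located here with a simple "farthest from the chord" heuristic applied to the sorted values, which is only one of several knee-finding methods and skips the smoothing fit mentioned in the text; all names and sample values are hypothetical.

```python
def trigger_intervals(timestamps):
    """Differences between consecutive start timestamps (step A3)."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]

def knee_index(values):
    """Index of the point farthest from the chord joining the first and
    last points of a sorted sequence (a simple knee heuristic)."""
    n = len(values)
    if n < 3:
        return 0
    y0, y1 = values[0], values[-1]
    best_i, best_d = 0, -1.0
    for i, y in enumerate(values):
        # Unnormalized perpendicular distance from (i, y) to the chord.
        d = abs((y1 - y0) * i - (n - 1) * (y - y0))
        if d > best_d:
            best_i, best_d = i, d
    return best_i

# Hypothetical start timestamps (in minutes) for two groups.
group_timestamps = {"group_0": [0, 5, 10, 15], "group_1": [2, 3, 9]}
interval_lists = {g: trigger_intervals(ts) for g, ts in group_timestamps.items()}

# Hypothetical pairwise similarities, sorted from large to small.
sorted_sims = [0.95, 0.93, 0.90, 0.40, 0.35, 0.30]
knee = knee_index(sorted_sims)
knee_value = sorted_sims[knee]
```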

Advantageously, the label information obtained after the log KPI curve is processed contains all information of all bands, including two parts of band and waveform representation, the band labels are the fundamental wave type and the time arrangement information of the fundamental wave label, and the waveform label includes two kinds of service labels and period labels.

Different KPI curves that carry the same KPI curve service label may have a causal relationship, and the probability is higher for non-periodic KPI curves than for periodic ones.

Different KPI curves that show the same fundamental wave label pattern in adjacent time segments may have a causal relationship, and the probability is higher the more often the pattern repeats.

Further, step A1 includes the following steps:

Step J1: extracting the data points of all log KPI curves in each minute into the same curve set L, and dividing the curve set L into a plurality of log KPI curve data sets $M_i$ with a time width of s minutes, where i is the segment number;

Step J2: using the DBSCAN algorithm, calculating the Euclidean distance between the segments based on the attributes of each log KPI curve data set, and clustering the i segments of log KPI curve data sets to obtain k clusters and the abnormal items, wherein each cluster is a grouped data set and each grouped data set comprises j log KPI curve data sets $F_j$;

Step J3: calculating the arithmetic mean $\sum F_j / j$ of the j log KPI curve data sets in each grouped data set as the fundamental wave of the group;

Step J4: using the NCC algorithm to calculate the waveform similarity between each log KPI curve data set $F_j$ of each grouped data set and the fundamental wave, sorting from large to small, and, among the log KPI curve data sets $F_j$ whose waveform similarity ranks in the top 95%, taking the minimum waveform similarity as the grouping boundary line $B_k$ of the group;

Step J5: using the NCC algorithm to calculate the waveform similarity $NCC_{M_i\text{-}J_k}$ between each log KPI curve data set $M_i$ and the fundamental wave of each group; judging, with the grouping boundary line of each group as the reference, whether each log KPI curve data set belongs to the group; ranking a log KPI curve data set that belongs to several groups at once by the classification score Q; and assigning the log KPI curve data set $M_i$ to the group with the minimum classification score Q, thereby obtaining the grouping information of each log KPI curve data set, where

$$Q = \left( \frac{1 - NCC_{M_i\text{-}J_k}}{1 - B_k} \right)^2.$$
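Steps J3 to J5 can be sketched as follows. The sketch assumes zero-lag normalized cross-correlation for NCC and treats a segment as belonging to a group when its similarity to that group's fundamental wave is at least the group's boundary. The guard against a boundary of exactly 1 (which would divide by zero in Q) is an added safeguard, and all names and sample values are hypothetical.

```python
import math

def ncc(a, b):
    """Zero-lag normalized cross-correlation of two equal-length series."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    da = [x - ma for x in a]
    db = [y - mb for y in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    return sum(x * y for x, y in zip(da, db)) / denom if denom else 0.0

def fundamental_wave(segments):
    """Point-wise arithmetic mean of the segments in one group (step J3)."""
    return [sum(col) / len(segments) for col in zip(*segments)]

def boundary(segments, wave, keep=0.95):
    """Minimum NCC among the top `keep` fraction of segments (step J4)."""
    sims = sorted((ncc(s, wave) for s in segments), reverse=True)
    top = sims[:max(1, math.ceil(keep * len(sims)))]
    return top[-1]

def classification_score(similarity, b_k):
    """Q = ((1 - NCC) / (1 - B_k)) ** 2 (step J5)."""
    return ((1 - similarity) / (1 - b_k)) ** 2

def assign(segment, groups):
    """groups maps name -> (wave, b_k); return the qualifying group with
    the smallest Q, or None if the segment fits no group."""
    candidates = []
    for name, (wave, b_k) in groups.items():
        sim = ncc(segment, wave)
        if b_k < 1 and sim >= b_k:
            candidates.append((classification_score(sim, b_k), name))
    return min(candidates)[1] if candidates else None

# Hypothetical fundamental waves and boundaries for two groups.
groups = {"rise": ([1, 2, 3, 4], 0.9), "fall": ([4, 3, 2, 1], 0.9)}
label_a = assign([1, 2, 3, 5], groups)
label_b = assign([1, 3, 1, 3], groups)
```

Normalizing by `1 - B_k` makes scores comparable across groups whose boundaries differ in tightness, which is presumably why the minimum Q rather than the maximum raw similarity decides ties.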

Further, step A7 is replaced with: replacing each similarity value in the similarity matrix that is larger than the inflection point with 1, and each similarity value lower than the inflection point with 0;

and marking adjacent clusters whose similarity in the resulting matrix is 1 as the same similar group, and counting the number of clusters in each similar group.
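This replacement of step A7 can be sketched as a thresholding step followed by grouping. The text does not define "adjacent" precisely; the sketch assumes that clusters linked directly or indirectly by a 1 in the binary matrix fall into the same similar group, i.e. it computes connected components. Values equal to the inflection point are mapped to 0 here, which is also an assumption, and the names and sample matrix are hypothetical.

```python
def binarize(sim, inflection):
    """Replace values above the inflection point with 1, others with 0."""
    return [[1 if v > inflection else 0 for v in row] for row in sim]

def similar_groups(binary):
    """Connected components of the graph whose adjacency matrix is `binary`."""
    n = len(binary)
    seen, groups = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:
            i = stack.pop()
            component.append(i)
            for j in range(n):
                if binary[i][j] and j not in seen:
                    seen.add(j)
                    stack.append(j)
        groups.append(sorted(component))
    return groups

# Hypothetical similarity matrix between four clusters.
sim = [[1.0, 0.9, 0.2, 0.1],
       [0.9, 1.0, 0.85, 0.1],
       [0.2, 0.85, 1.0, 0.3],
       [0.1, 0.1, 0.3, 1.0]]
binary = binarize(sim, inflection=0.8)
groups = similar_groups(binary)
largest = max(groups, key=len)  # the group whose total interval sets the window
```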

Further, the step of dividing the KPI curve window segments into bands in step A9 is: using the NCC algorithm to calculate, one by one, the waveform similarity between each fundamental wave obtained in step A1 and each band in each window of each log KPI curve; sorting the similarities from large to small; among the bands whose waveform similarity ranks in the top 95%, taking the minimum waveform similarity as the grouping boundary line $B'_k$ of the group; judging, with the grouping boundary line of each group as the reference, whether each band (log KPI curve data set) belongs to the group; ranking a band that belongs to several groups at once by its classification score; and assigning the band to the group with the minimum classification score, so as to form a label chain of fundamental wave labels and obtain the pattern waveforms of the different KPIs, called KPI curve code pattern rearrangement tables. The classification score is computed in the same way as Q in step J5, with $B'_k$ in place of $B_k$.

Further, after all label chains are arranged along the time dimension, the causal relationships among different label chains occurring at different times are discovered based on a sequence mining algorithm, SPADE or GSP.

Specific nouns in the fault log texts generated by the industrial control equipment of the same industrial control system influence one another causally, which shows up as paired nouns appearing together because of the same cause. Similar noun queues can therefore be classified into one class, namely the event relations obtained in step F8, and counting the frequency of each event relation yields a log KPI curve. The log KPI curve occurs together with the index KPI curve that the industrial control equipment obtains by monitoring the analog quantities of physical parameters. Since an index KPI curve can be classified and clustered into a band chain with label ordering features, the log KPI curve has the same band chain features. The band chain features of index KPI curves produced by the same cause for different physical parameters are similar, and so are the band chain features of log KPI curves produced by the same cause for different event relations.

To find the band chains, a sliding window of suitable width is slid along the log KPI curve. Each window position cuts out a log KPI curve unit segment, from which several bands of equal length are extracted. Each band in the unit segment is labeled according to its similarity to the characteristic fundamental waves, which turns the unit segment into a band chain with label ordering features. Each slide of the window thus yields one band chain; all band chains have the same length and differ only in the ordering of the band classification labels. After all band chains obtained by the sliding window are arranged along the time dimension, the causal relationships in time between band chains with different ordering features can be obtained by fusing the sequence mining algorithm SPADE, expert evaluation and a knowledge graph, which gives the causal relationships between event relations. This helps experts supplement the knowledge system for fault identification in the system and discover previously unknown associations between monitoring indexes, so that new early-warning control relationships and new regulation thresholds can be established from the newly discovered associations during operation, improving the stability of every monitored object in the same system.

The technical problem solved by the invention is similar to that of the prior art CN110726898B. The feature compression code that CN110726898B obtains by feeding a waveform into an autoencoder network corresponds, in the present invention, to extracting band chains from the KPI curve or deriving event tuples from the fault log. Feeding the compression code into a classification model to obtain the fault waveform type corresponds to obtaining the causal relationships in time between band chains with different features by fusing the sequence mining algorithm SPADE, expert evaluation and a knowledge graph; it likewise corresponds to feeding event tuples into the existing fault event relationship table (the classification model) and classifying them into associated event groups with Snowball.

Drawings

FIG. 1 is a log KPI curve generated from fault logs generated by industrial control equipment in the same industrial control system;

FIG. 2 is a label chain of formed fundamental wave labels;

FIG. 3 shows the categories of the log KPI curves generated from the fault log text and clustered.

Detailed Description

The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention; it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments, and all other embodiments obtained by those skilled in the art without any inventive work are within the scope of the present invention. In the following embodiments, the label chain and the band chain are the same meaning, and the KPI curve unit segment and the KPI curve window segment are the same meaning. The same industrial control system is composed of industrial control devices which have a direct or indirect material supply relationship, an electric energy transfer relationship, a heat energy transfer relationship, a mechanical energy transfer relationship, a magnetic field transfer relationship, an energy conversion relationship or a signal control relationship, the industrial control devices in the same industrial control system obtain fault logs based on monitoring indexes, and the fault logs also have correlation due to the fact that the monitoring indexes have correlation.

Example 1

A method for generating KPIs based on log keyword clustering comprises the following steps:

R1: collecting the fault logs produced, based on monitoring indexes, by the industrial control equipment in the same power station industrial control system network, constructing event tuples from the fault logs, and processing them with the Snowball algorithm to construct event relations.

The method for constructing the event tuple comprises the following steps:

F1: setting a training sentence subset consisting of training sentences, extracting corpus entries from the fault log to form a sentence pair to be processed with each training sentence, and segmenting the sentences in each sentence pair based on a pre-constructed corpus, wherein the pre-constructed corpus comprises an industry corpus and a common corpus;

F2: converting each feature word of the segmented sentences into a word vector, calculating the similarity of each sentence pair by using cosine similarity, and deleting the corpus entry if the similarity is lower than a threshold, wherein the threshold is set to 0.9;

Steps F1-F2 pick out from the fault logs the sentences whose semantic structure serves reference, behavior recording or state description. The general grammar of fault logs in an industrial control system has a sentence structure with little ambiguity, which helps remove erroneous entries from the fault logs and keep the industrial record entries;

The corpus is segmented by using the jieba cut method:

def cut(sentence, cut_all=False, HMM=True)

where sentence is the sentence sample to be segmented; cut_all selects the segmentation mode: jieba offers a full mode and an accurate mode, selected by True and False respectively, and the default is False, i.e. the accurate mode; HMM controls whether the hidden Markov model, the theoretical model used for word segmentation, is applied, and it is enabled by default.

F3: performing word segmentation on the corpus remaining after step F2 to form a word segmentation queue of feature words, and tagging the parts of speech of the feature words to obtain a part-of-speech queue of the corpus;

The part-of-speech category code of each input word is returned by using jieba's part-of-speech tagging. Yang Jun describes the usage steps and the part-of-speech classification table of jieba.

Step F4, if the part-of-speech queue contains special feature words corresponding to special parts of speech, obtaining the boundary and category of each named entity from the special feature words using a named entity recognition model, and updating the parts of speech of those special feature words in the part-of-speech queue to the boundary and category of the named entity, to obtain an updated part-of-speech queue;

The special parts of speech include numerals and time words. In the application scenario of this embodiment, numeric values and times are distinguished only by part of speech, so inaccurate identification occurs easily. For example, in FIG. 3, segmenting the corpus "16:10:23 (set I) signal pulse allow" yields the part-of-speech queue "{16: m, :: x, 10: m, :: x, 23: m, (: x, set I: n, ): x, signal: n,: v, pulse: n, allow: v}", where m denotes a numeral, x a character string, n a noun, and v a verb. The part-of-speech queue obtained after step F4 is "{16:17:00: t, (: x, set I: n, ): x, signal: n, occurrence: v, another channel: n, reception: v}", where t denotes a time word. This prevents a time word that is hard to identify from being tagged as a numeral, so that queues containing time words and queues containing numerals can be told apart by their part-of-speech queues.
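The effect of step F4 on time words can be illustrated with a small rule-based stand-in. The patent uses a named entity recognition model for this; the function below is only a hypothetical simplification that merges the pattern number, ":", number, ":", number into a single token tagged t, and its name and data layout (a list of (word, tag) pairs) are assumptions.

```python
def merge_time_tokens(pos_queue):
    """Collapse runs like (16,m) (:,x) (10,m) (:,x) (23,m) into one ('16:10:23','t') item."""
    merged = []
    i = 0
    n = len(pos_queue)
    while i < n:
        # Try to match number ':' number ':' number starting at position i.
        window = pos_queue[i:i + 5]
        is_time = (
            len(window) == 5
            and all(window[k][1] == "m" and window[k][0].isdigit() for k in (0, 2, 4))
            and all(window[k] == (":", "x") for k in (1, 3))
        )
        if is_time:
            merged.append((":".join(window[k][0] for k in (0, 2, 4)), "t"))
            i += 5
        else:
            merged.append(pos_queue[i])
            i += 1
    return merged
```

A lone numeral keeps its m tag, so queues with plain numbers remain distinguishable from queues with timestamps.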

The named entity recognition model recognizes named referring expressions in the corpus to be processed. In the narrow sense, four types of named entities are identified: person names, place names, organization names, and other proper nouns. Recognition generally comprises two parts: (1) identifying the entity boundaries; (2) determining the entity category (person name, place name, organization name, or other). There are a variety of ways to identify named entities, such as rule-based methods, feature-template-based methods, and neural-network-based methods, and the named entity recognition model may be constructed with any of them.

For example, a named entity recognition model (CRF) annotates the entities in the sentence "I came to Taojia Village", and the correct annotation is: I/O came/O to/O Tao/B jia/M village/E (O means the current character is not part of a geographic named entity; B, M, and E mean the current character is the head, interior, and tail of a geographic named entity, respectively). When a linear-chain CRF is used for this task, (O, O, O, B, M, E) is one of the tagging sequences it can choose from.

Step F5, classifying the remaining corpus according to the labels obtained in step F4, counting the occurrence frequency of the part-of-speech queues of each category, and counting the occurrence frequency of verbs and of nouns within the part-of-speech queues of each category;

Step F6, sorting the part-of-speech queues of each category in descending order by the occurrence frequency of verbs and by the occurrence frequency of nouns, selecting from each of the two rankings the set of top-ranked part-of-speech queues according to a ranking threshold, extracting the corpus corresponding to the intersection of the two sets, and constructing a true training set;

Step F7, screening from the corpus of the real training set the word segmentation queues whose part-of-speech tags form the combination [n, v, n], and extracting the first and the second word whose part of speech is noun or proper noun as event I and event II respectively, to form an event tuple;
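A minimal sketch of step F7, under the assumption that the [n, v, n] combination means three consecutive tokens and that proper nouns carry the tag nz (both are illustrative choices, as are the function name and the (word, tag) data layout):

```python
def extract_event_tuples(tagged_sentences):
    """Return (event I, event II) pairs from sentences whose tag sequence contains n, v, n."""
    noun_tags = {"n", "nz"}  # noun and proper noun; the exact tag set is an assumption
    tuples = []
    for sentence in tagged_sentences:
        for i in range(len(sentence) - 2):
            (w1, t1), (_, t2), (w3, t3) = sentence[i:i + 3]
            if t1 in noun_tags and t2 == "v" and t3 in noun_tags:
                tuples.append((w1, w3))
    return tuples
```

Applied to the queue from the step F4 example, the tokens signal, occurrence, another channel would produce the event tuple (signal, another channel).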

Step F8, finding the event association rules of the event tuples using the Snowball algorithm, and finding the associated event groups among the event tuples according to those rules:

Step C1, using the existing fault event relation table, matching the queues in the event tuples that contain events from the table, and generating templates; a template is a five-tuple: <left>, event 1 type, <middle>, event 2 type, <right>; len is a length that can be set arbitrarily, <left> is the vector representation of the len words to the left of event 1, <middle> is the vector representation of the words between event 1 and event 2, and <right> is the vector representation of the len words to the right of event 2;

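Step C2, clustering the generated templates: templates whose similarity is greater than 0.7 are clustered into one class, a new template is generated from each class by the averaging method, and the new template is added to the rule base that stores the templates. Following the template format of step C1, template 1 is written as

P₁ = <l₁, E₁, m₁, E₂, r₁>

where l₁ is the vector representation of the 3 words to the left of E₁, m₁ is the vector representation of the words between E₁ and E₂, and r₁ is the vector representation of the 3 words to the right of E₂; and template 2 as

P₂ = <l₂, E₁′, m₂, E₂′, r₂>

If the condition E₁ = E₁′ and E₂ = E₂′ is satisfied, i.e., the event types of the two templates are the same, the similarity of template P₁ and template P₂ is calculated as

Match(P₁, P₂) = μ₁(l₁ · l₂) + μ₂(m₁ · m₂) + μ₃(r₁ · r₂)

where μ₁, μ₂, μ₃ are weights; because the middle vector has a great influence on the calculated similarity, μ₂ > μ₁ > μ₃ is set. If the condition is not satisfied, the similarity of template P₁ and template P₂ is recorded as 0;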

The averaging method averages the vectors of the templates in the same class to generate a new template; for details, see the article on the Snowball algorithm for relation extraction published on a programmers' resource site.
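The template comparison and the averaging method of step C2 can be sketched as below. The sketch assumes the weighted dot-product form commonly used by Snowball, context vectors that are already normalized, and example weights (0.3, 0.5, 0.2) chosen only so that the middle context dominates; none of these values come from the patent.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def template_similarity(p1, p2, weights=(0.3, 0.5, 0.2)):
    """Weighted dot-product similarity of two five-tuple templates; 0 if event types differ."""
    left1, e1_type, middle1, e2_type, right1 = p1
    left2, e1_type2, middle2, e2_type2, right2 = p2
    if e1_type != e1_type2 or e2_type != e2_type2:
        return 0.0
    mu1, mu2, mu3 = weights  # hypothetical weights for left, middle, right
    return mu1 * dot(left1, left2) + mu2 * dot(middle1, middle2) + mu3 * dot(right1, right2)

def average_template(templates):
    """Average the context vectors of templates in one class (same event types assumed)."""
    def mean(vectors):
        return [sum(col) / len(vectors) for col in zip(*vectors)]
    _, e1_type, _, e2_type, _ = templates[0]
    return (mean([t[0] for t in templates]), e1_type,
            mean([t[2] for t in templates]), e2_type,
            mean([t[4] for t in templates]))
```

Templates whose pairwise `template_similarity` exceeds 0.7 would be grouped into one class and replaced by their `average_template`.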

Step C3, calculating one by one the similarity between the templates of the event tuples obtained in step C1 and the templates in the rule base, discarding the templates whose similarity is below the threshold of 0.7, and adding the events in the templates whose similarity is above the threshold of 0.7 to the log key event relation table, which replaces the fault event relation table;

Step C4, repeating steps C1-C3 until step C3 no longer discards any template;

Step R2, marking the fault logs with each event relation generated in step C4 as a log key event label.

As shown in FIG. 1, the number of occurrences of each log key event label per minute is used as a monitoring index to establish the log KPI curves, and each log KPI curve is smoothed with a Gaussian kernel;
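Building and smoothing one log KPI curve can be sketched as follows. The kernel width `sigma`, the truncation `radius`, and the renormalization at the curve edges are assumptions; the patent only states that a Gaussian kernel is used.

```python
import math
from collections import Counter

def kpi_curve(event_minutes, start, end):
    """Count label occurrences per minute over [start, end); minutes are integer indices."""
    counts = Counter(event_minutes)
    return [counts.get(m, 0) for m in range(start, end)]

def gaussian_smooth(values, sigma=1.0, radius=3):
    """Smooth a series with a truncated Gaussian kernel, renormalized near the edges."""
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-(k * k) / (2 * sigma * sigma)) for k in offsets]
    smoothed = []
    for i in range(len(values)):
        total = 0.0
        weight = 0.0
        for offset, w in zip(offsets, kernel):
            j = i + offset
            if 0 <= j < len(values):
                total += w * values[j]
                weight += w
        smoothed.append(total / weight)
    return smoothed
```

Here `event_minutes` holds the minute index of every log line carrying a given key event label, so one call per label yields one KPI curve.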

Step A9, labeling according to the periodicity classification of the log KPI curves;

Periodicity verification is performed on the log KPI curve of each event relation, and the Gaussian-smoothed log KPI curves are labeled according to their differences in periodicity; these labels are called log KPI curve period labels;

The periodicity verification of step D1 includes the following steps:

Step Z03, setting a hypothetical period, i.e., a candidate period; correlation strength detection is performed on the candidate period if and only if its length lies within 95%-105% of the expected period, and the candidate period is accepted as a period meeting the requirement if its spectral strength is sufficient.
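The gate in step Z03 can be sketched as below. The patent does not specify how the strength is measured, so the normalized autocorrelation at the candidate lag and the `min_strength` threshold of 0.5 are stand-in assumptions for illustration only.

```python
def autocorrelation(series, lag):
    """Normalized autocorrelation of a series at a given lag (0.0 for a constant series)."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    denom = sum(c * c for c in centered)
    if denom == 0 or lag >= n:
        return 0.0
    return sum(centered[t] * centered[t + lag] for t in range(n - lag)) / denom

def is_valid_period(series, candidate, expected, min_strength=0.5):
    """Accept the candidate only if it is within 95-105% of the expected period and correlates strongly."""
    if not (0.95 * expected <= candidate <= 1.05 * expected):
        return False  # outside the admissible range: no strength check is performed
    return autocorrelation(series, candidate) >= min_strength
```

The range check comes first, mirroring the "if and only if" wording: a candidate outside the 95%-105% band is rejected without measuring its strength.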

Step A10, labeling according to the similarity classification of the log KPI curves;

The pairwise similarity between the log KPI curves is calculated with the NCC algorithm and filled into a similarity matrix that is symmetric about its diagonal; the row and column indices of the matrix are the serial numbers of the log KPI curves, the number of rows and columns equals the number of log KPI curves, and each value in the matrix is the similarity between the corresponding two log KPI curves;

A spectral clustering algorithm is applied to the similarity matrix to assign a cluster label to each log KPI curve label, giving the mapping relation (implicit service relation) of the log key event labels;

A classification method for spectral clustering is described at https://zhuanlan.zhihu.com/p/29849122.

Step A11, the KPI curves obtained in step A10 are preprocessed as in Example 4.
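The pairwise similarity matrix of step A10 can be sketched as follows, assuming equal-length curves and a zero-lag NCC of the form Σxy / √(Σx² · Σy²); that form is an assumption of this example. The spectral clustering itself is left out.

```python
import math

def ncc(x, y):
    """Zero-lag normalized cross-correlation of two equal-length series, in [-1, 1]."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den if den else 0.0

def similarity_matrix(curves):
    """Square matrix whose (r, c) entry is the NCC between curve r and curve c."""
    n = len(curves)
    matrix = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(r, n):
            # Fill both halves at once: the matrix is symmetric.
            matrix[r][c] = matrix[c][r] = ncc(curves[r], curves[c])
    return matrix
```

The resulting matrix can be passed as a precomputed affinity to a spectral clustering routine, after mapping negative similarities to a non-negative range if the routine requires it.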

Example 2

The method for marking waveband characteristics of the log KPI curve obtained based on the embodiment 1 comprises the following steps:

Step A1, extracting the data points of all log KPI curves for each minute into one curve set L, and dividing the curve set L into a plurality of log KPI curve data sets M_i with a time width of s minutes, where i is the segment number;

Step A2, using the DBSCAN algorithm to calculate the Euclidean distances between the segments according to the attributes of each segment of log KPI curve data set, and clustering the i segments of log KPI curve data sets to obtain k clusters and outliers, wherein each cluster is a grouped data set, and each grouped data set has j segments of log KPI curve data sets F_j;

Step A3, calculating the arithmetic mean ΣF_j / j of the j segments of log KPI curve data sets in each grouped data set as the fundamental wave of the group;

Step A4, using the NCC algorithm to calculate the waveform similarity between each segment of log KPI curve data set F_j of each grouped data set and the fundamental wave, sorting the similarities from large to small, and, among the log KPI curve data sets F_j whose waveform similarity ranks in the top 95%, taking the minimum waveform similarity as the grouping boundary line B_k of the group;

Step A5, using the NCC algorithm to calculate the waveform similarity NCC_{M_i-J_k} between each segment of log KPI curve data set M_i and the fundamental wave of each group, and judging whether each segment belongs to a group by taking the grouping boundary line of that group as the reference; a segment that belongs to several groups simultaneously is ranked by the classification score Q and assigned to the group with the smallest Q, giving the grouping information of each log KPI curve data set,

Q = ((1 - NCC_{M_i-J_k}) / (1 - B_k))²;

The larger NCC_{M_i-J_k} is, the smaller Q is, and the more similar M_i is to cluster k. When the current log KPI curve data set M_i has the same similarity NCC_{M_i-J_k} to different clusters, a smaller B_k means that this similarity ranks higher in that cluster's waveform similarity ranking. The formula thus measures how well M_i fits each candidate cluster, and hence which cluster it most likely belongs to.
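Steps A3 to A5 can be sketched together. The zero-lag NCC form, the rounding of "top 95%" with a ceiling, and the handling of the degenerate case B_k = 1 are assumptions of this example; the function names are hypothetical.

```python
import math

def ncc(x, y):
    """Zero-lag normalized cross-correlation of two equal-length series."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den if den else 0.0

def fundamental_wave(segments):
    """Element-wise arithmetic mean of the segments in one group (step A3)."""
    return [sum(col) / len(segments) for col in zip(*segments)]

def group_boundary(segments, wave, keep=0.95):
    """Minimum NCC among the top `keep` fraction of segments ranked by similarity (step A4)."""
    scores = sorted((ncc(s, wave) for s in segments), reverse=True)
    top = scores[:max(1, math.ceil(keep * len(scores)))]
    return top[-1]

def assign_group(segment, waves, boundaries):
    """Index of the group with the smallest Q among groups the segment belongs to (step A5)."""
    best, best_q = None, None
    for k, (wave, b_k) in enumerate(zip(waves, boundaries)):
        score = ncc(segment, wave)
        if score < b_k:
            continue  # below this group's boundary line: not a member
        q = ((1 - score) / (1 - b_k)) ** 2 if b_k < 1 else 0.0
        if best_q is None or q < best_q:
            best, best_q = k, q
    return best  # None means the segment fits no group
```

Because Q divides by 1 - B_k, a segment that clears a loose boundary by a wide margin can score better than one that barely clears a tight boundary, even when its raw NCC is lower.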

Step A6, extracting the timestamps of the segments of log KPI curve data sets assigned to each group to obtain a timestamp list for each group;

Step A7, differencing the timestamp list of each group item by item, i.e., subtracting the start timestamp of the current item from the start timestamp of the next item in each list, to obtain a list of event trigger intervals;

The event trigger interval is the time interval between two adjacent log KPI curve data sets in a grouped data set;
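Step A7 amounts to a first difference over each group's start timestamps. A minimal sketch, assuming timestamps are numbers (for example minutes) and sorting them first in case the list is not already in time order:

```python
def trigger_intervals(start_timestamps):
    """Differences between consecutive start timestamps of one group's segments, in time order."""
    ordered = sorted(start_timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]
```

A group with a single segment yields an empty interval list, since there is no neighbouring segment to measure against.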

Step A8, merging the event trigger intervals of each cluster into a time interval KPI set, and calculating the similarity between the time interval KPI sets of the clusters with NCC; if the time interval KPI sets of different clusters are similar, the waveforms of those clusters are similar over the total time width;

Step A9, expanding the similarities of the time interval KPI sets between clusters obtained in step A8 into a similarity matrix; as shown in Table 1, a to d are the serial numbers of the clusters, the number of rows and columns of the similarity matrix equals the number of clusters, each value is the similarity of the time interval KPI sets between two clusters, and the similarity matrix is symmetric about its diagonal;

Step A10, sorting the similarities of the time interval KPI sets between clusters by value, fitting the sorted values to a smooth curve, and obtaining the boundary (inflection point) of the similarities by the knee point method;

Step A11, replacing the similarity values in the similarity matrix that are greater than the inflection point with 1 and those less than the inflection point with 0, as shown in Table 2;

Step A12, marking adjacent clusters whose similarity is 1 in the similarity matrix obtained in step A11 as the same similarity group, and counting the number of clusters in each similarity group;

Step A13, taking the total time interval of the similarity group containing the most clusters as the width of the sliding window;

The total time interval is set as the width of a sliding window, and the window divides the log KPI curve into a plurality of segments, the time width of each segment covering the similarity group with the maximum time length obtained in substep S12. Scanning the log KPI curve with the sliding window quickly places consecutively occurring clusters into one window and then quickly clusters them into the same waveform category, which reduces the amount of calculation, allows the bands of the log KPI curve to be classified as a whole, and reduces the chance of missing knowledge.
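Steps A11 to A13 can be sketched as follows. The knee value is taken as an input rather than computed, "adjacent clusters" is read as clusters with consecutive indices, and the per-cluster time spans passed to `window_width` are a hypothetical input; these readings are assumptions, since the text leaves them open.

```python
def binarize(matrix, knee):
    """Replace similarities above the knee value with 1 and the rest with 0 (step A11)."""
    return [[1 if value > knee else 0 for value in row] for row in matrix]

def similar_groups(binary):
    """Chain neighbouring clusters (index i - 1 and i) whose binary similarity is 1 (step A12)."""
    if not binary:
        return []
    groups = [[0]]
    for i in range(1, len(binary)):
        if binary[i - 1][i] == 1:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups

def window_width(groups, cluster_spans):
    """Total time span of the group with the most clusters (step A13)."""
    largest = max(groups, key=len)
    return sum(cluster_spans[i] for i in largest)
```

With four clusters a to d, a matrix in which a-b and b-c exceed the knee but c-d does not gives the groups [a, b, c] and [d], and the window width is the combined span of a, b, and c.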

The above-mentioned NCC (normalized cross correlation) algorithm is defined as:

In the formula, x_t is the background waveform and y_{t+h} is the waveform after transformation. The value of NCC lies between -1 and 1, where -1 means the waveforms before and after transformation are opposite, 0 means the two waveforms are orthogonal, and 1 means they are identical. NCC describes only the macroscopic similarity of the two waveforms and is independent of waveform amplitude and energy attenuation.
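A minimal sketch of an NCC with a lag h, assuming the common form Σ x_t · y_{t+h} / √(Σ x_t² · Σ y_{t+h}²) evaluated over the overlapping samples. This form is an assumption chosen because it matches the stated properties (range -1 to 1, 0 for orthogonal waveforms, independence from amplitude).

```python
import math

def ncc(x, y, h=0):
    """Normalized cross-correlation between x[t] and y[t + h] over their overlap."""
    pairs = [(x[t], y[t + h]) for t in range(len(x)) if 0 <= t + h < len(y)]
    num = sum(a * b for a, b in pairs)
    den = math.sqrt(sum(a * a for a, _ in pairs) * sum(b * b for _, b in pairs))
    return num / den if den else 0.0
```

Scaling either waveform by a positive constant leaves the result unchanged, which is the amplitude independence described above.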

Step A14, first, according to the sliding window obtained in step A13, each log KPI curve obtained in step F10 is divided into a plurality of log KPI curve window segments whose time width is the total time interval, and, following the dividing method of step A1, each window segment is divided into i segments of log KPI curve data sets with a time width of 1 minute;

Each segment is a band;

Step B, using the NCC algorithm to calculate one by one the similarity between the fundamental waves obtained in step A2 and the bands in each window of each log KPI curve, and sorting the similarities from large to small; among the bands whose waveform similarity ranks in the top 95%, the minimum waveform similarity is taken as the grouping boundary line B'_k of the group. Taking the grouping boundary line of each group as the reference, it is judged whether each segment of log KPI curve data set belongs to the group, and a segment that belongs to several groups simultaneously is assigned to the group with the smallest classification score. This forms the fundamental-wave labels shown in FIG. 2; the resulting label chain gives the pattern waveforms of the different KPIs, which are called the KPI curve code pattern rearrangement table;

The label information obtained after step A14 covers all bands and consists of two parts, band labels and waveform labels: the band label carries the fundamental wave type, and the waveform label has two types, the service label and the period label.

In this way, each slide of the window along a log KPI curve yields one band chain; all band chains are equal in length and differ only in the order of their band labels. This embodiment thus converts the curve characteristics of the log KPI curves of related monitoring indexes into the ordering characteristics of label chains. Because the curves are related, their periods and rhythms, i.e., their label arrangements, are similar even though their amplitudes differ, so a large number of related KPI curves can be unified into standard, consistent label chains.

Step A15, placing the different KPI curve code pattern rearrangement tables along a unified time dimension to obtain a KPI curve code pattern rearrangement association table.

If different log KPI curves carry the same log KPI curve service label, they may be causally related, and the probability is higher for aperiodic log KPI curves than for periodic ones.

If the same fundamental-wave label of the log KPI curve pattern appears in adjacent time segments of different log KPI curves, the curves may be causally related, and the more often this repeats, the higher the probability.

After all label chains are arranged along the time dimension, a sequence mining algorithm such as SPADE or GSP can be used to discover causal relationships between label chains occurring at different times: if two events always occur in pairs, they are considered related, and if one always occurs before the other, they are considered causally related, the earlier being the cause and the later the effect. This helps experts supplement the knowledge system used for fault determination and discover previously unknown associations between monitoring indexes, so that new early-warning control relations and regulation thresholds can be established from the newly discovered associations during operation, improving the stability of every monitored object in the same system.
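The "always occurs before" criterion can be illustrated with a simple precedence count over windows of labels. This is a deliberately simplified stand-in, not SPADE or GSP: it only counts, for every ordered pair of labels, the windows in which the first label appears before the second. The function names and the `min_support` parameter are hypothetical.

```python
from collections import Counter
from itertools import combinations

def precedence_counts(windows):
    """Count, per ordered label pair (a, b), the windows in which a first occurs before b."""
    counts = Counter()
    for window in windows:
        seen = list(dict.fromkeys(window))  # labels in order of first occurrence
        for a, b in combinations(seen, 2):
            counts[(a, b)] += 1
    return counts

def candidate_causes(windows, min_support=1.0):
    """Ordered pairs whose 'a before b' pattern holds in at least min_support of the windows."""
    counts = precedence_counts(windows)
    return sorted(pair for pair, c in counts.items() if c / len(windows) >= min_support)
```

With `min_support=1.0` only pairs that co-occur in every window, always in the same order, are reported; lowering it admits weaker candidates for expert review.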

Claims

1. A method for generating KPI curves based on log event relations comprises the following steps:

Step F2, performing word segmentation on the corpus remaining after step F1 to generate a word segmentation queue consisting of a plurality of feature words, and labeling the part of speech of each feature word to obtain the part-of-speech queue of the corpus;

Step F4, classifying the remaining corpus according to the labels of step F3, counting the occurrence frequency of the part-of-speech queues of each category, sorting them in descending order, and selecting the part-of-speech queues whose rank is above a second threshold; counting the occurrence frequency of verbs and of nouns in the part-of-speech queues of each category and sorting in descending order; selecting from the verb ranking and the noun ranking the two sets of top-ranked part-of-speech queues according to the ranking threshold, extracting the corpus corresponding to the intersection of the two sets, and constructing a true training set;

Step F5, screening from the corpus of the real training set the word segmentation queues whose part-of-speech tags form the combination [n, v, n], wherein n represents the noun part of speech and v represents the verb part of speech, and extracting the first and the second word whose part of speech is noun or proper noun as event I and event II respectively, to form an event tuple;

Step F6, based on the existing fault event relation table, using the Snowball algorithm to find the event association rules of the event tuples, and finding the associated event groups among the event tuples according to those rules, thereby generating a log key event relation table;

Step F7, repeating step F6 based on the log key event relation table until convergence;

Step F8, marking the fault logs with each event relation generated in step F7 as a log key event label, taking the number of occurrences of each log key event label per minute as a monitoring index, establishing the log KPI curves, and smoothing each log KPI curve with a Gaussian kernel;

Step A7, marking adjacent clusters whose value in the similarity matrix is greater than the inflection point as the same similarity group, and counting the number of clusters in each similarity group;

Each segment is a band;

2. The method according to claim 1, wherein the calculating of the similarity in step F1 includes the steps of: respectively segmenting the sentences in the sentence pairs based on a pre-constructed corpus, wherein the pre-constructed corpus comprises an industry corpus and a common corpus;

3. The method of claim 2, further comprising, after steps F8 and A1:

4. The method of claim 3, wherein step Z03 is further followed by:

the similarity matrix is filled with the similarities, the serial numbers of rows and columns in the matrix are the numbers of the log KPI curves, and the number of rows and columns of the similarity matrix is the number of the log KPI curves;

5. The method according to claim 1, wherein step F6 includes:

Step C1, using the existing fault event relation table, matching the queues in the event tuples that contain events from the table, and generating templates; a template is a five-tuple: <left>, event 1 type, <middle>, event 2 type, <right>; len is an arbitrarily set length, <left> is the vector representation of the len words to the left of event 1, <middle> is the vector representation of the words between event 1 and event 2, and <right> is the vector representation of the len words to the right of event 2;

Step C2, clustering the generated templates: templates whose similarity is greater than a third threshold are clustered into one class, a new template is generated by the averaging method, and the new template is added to the rule base that stores the templates; following the template format of step C1, template 1 is recorded as

P₁ = <l₁, E₁, m₁, E₂, r₁>

where l₁ represents the vector representation of the 3 words to the left of E₁, m₁ represents the vector representation of the words between E₁ and E₂, and r₁ represents the vector representation of the 3 words to the right of E₂;

and template 2 as:

P₂ = <l₂, E₁′, m₂, E₂′, r₂>

if the condition E₁ = E₁′ and E₂ = E₂′ is satisfied, i.e., event 1 type E₁ of template P₁ and event 1 type E₁′ of template P₂ are the same, and likewise the event 2 types, the similarity of template P₁ and template P₂ is calculated as

Match(P₁, P₂) = μ₁(l₁ · l₂) + μ₂(m₁ · m₂) + μ₃(r₁ · r₂)

where μ₁, μ₂, μ₃ are weights; because the middle vector has a great influence on the calculated similarity between a pair of templates, μ₂ > μ₁ > μ₃ is set; if the condition is not satisfied, the similarity of template P₁ and template P₂ is recorded as 0;

6. The method of claim 1, wherein step a1 comprises the steps of:

Step J1, extracting the data points of all log KPI curves for each minute into one curve set L, and dividing the curve set L into a plurality of log KPI curve data sets M_i with a time width of s minutes, where i is the segment number;

Step J2, using the DBSCAN algorithm to calculate the Euclidean distances between the segments according to the attributes of each segment of log KPI curve data set, and clustering the i segments of log KPI curve data sets to obtain k clusters and outliers, wherein each cluster is a grouped data set, and each grouped data set has j segments of log KPI curve data sets F_j;

Step J3, calculating the arithmetic mean ΣF_j / j of the j segments of log KPI curve data sets in each grouped data set as the fundamental wave of the group;

Step J4, using the NCC algorithm to calculate the waveform similarity between each segment of log KPI curve data set F_j of each grouped data set and the fundamental wave, sorting the similarities from large to small, and, among the log KPI curve data sets F_j whose waveform similarity ranks in the top 95%, taking the minimum waveform similarity as the grouping boundary line B_k of the group;

Step J5, using the NCC algorithm to calculate the waveform similarity NCC_{M_i-J_k} between each segment of log KPI curve data set M_i and the fundamental wave of each group, judging whether each segment of log KPI curve data set belongs to a group by taking the grouping boundary line of that group as the reference, ranking a segment that belongs to several groups simultaneously by the classification score Q, and assigning the log KPI curve data set M_i to the group with the minimum classification score Q, to obtain the grouping information of each log KPI curve data set,

Q = ((1 - NCC_{M_i-J_k}) / (1 - B_k))².

7. The method of claim 1, wherein step A7 is replaced with: replacing the similarity values in the similarity matrix that are greater than the inflection point with 1, and replacing the similarity values that are lower than the inflection point with 0;

8. The method of claim 1, wherein the step of dividing the KPI curve window segments into bands in step A9 is: Step B, using the NCC algorithm to calculate one by one the similarity between the fundamental waves obtained in step A1 and the bands in each window of each log KPI curve, to obtain

Score according to classification

。

9. The method of claim 1, wherein after all label chains are arranged along the time dimension, causal relationships between different label chains occurring at different times are discovered based on the sequence mining algorithm SPADE or GSP.