CN114398898A - Method for generating KPI curve and marking wave band characteristics based on log event relation - Google Patents

Publication number
CN114398898A
Authority
CN
China
Prior art keywords
log
kpi
similarity
event
curve
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210292597.6A
Other languages
Chinese (zh)
Other versions
CN114398898B (en)
Inventor
戴曦
尹立超
徐旭朝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Three Gorges Zhikong Technology Co ltd
Original Assignee
Three Gorges Zhikong Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Three Gorges Zhikong Technology Co ltd filed Critical Three Gorges Zhikong Technology Co ltd
Priority to CN202210292597.6A priority Critical patent/CN114398898B/en
Publication of CN114398898A publication Critical patent/CN114398898A/en
Application granted granted Critical
Publication of CN114398898B publication Critical patent/CN114398898B/en
Priority to PCT/CN2023/082359 priority patent/WO2023174431A1/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/194Calculation of difference between files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for generating a KPI curve based on a log event relation and marking waveband features, which comprises the steps of firstly generating a log KPI curve according to the relation of events in a log, then dividing the KPI curve into a plurality of wavebands with equal lengths, clustering the wavebands into a plurality of clusters according to the non-time dimension of the wavebands, extracting the fundamental wave of each cluster, comparing the similarity of each waveband data of each cluster and the fundamental wave, finding out the grouping boundary line of each cluster, grouping each waveband data of each cluster, extracting the total time length of continuous similar wavebands in each cluster, and taking the maximum value of the total time length as the width of a sliding window. The window is used for dividing the KPI curve, so that the divided wave bands in each window are easy to cluster and classify, the whole KPI curve is favorably and rapidly divided into wave band chains consisting of different types of wave bands, then the KPI curve of an individual monitoring index is subjected to periodic detection and type detection marking, the individual KPI curve is divided by using the window, and the wave bands in the fundamental wave KPI curve are subjected to grouping and labeling.

Description

Method for generating KPI curve and marking wave band characteristics based on log event relation
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a method for generating a KPI curve and marking wave band characteristics based on log event relations.
Background
Anomaly detection (also known as outlier detection) is the process of finding objects whose behavior differs from what is expected; such objects are called anomalies or outliers. Anomaly detection approaches generally include statistical models, distance-based models, linear-transformation models, nonlinear-transformation models, machine-learning models, and the like.
KPIs (key performance indicators) are monitoring metrics for objects such as services and systems (e.g., delay and throughput in a network). They are stored as sequences ordered by time of occurrence, i.e., what is generally called a time series. Time-series anomaly detection uses analysis of historical data to check whether the current data deviates significantly from the normal condition. KPI anomaly detection is very important: by monitoring KPI data in real time, anomalies are discovered and handled promptly, which ensures that the application operates normally.
Methods for performing real-time anomaly detection by setting a threshold value for KPI data are quite common, but methods for performing real-time anomaly detection for system logs have not been reported publicly.
To pursue effectiveness, traditional machine learning mostly adopts supervised learning, which improves the accuracy of model output through massive labeled samples. In practice, however, anomaly labels are difficult to obtain in batches: a large number of business experts must label KPI curves manually, often with repeated adjustment and correction, which is time-consuming and labor-intensive. Moreover, millions or tens of millions of KPIs may need to be monitored at the same time, so in actual anomaly detection practice no single algorithm can be found that meets all of these requirements and solves all of these challenges at once. Unsupervised learning, typically clustering, is mainly used in scenarios such as feature discovery and data exploration; because labels are lacking, its results can be mapped to business patterns only after interpretation by a data scientist and cannot be acted on directly. Weakly supervised learning introduces unsupervised and supervised methods in stages and improves accuracy through iterative loops; it is too academic to put into practice, and, in order to fuse the specific methods, vector representations must be adopted to unify the representation across methods, which makes the results difficult for application personnel to understand.
The larger the data volume and the more complex the business scenario, the more complex the methods that must be introduced and the greater and more varied the investment and manpower required. This cycle directly limits the adoption of machine learning across industries and concentrates it in industries with higher returns. Traditional industries can only give up active effort and adopt a passive defense, following the average level of the industry as a whole and relying on the migration of business scenarios; specifically, if a method proves particularly effective in another industry, spare capacity is used to borrow it and observe the effect, and it is adopted if feasible. Industrial applications are one such passive-defense scenario.
Disclosure of Invention
The first purpose of the invention is to provide a method for generating KPI curves and marking wave band characteristics based on log event relations, which processes text logs generated by monitoring indexes in an industrial control system, combines highly correlated events into a same group, and generates log KPI curves periodically correlated with the KPI curves of the monitored indexes.
The technical scheme of the invention is as follows: a method for generating KPI curves based on log event relations comprises the following steps:
step F1, setting a training sentence subset consisting of training sentences; the industrial control equipment in the same industrial control system obtains fault logs based on monitoring indexes; forming sentence pairs to be processed from the corpora in the fault logs and each training sentence, calculating the similarity, and deleting the corpora whose similarity is lower than threshold one;
step F2, performing word segmentation on the corpora remaining after step F1, generating a word segmentation queue consisting of a plurality of feature words, and labeling the parts of speech of the feature words to obtain the part-of-speech queue of each corpus;
step F3, if the part-of-speech queue contains special feature words corresponding to special parts of speech, obtaining the boundaries and categories of named entities from those special feature words by using a named entity recognition model, and updating the parts of speech of the special feature words in the part-of-speech queue to the boundaries and categories of the named entities to obtain an updated part-of-speech queue, wherein the special parts of speech include numerals and time words;
step F4, classifying the remaining corpora according to their labels from step F3; for each category, counting the occurrence frequency of each part-of-speech queue, sorting in descending order, and selecting the part-of-speech queues ranked above threshold two; then, for the part-of-speech queues of each category, counting the occurrence frequencies of verbs and of nouns and sorting each in descending order, selecting from the two orderings the two top-ranked part-of-speech queue sets according to a ranking threshold, extracting the corpora corresponding to the intersection of the two sets, and constructing the real training set;
step F5, screening from the corpora of the real training set the word segmentation queues whose part-of-speech tag combination is [n, v, n], where n denotes a noun and v denotes a verb, and extracting the first and second segmented words whose part of speech is noun or proper noun as event 1 and event 2 respectively, forming an event tuple;
step F6, based on the existing fault event relationship table, using the Snowball algorithm to find the event association rules of the event tuples, and finding the associated event groups in the event tuples according to the event association rules, thereby generating a log key event relationship table;
step F7, repeating step F6 based on the log key event relationship table until convergence;
step F8, using each event relationship generated in step F7 as a log key event label to mark the fault logs, using the number of occurrences of each log key event label per minute as a monitoring index to establish each log KPI curve, and smoothing each log KPI curve with a Gaussian kernel.
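As an illustration of step F8, the following minimal Python sketch counts how often one log key event label occurs per minute and smooths the resulting curve with a Gaussian kernel. The function names, the kernel parameters `sigma` and `radius`, the edge renormalisation and the sample timestamps are assumptions made for the example; the source does not specify them.

```python
import math
from collections import Counter
from datetime import datetime


def per_minute_counts(timestamps, start, n_minutes):
    """Count how many labelled log entries fall into each one-minute bucket."""
    buckets = Counter(int((ts - start).total_seconds() // 60) for ts in timestamps)
    return [buckets.get(minute, 0) for minute in range(n_minutes)]


def gaussian_smooth(series, sigma=1.0, radius=3):
    """Smooth a series with a truncated Gaussian kernel (assumed parameters)."""
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-(k * k) / (2 * sigma * sigma)) for k in offsets]
    smoothed = []
    for i in range(len(series)):
        num = den = 0.0
        for k, w in zip(offsets, kernel):
            j = i + k
            if 0 <= j < len(series):  # renormalise the kernel at the edges
                num += w * series[j]
                den += w
        smoothed.append(num / den)
    return smoothed


# Hypothetical timestamps of log lines carrying the same key event label.
start = datetime(2022, 1, 1, 0, 0)
stamps = [
    datetime(2022, 1, 1, 0, 0, 10),
    datetime(2022, 1, 1, 0, 0, 50),
    datetime(2022, 1, 1, 0, 2, 5),
]
counts = per_minute_counts(stamps, start, 4)   # raw log KPI curve
curve = gaussian_smooth(counts, sigma=1.0, radius=2)
```

Each smoothed value is a weighted average of its neighbours, so the curve stays within the range of the raw counts.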
Advantageously, the same industrial control system is composed of industrial control devices that have a direct or indirect material supply relationship, electric energy transfer relationship, thermal energy transfer relationship, mechanical energy transfer relationship, magnetic field transfer relationship, energy conversion relationship or signal control relationship. The industrial control devices in the same industrial control system obtain fault logs based on monitoring indexes, and because the monitoring indexes are correlated, the fault logs are correlated as well. Each record of a monitoring index in the logs differs partially in text, so direct clustering would require a large amount of manual indexing and screening. Log texts that describe the behavior or state of devices have similar sentence structures and similar part-of-speech queue characteristics; steps F1-F4 screen out the texts with similar part-of-speech queues and eliminate log texts that do not record the behavior or state of devices. Nouns in these texts often have a specific logical association with one another, and according to this association highly related event relationships can be combined into the same group to generate a log KPI curve that is periodically correlated with the KPI curve of the monitored index.
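The noun-verb-noun screening of step F5 can be sketched as follows. This is a minimal illustration, assuming the corpus has already been segmented and tagged into `(word, tag)` pairs, that any tag starting with `n` counts as a noun or proper noun, and that the pattern is matched on contiguous tokens; the function name and the sample tokens are hypothetical.

```python
def extract_event_tuples(tagged_tokens):
    """Return (event_1, event_2) for every contiguous noun-verb-noun run."""
    tuples = []
    triples = zip(tagged_tokens, tagged_tokens[1:], tagged_tokens[2:])
    for (w1, p1), (_, p2), (w3, p3) in triples:
        # Assumption: tags beginning with "n" are nouns or proper nouns.
        if p1.startswith("n") and p2 == "v" and p3.startswith("n"):
            tuples.append((w1, w3))
    return tuples


# Hypothetical tagged log sentence.
tagged = [("pump", "n"), ("triggers", "v"), ("alarm", "n"), ("immediately", "d")]
events = extract_event_tuples(tagged)
```

Here `events` holds one tuple pairing the first noun (event 1) with the second noun (event 2).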
Further, the calculating of the similarity in step F1 includes the steps of: respectively segmenting the sentences in the sentence pairs based on a pre-constructed corpus, wherein the pre-constructed corpus comprises an industry corpus and a common corpus;
and converting each characteristic word of the sentence after word segmentation into a word vector, respectively calculating the similarity of each sentence pair by using cosine similarity, and deleting the corpus if the similarity is lower than a threshold value one.
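The similarity filter of step F1 can be illustrated with a short Python sketch. It assumes that a sentence vector is the mean of its word vectors and that a corpus is kept when its best cosine similarity to any training sentence reaches the threshold; both choices, the function names and the toy vectors are assumptions, and the value 0.9 is taken from Example 1.

```python
import math


def mean_vector(word_vectors):
    """Average the word vectors of one sentence (assumed sentence embedding)."""
    dim = len(word_vectors[0])
    return [sum(v[d] for v in word_vectors) / len(word_vectors) for d in range(dim)]


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filter_corpora(corpora, training, threshold=0.9):
    """Keep corpora whose best similarity to a training sentence >= threshold."""
    train_vecs = [mean_vector(t) for t in training]
    kept = []
    for name, vectors in corpora:
        sentence_vec = mean_vector(vectors)
        if max(cosine(sentence_vec, t) for t in train_vecs) >= threshold:
            kept.append(name)
    return kept


# Toy 2-dimensional word vectors, purely illustrative.
training = [[[1.0, 0.0], [1.0, 0.2]]]
corpora = [
    ("log_a", [[0.9, 0.1], [1.0, 0.0]]),
    ("log_b", [[0.0, 1.0], [0.1, 0.9]]),
]
kept = filter_corpora(corpora, training)
```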
Further, step F8 is followed by:
step Z01, extracting a spectrum intensity graph of the log KPI curve by using the Fourier transform;
step Z02, extracting the point with the highest amplitude and calculating the corresponding period, namely the period to be checked;
step Z03, setting an assumed period, namely the expected period; if and only if the length of the period to be checked is within 95-105% of the expected period, checking the strength of the period to be checked; if the spectrum intensity is sufficient, determining the period to be checked to be a period meeting the requirement; and marking the filtered log KPI curves according to their differences in periodicity, the mark being called the log KPI curve period label.
The period check marks each waveform as periodic or aperiodic. A periodic mark indicates that regular, repeated events exist; in terms of business knowledge, such information usually corresponds to services such as status checks and rotating components. An aperiodic mark, by contrast, indicates event-driven business. Both are business labels used in other steps and are independent of the other operations. Similarity between periodic KPIs probably arises for incidental reasons without any business relationship, whereas similarity between aperiodic KPIs more likely reflects a direct or indirect relationship.
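The period check can be sketched in plain Python with a direct discrete Fourier transform. The sketch finds the strongest non-zero frequency, converts it to a period, and labels the curve periodic only if that period lies within 95-105% of the expected period and the amplitude is large enough. The function names and the `min_amplitude` threshold are assumptions; the source only requires that the spectrum intensity be "sufficient".

```python
import cmath
import math


def dominant_period(series):
    """Return (period in samples, amplitude) of the strongest non-zero frequency."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]  # remove the zero-frequency component
    best_k, best_amp = None, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        amp = abs(coeff) / n
        if amp > best_amp:
            best_k, best_amp = k, amp
    if best_k is None:
        return None, 0.0
    return n / best_k, best_amp


def period_label(series, expected_period, min_amplitude):
    """Label a curve 'periodic' or 'aperiodic' (min_amplitude is hypothetical)."""
    period, amp = dominant_period(series)
    if period is None:
        return "aperiodic"
    in_range = 0.95 * expected_period <= period <= 1.05 * expected_period
    return "periodic" if in_range and amp >= min_amplitude else "aperiodic"


# Synthetic curve with a 12-sample period, for illustration only.
signal = [math.sin(2 * math.pi * t / 12) for t in range(120)]
period, amplitude = dominant_period(signal)
label = period_label(signal, expected_period=12, min_amplitude=0.4)
```

A direct transform costs O(n²); for long curves an FFT routine would be the usual choice.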
Further, step Z03 is followed by:
step Z04, filling a similarity matrix with the similarities between the log KPI curves, wherein the row and column indices of the matrix are the numbers of the log KPI curves, and the number of rows and columns equals the number of log KPI curves;
step Z05, outputting different clusters from the similarity matrix by using a spectral clustering algorithm, and marking the different clusters with different log KPI curve labels, called KPI curve service labels.
Advantageously, the KPI curves are clustered and classified according to overall similarity of the KPI curves to form clusters with similar waveforms.
Further, step F6 includes:
step C1, using the existing fault event relationship table, matching the queues in the event tuples that contain events in the fault event relationship table, and generating templates; a template is a five-tuple <left>, event 1 type, <middle>, event 2 type, <right>, where len is a length that can be set arbitrarily, <left> is the vector representation of the len words to the left of event 1, <middle> is the vector representation of the words between event 1 and event 2, and <right> is the vector representation of the len words to the right of event 2;
step C2, clustering the generated templates, grouping templates whose similarity is greater than threshold three into one class, generating a new template by averaging, and adding the new template to the rule base that stores templates. Following step C1, the template format can be written as
$$P = \langle L, E_1, M, E_2, R \rangle,$$
where $E_1$ and $E_2$ denote the event 1 type and event 2 type of template $P$, $L$ is the vector representation of the three words to the left of $E_1$, $M$ is the vector representation of the words between $E_1$ and $E_2$, and $R$ is the vector representation of the three words to the right of $E_2$. For the similarity between template 1, $P_1 = \langle L, E_1, M, E_2, R \rangle$, and template 2, $P_2 = \langle L', E_1', M', E_2', R' \rangle$: if the condition
$$E_1 = E_1' \ \text{and} \ E_2 = E_2'$$
is satisfied, i.e. the event 1 type $E_1$ of template $P_1$ is the same as the event 1 type $E_1'$ of template $P_2$ and the event 2 type $E_2$ of template $P_1$ is the same as the event 2 type $E_2'$ of template $P_2$, then the similarity between $P_1$ and $P_2$ can be calculated as
$$\mathrm{Sim}(P_1, P_2) = \mu_1\, L \cdot L' + \mu_2\, M \cdot M' + \mu_3\, R \cdot R',$$
where $\mu_1$, $\mu_2$, $\mu_3$ are weights; because $M \cdot M'$ has a large influence on the calculated similarity between templates, $\mu_2$ can be set larger than $\mu_1$ and $\mu_3$. If the condition is not satisfied, the similarity between template $P_1$ and template $P_2$ is 0;
step C3, calculating, one by one, the similarity between the event tuple templates obtained in step C1 and the templates in the rule base; discarding the templates whose similarity is smaller than threshold three, and adding the events in the templates whose similarity is greater than threshold three to the log key event relationship table, which replaces the fault event relationship table.
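The template matching of steps C1-C3 can be sketched as below: two five-tuple templates are compared only when their event types agree, the similarity is a weighted sum of the dot products of the left, middle and right context vectors, and a cluster of templates is merged by averaging. The class and function names, the weight values (chosen so that the middle weight is the largest) and the toy vectors are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Template:
    left: Sequence[float]     # vector of the words left of event 1
    event1_type: str
    middle: Sequence[float]   # vector of the words between the two events
    event2_type: str
    right: Sequence[float]    # vector of the words right of event 2


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def template_similarity(p1, p2, mu=(0.2, 0.6, 0.2)):
    """Weighted context similarity; 0.0 when the event types differ."""
    if p1.event1_type != p2.event1_type or p1.event2_type != p2.event2_type:
        return 0.0
    mu1, mu2, mu3 = mu  # hypothetical weights, middle context weighted most
    return (mu1 * dot(p1.left, p2.left)
            + mu2 * dot(p1.middle, p2.middle)
            + mu3 * dot(p1.right, p2.right))


def average_templates(templates):
    """Merge same-typed templates of one cluster by averaging each vector."""
    n = len(templates)

    def avg(vectors):
        return [sum(column) / n for column in zip(*vectors)]

    first = templates[0]
    return Template(avg([t.left for t in templates]), first.event1_type,
                    avg([t.middle for t in templates]), first.event2_type,
                    avg([t.right for t in templates]))


# Toy templates with 2-dimensional context vectors.
p1 = Template([1.0, 0.0], "DEVICE", [0.0, 1.0], "ALARM", [1.0, 0.0])
p2 = Template([1.0, 0.0], "DEVICE", [0.0, 1.0], "ALARM", [0.0, 1.0])
p3 = Template([1.0, 0.0], "DEVICE", [0.0, 1.0], "VALVE", [1.0, 0.0])
sim_same = template_similarity(p1, p2)
sim_diff = template_similarity(p1, p3)
merged = average_templates([p1, p2])
```

With unit-length context vectors the score stays in a comparable range across templates, which makes a single threshold three meaningful.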
The invention also aims to provide a method for marking waveband characteristics by the KPI curve, which comprises the steps of dividing the KPI curve into a plurality of wavebands with equal length, clustering into a plurality of clusters according to the non-time dimension of the wavebands, extracting the fundamental wave of each cluster, comparing the similarity between each waveband data of each cluster and the fundamental wave, finding out the grouping boundary line of each cluster, grouping each waveband data of each cluster, extracting the total time length of continuous similar wavebands in each cluster, and taking the maximum value of the total time length as the width of a sliding window. The window is used for partitioning the log KPI curve, so that the wave bands in each partitioned window are easy to cluster and classify, the whole log KPI curve is favorably and rapidly divided into wave band chains consisting of different types of wave bands, then the log KPI curve of an individual monitoring index is subjected to periodic detection and type detection marking, the individual log KPI curve is partitioned by using the window, and the wave bands in the log KPI curve are subjected to grouping and labeling by using fundamental waves.
The method for marking the wave band characteristics of the KPI curve obtained by the method comprises the following steps:
step A1, merging data points of all minutes in all log KPI curves, dividing the data points into a plurality of band segments with time width of s minutes, clustering the band segments into a plurality of clusters according to non-time dimensions of the band segments, extracting fundamental waves of all the clusters, comparing similarity between band data of all the clusters and the fundamental waves, finding out grouping boundary lines of all the clusters, and grouping the band data of all the clusters;
step A2, extracting the timestamps of the log KPI curve data sets assigned to the different groups to obtain a timestamp list for each group;
step A3, performing step-by-step subtraction on the timestamp lists of each group, namely subtracting the starting timestamp of the next item in each timestamp list from the starting timestamp of the current item to obtain an event trigger interval list;
step A4, combining the event trigger intervals of each cluster into a time interval KPI set, and calculating the similarity between the time interval KPI sets of each cluster according to NCC;
step A5, expanding the similarity of the time interval KPI sets among the clusters obtained in the step A4 into a similarity matrix;
step A6, sorting the similarities of the time interval KPI sets between the clusters by value, fitting the similarity values to a smooth curve, and obtaining the boundary of the similarity of the time interval KPI sets between the clusters by the knee point method;
step A7., marking adjacent clusters with numerical values larger than the inflection point in the similarity matrix as the same similar group, and counting the cluster number of each similar group;
step A8., calculating the total time interval of the group with the most clusters in the similarity group as the width of the sliding window;
step A9, dividing each log KPI curve, according to the sliding window obtained in step A8, into several log KPI curve window segments whose time width is the total time interval, and dividing each log KPI curve window segment, following the dividing method of step A1, into i log KPI curve data sets with a time width of 1 minute, each segment being a band;
comparing each fundamental wave obtained in step A1 with each band in each window of each log KPI curve one by one, sorting the similarities from large to small, finding the grouping boundary lines from the ordering, and grouping the bands to form a label chain composed of fundamental wave labels, thereby obtaining the pattern waveforms of the different KPIs, called KPI curve code pattern rearrangement tables;
step A10, placing the different KPI curve code pattern rearrangement tables together along the time dimension to obtain a KPI curve code pattern rearrangement association table.
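Two of the steps above lend themselves to a short sketch: the successive subtraction of step A3 and the knee point of step A6. The source does not say which knee point method is used, so the sketch assumes a common heuristic: sort the values in descending order and take the point farthest from the straight line joining the first and last points. Function names and sample values are hypothetical.

```python
def trigger_intervals(timestamps):
    """Step A3: subtract each start timestamp from the next one."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def knee_value(values):
    """Value at the knee of the descending-sorted curve (assumed heuristic).

    Expects a non-empty list; with fewer than three values the smallest
    value is returned.
    """
    ys = sorted(values, reverse=True)
    n = len(ys)
    if n < 3:
        return ys[-1]
    x0, y0, x1, y1 = 0, ys[0], n - 1, ys[-1]
    best_i, best_d = 0, -1.0
    for i, y in enumerate(ys):
        # Unnormalised distance from (i, y) to the chord through the end points.
        d = abs((y1 - y0) * i - (x1 - x0) * y + x1 * y0 - y1 * x0)
        if d > best_d:
            best_i, best_d = i, d
    return ys[best_i]


# Hypothetical start timestamps (in minutes) of one group, and similarities.
intervals = trigger_intervals([0, 5, 15, 30])
knee = knee_value([0.95, 0.93, 0.9, 0.3, 0.2, 0.1])
```

In this toy ordering the similarities drop sharply after the third value, and that is where the heuristic places the boundary.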
Advantageously, the label information obtained after the log KPI curve is processed contains all information of all bands, including two parts of band and waveform representation, the band labels are the fundamental wave type and the time arrangement information of the fundamental wave label, and the waveform label includes two kinds of service labels and period labels.
If different KPI curves carry the same KPI curve service label, they may have a causal relationship, and the probability is higher for aperiodic KPI curves than for periodic ones.
If the same KPI curve code pattern fundamental wave label appears in adjacent time segments of different KPI curves, the curves may have a causal relationship, and the probability is higher the more often this is repeated.
Further, step A1 includes the following steps:
step J1, extracting the per-minute data points of all log KPI curves into the same curve set $L$, and dividing the curve set $L$ into several log KPI curve data sets $M_i$ with a time width of s minutes, where i is the segment number;
step J2, using the DBSCAN algorithm, calculating the Euclidean distance between the data sets according to the attributes of each log KPI curve data set and clustering the i log KPI curve data sets to obtain k clusters and the abnormal items, wherein each cluster is a group data set and each group data set contains j log KPI curve data sets $F_j$;
step J3, calculating the arithmetic mean $\sum F_j / j$ of the j log KPI curve data sets in each group data set as the fundamental wave of the group;
step J4, using the NCC algorithm, calculating the waveform similarity between each log KPI curve data set $F_j$ of each group data set and the fundamental wave, sorting from large to small, and, among the log KPI curve data sets $F_j$ ranked in the top 95% by waveform similarity, taking the minimum waveform similarity as the grouping boundary line $B_k$ of the group;
step J5, using the NCC algorithm, calculating the waveform similarity $NCC_{M_i-J_k}$ between each log KPI curve data set $M_i$ and the fundamental wave of each group, judging whether each log KPI curve data set belongs to a group by taking the grouping boundary line of that group as the reference, ranking any log KPI curve data set that belongs to several groups at once by the classification score $Q$, and assigning the log KPI curve data set $M_i$ to the group with the smallest classification score $Q$, thereby obtaining the grouping information of each log KPI curve data set,
$$Q = \left(\frac{1 - NCC_{M_i-J_k}}{1 - B_k}\right)^2$$
Further, step A7 is replaced with: replacing every similarity value in the similarity matrix that is larger than the knee (inflection) point with 1 and every value lower than the knee point with 0;
marking adjacent clusters whose similarity in the resulting matrix is 1 as the same similar group, and counting the number of clusters in each similar group.
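The binarisation and grouping described here can be sketched as below. The sketch assumes that "adjacent clusters" means consecutive cluster indices i and i+1, and that a value exactly equal to the knee point is mapped to 0 (the source leaves this case open); function names and matrix values are hypothetical.

```python
def binarize(matrix, knee):
    """Replace similarities above the knee with 1 and all others with 0."""
    return [[1 if value > knee else 0 for value in row] for row in matrix]


def adjacent_groups(binary):
    """Group consecutive cluster indices whose pairwise entry is 1."""
    if not binary:
        return []
    groups, current = [], [0]
    for i in range(len(binary) - 1):
        if binary[i][i + 1] == 1:
            current.append(i + 1)      # cluster i+1 joins the running group
        else:
            groups.append(current)     # close the group and start a new one
            current = [i + 1]
    groups.append(current)
    return groups


# Hypothetical similarity matrix for four clusters.
similarity = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.8, 0.1],
    [0.2, 0.8, 1.0, 0.3],
    [0.1, 0.1, 0.3, 1.0],
]
binary = binarize(similarity, knee=0.5)
groups = adjacent_groups(binary)
sizes = [len(g) for g in groups]
```

The group with the most clusters is the one whose total time interval step A8 uses as the sliding window width.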
Further, the step of dividing the KPI curve window segment into bands in step A9 is: using the NCC algorithm, calculating one by one the similarity between the fundamental waves obtained in step A1 and the bands in each window of each log KPI curve to obtain the waveform similarities, and sorting them from large to small; among the bands ranked in the top 95% by waveform similarity, taking the minimum waveform similarity as the grouping boundary line $B'_k$ of the group; judging whether each log KPI curve data set belongs to a group by taking the grouping boundary line of that group as the reference; ranking any log KPI curve data set that belongs to several groups at once by its classification score $Q'$, and assigning it to the group with the smallest classification score $Q'$, thereby forming a label chain composed of fundamental wave labels and obtaining the pattern waveforms of the different KPIs, called KPI curve code pattern rearrangement tables,
$$Q' = \left(\frac{1 - NCC'}{1 - B'_k}\right)^2,$$
where $NCC'$ denotes the waveform similarity between the band and the fundamental wave of group k, by analogy with the classification score $Q$ of step J5.
Further, after all label chains are arranged along the time dimension, the causal relationships between different label chains occurring at different times are discovered based on the sequence mining algorithm SPADE or GSP.
Specific nouns in the fault log texts generated by the industrial control equipment of the same industrial control system influence one another causally, which shows up as paired nouns appearing together because of the same cause. Similar noun queues can therefore be grouped into one class, namely the event relationships obtained in step F8, and counting the frequency of each event relationship yields a log KPI curve. The log KPI curve accompanies the index KPI curve that the industrial control equipment obtains by monitoring the analog quantity of a physical parameter. Because the index KPI curve can be classified and clustered into a band chain with label-ordering characteristics, the log KPI curve has the same band chain characteristics. The band chain characteristics of index KPI curves produced by the same cause for different physical parameters are similar, and so are those of log KPI curves produced by the same cause for different event relationships.
To find the band chains, a sliding window of suitable width is slid along the log KPI curve, and a log KPI curve unit segment is cut out by the window. The equal-length bands extracted from the unit segment are labeled according to the similarity between the characteristic fundamental waves and the bands, so that the unit segment becomes a band chain with label-ordering characteristics. Each slide of the window along the log KPI curve thus yields one band chain; all band chains have the same length and differ only in the ordering of the class labels of their bands. After all band chains obtained through the sliding window are arranged along the time dimension, the causal relationships in time between band chains with different characteristics, that is, the causal relationships between event relationships, can be obtained from the differences in their ordering characteristics by means of the sequence mining algorithm SPADE, expert evaluation and knowledge graph fusion. This helps experts supplement the knowledge system for fault determination in the system and discover previously unknown associations between monitoring indexes, so that new early-warning control relationships and new regulation thresholds can be established in operation based on the newly discovered associations, improving the stability of each monitored object in the same system.
The technical problem solved by the invention is similar to that of the prior art CN110726898B. Inputting a waveform into an autoencoder network to obtain a compressed feature code, as in CN110726898B, corresponds in the present invention to extracting band chains from the KPI curve or deriving event tuples from the fault log. Inputting the compressed code into a classification model to obtain the type of the fault waveform corresponds to obtaining the causal relationships in time between band chains with different characteristics by means of the sequence mining algorithm SPADE, expert evaluation and knowledge graph fusion; or, equally, to feeding the event tuples into the existing fault event relationship table (the classification model) and classifying them into associated event groups based on Snowball.
Drawings
FIG. 1 is a log KPI curve generated from fault logs generated by industrial control equipment in the same industrial control system;
FIG. 2 is a label chain of formed fundamental wave labels;
fig. 3 shows the categories of the log KPI curves generated from the fault log text and clustered.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention; it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments, and all other embodiments obtained by those skilled in the art without any inventive work are within the scope of the present invention. In the following embodiments, the label chain and the band chain are the same meaning, and the KPI curve unit segment and the KPI curve window segment are the same meaning. The same industrial control system is composed of industrial control devices which have a direct or indirect material supply relationship, an electric energy transfer relationship, a heat energy transfer relationship, a mechanical energy transfer relationship, a magnetic field transfer relationship, an energy conversion relationship or a signal control relationship, the industrial control devices in the same industrial control system obtain fault logs based on monitoring indexes, and the fault logs also have correlation due to the fact that the monitoring indexes have correlation.
Example 1
A method for generating KPIs based on log keyword clustering comprises the following steps:
Step R1: collecting the fault logs that the industrial control devices in the same power station industrial control system network produce based on monitoring indexes, constructing event tuples from the fault logs, and processing them with the Snowball algorithm to construct event relations.
The method for constructing the event tuple comprises the following steps:
Step F1: setting a training sentence subset composed of training sentences, extracting corpora from the fault log and pairing each corpus with each training sentence to form sentence pairs to be processed, and segmenting the sentences in each pair based on a pre-constructed corpus, wherein the pre-constructed corpus comprises an industry corpus and a common corpus;
Step F2: converting each feature word of the segmented sentences into a word vector, calculating the similarity of each sentence pair using cosine similarity, and deleting the corpus if the similarity is lower than a threshold, wherein the threshold is set to 0.9;
Steps F1-F2 pick out, from the fault logs, the sentences whose semantic structure is used for referring, behavior recording and state description. Fault logs in an industrial control system generally follow such a grammar, whose descriptive sentence structure is less ambiguous; this helps remove error logs from the fault logs and keep the industrial record logs;
The corpus is segmented using the cut function of jieba:
def cut(sentence, cut_all=False, HMM=True)
Here, sentence is the sentence sample to be segmented; cut_all selects the segmentation mode: jieba has a full mode and an accurate mode, selected by True and False respectively, and the default is False, that is, the accurate mode; HMM controls whether the hidden Markov model is used for word segmentation, and it is enabled by default.
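The similarity filter of steps F1-F2 can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the embodiment: bag-of-words count vectors stand in for the word vectors, whitespace splitting stands in for jieba segmentation, a corpus is kept when its best similarity to any training sentence reaches the threshold, and the function names are hypothetical.

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity of two token lists using bag-of-words counts."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0

def filter_corpus(log_sentences, training_sentences, threshold=0.9):
    """Keep log sentences whose best similarity to a training sentence
    reaches the threshold (0.9 in the embodiment)."""
    kept = []
    for sentence in log_sentences:
        tokens = sentence.split()  # assumption: whitespace tokens, not jieba
        best = max(
            (cosine_similarity(tokens, t.split()) for t in training_sentences),
            default=0.0,
        )
        if best >= threshold:
            kept.append(sentence)
    return kept
```

In practice the token lists would come from the segmenter and the vectors from a trained word-embedding model; only the thresholding logic is illustrated here.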
Step F3: performing word segmentation on the corpus remaining after step F2 to form a word segmentation queue of feature words, and labeling the part of speech of each feature word to obtain the part-of-speech queue of the corpus;
Part-of-speech tagging is performed with jieba; Yangjun describes the usage steps and the part-of-speech classification table of jieba.
Step F4: if the part-of-speech queue contains feature words with special parts of speech, using a named entity recognition model to obtain the boundary and category of the named entity from those feature words, and updating their parts of speech in the queue to the boundary and category of the named entity, obtaining an updated part-of-speech queue;
The special parts of speech include numerals and time words. In the application scenario of this embodiment, only numerals and time words are prone to inaccurate part-of-speech classification. For example, in FIG. 3, segmenting the corpus "16:10:23 (set I) signal occurrence pulse allow" yields the part-of-speech queue "{16: m, :: x, 10: m, :: x, 23: m, (: x, set I: n, ): x, signal: n, occurrence: v, pulse: n, allow: v}", where m represents a numeral, x represents a character string, n represents a noun, and v represents a verb. The part-of-speech queue obtained after processing in step F4 is "{16:17:00: t, (: x, set I: n, ): x, signal: n, occurrence: v, another channel: n, reception: v}". This avoids a time word, whose part of speech is difficult to identify, being tagged as a numeral, so that queues containing time words and queues containing numerals can be distinguished by their part-of-speech queues.
The named entity recognition model can recognize named mentions in the corpus to be processed. In a narrow sense, it identifies four types of named entities: person names, place names, organization names and proper nouns. It generally comprises two parts: (1) identifying entity boundaries; (2) determining entity categories (person name, place name, organization name, or other). Named entities can be identified in a variety of ways, such as rule-based methods, feature-template-based methods and neural-network-based methods, and the named entity recognition model may be constructed based on any of these methods.
For example, a named entity recognition model (CRF) performs entity annotation on the sentence "I came to Tujia village", and the correct annotation result is: I/O came/O to/O Tu/B jia/M village/E, where O means the current word is not part of a geographical named entity, and B, M and E mean the current word is the beginning, middle and end of a geographical named entity, respectively. When a linear-chain CRF is used to solve this problem, (O, O, O, B, M, E) is one of its candidate tagging sequences.
Step F5: classifying the remaining corpus according to the labels assigned in step F4, counting the occurrence frequency of the part-of-speech queue of each category, and counting the occurrence frequency of verbs and of nouns in the part-of-speech queue of each category;
Step F6: sorting the part-of-speech queues of each category in descending order by the occurrence frequency of verbs and of nouns respectively, selecting from each of the two orderings the top-ranked part-of-speech queue set according to a ranking threshold, extracting the corpus corresponding to the intersection of the two sets, and constructing the real training set;
Step F7: from the corpus of the real training set, selecting the word segmentation queues whose part-of-speech combination is [n, v, n], and extracting from each the first and second words whose part of speech is noun or proper noun as event I and event II respectively, forming an event tuple;
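The [n, v, n] screening of step F7 can be sketched as follows. This is a minimal sketch under stated assumptions: the queue is a list of (word, tag) pairs, any tag beginning with "n" counts as a noun or proper noun, the pattern is matched on consecutive triples, and the function name is hypothetical.

```python
def extract_event_tuples(tagged_queue):
    """Return (event I, event II) for every consecutive noun-verb-noun triple.

    tagged_queue is a list of (word, tag) pairs; tags starting with "n"
    are treated as nouns or proper nouns and "v" as a verb (assumption).
    """
    tuples = []
    for i in range(len(tagged_queue) - 2):
        (w1, t1), (_, t2), (w3, t3) = tagged_queue[i:i + 3]
        if t1.startswith("n") and t2 == "v" and t3.startswith("n"):
            tuples.append((w1, w3))
    return tuples
```

For instance, a queue tagged signal: n, occurrence: v, pulse: n, allow: v would yield the single event tuple (signal, pulse).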
Step F8: using the Snowball algorithm to find the event association rules of the event tuples, and finding the associated event groups among the event tuples according to these rules:
Step C1: using the existing fault event relation table, matching the queues in the event tuples that contain events in the table, and generating templates; a template is a five-tuple, namely <left>, event 1 type, <middle>, event 2 type, <right>; len is a length that can be set arbitrarily, <left> is the vector representation of the len words to the left of event 1, <middle> is the vector representation of the words between event 1 and event 2, and <right> is the vector representation of the len words to the right of event 2;
Step C2: clustering the generated templates, grouping the templates whose similarity is greater than 0.7 into one class, generating a new template by the averaging method, and adding the new template to the rule base that stores the templates. From step C2, the template format can be written as $P = \langle L, E_1, M, E_2, R \rangle$, where $E_1$ and $E_2$ denote the event 1 type and event 2 type of template $P$, $L$ is the vector representation of the three words to the left of $E_1$, $M$ is the vector representation of the words between $E_1$ and $E_2$, and $R$ is the vector representation of the three words to the right of $E_2$. The similarity between template 1, $P_1 = \langle L_1, E_1, M_1, E_2, R_1 \rangle$, and template 2, $P_2 = \langle L_2, E_1', M_2, E_2', R_2 \rangle$, is calculated as follows. If the condition $E_1 = E_1'$ and $E_2 = E_2'$ is satisfied, that is, the event 1 type $E_1$ of template $P_1$ is identical to the event 1 type $E_1'$ of template $P_2$, and the event 2 type $E_2$ of template $P_1$ is identical to the event 2 type $E_2'$ of template $P_2$, then the similarity of $P_1$ and $P_2$ can be calculated as $\mu_1 L_1 \cdot L_2 + \mu_2 M_1 \cdot M_2 + \mu_3 R_1 \cdot R_2$, where $\mu_1$, $\mu_2$ and $\mu_3$ are weights. Because $M_1 \cdot M_2$ greatly influences the similarity between templates, $\mu_2$ can be set larger than $\mu_1$ and $\mu_3$. If the condition is not satisfied, the similarity of $P_1$ and $P_2$ is recorded as 0;
The averaging method averages the vectors of the templates in the same class to generate a new template; see "relation extraction snowball algorithm-programmer's book management" at https://www.pianshen.com/article/61161224295/.
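The template matching and averaging of step C2 can be sketched as follows. This is a minimal sketch, assuming each template is a dictionary with hypothetical keys left, e1, middle, e2 and right, that the context vectors are plain lists of equal length, and that the weights shown are illustrative values chosen only so that the middle weight is the largest.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def template_similarity(p1, p2, mu=(0.2, 0.6, 0.2)):
    """Weighted match of two templates; 0 when the event types differ."""
    if p1["e1"] != p2["e1"] or p1["e2"] != p2["e2"]:
        return 0.0
    return (mu[0] * dot(p1["left"], p2["left"])
            + mu[1] * dot(p1["middle"], p2["middle"])
            + mu[2] * dot(p1["right"], p2["right"]))

def average_template(cluster):
    """Average the context vectors of templates that share event types."""
    n = len(cluster)
    merged = {"e1": cluster[0]["e1"], "e2": cluster[0]["e2"]}
    for key in ("left", "middle", "right"):
        merged[key] = [sum(vals) / n for vals in zip(*(t[key] for t in cluster))]
    return merged
```

Templates whose pairwise similarity exceeds 0.7 would be collected into one cluster and passed to average_template before being stored in the rule base.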
Step C3: calculating the similarity between each event tuple template obtained in step C1 and each template in the rule base, discarding the templates whose similarity is smaller than the threshold 0.7, and adding the events in the templates whose similarity is greater than the threshold 0.7 to the log key event relation table, which replaces the fault event relation table;
Step C4: repeating steps C1-C3 until step C3 no longer discards any template;
Step R2: marking the fault logs with each event relation generated in step C4 as a log key event label.
As shown in FIG. 1, the number of occurrences of each log key event label per minute is used as a monitoring index to establish the log KPI curves, and a Gaussian kernel is used to smooth each log KPI curve;
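Building and smoothing one log KPI curve can be sketched as follows. This is a minimal sketch, assuming each occurrence of a label is given as an integer minute index; the kernel width sigma, the kernel radius and the edge handling (repeating the boundary value) are illustrative choices that the embodiment does not specify.

```python
import math
from collections import Counter

def kpi_curve(event_minutes, start, end):
    """Count occurrences of one label per minute over [start, end)."""
    counts = Counter(event_minutes)
    return [counts[m] for m in range(start, end)]

def gaussian_smooth(series, sigma=1.0, radius=3):
    """Smooth a series with a normalized Gaussian kernel."""
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-(k * k) / (2 * sigma * sigma)) for k in offsets]
    total = sum(kernel)
    kernel = [w / total for w in kernel]
    n = len(series)
    smoothed = []
    for i in range(n):
        acc = 0.0
        for k, w in zip(offsets, kernel):
            j = min(max(i + k, 0), n - 1)  # repeat edge values at the borders
            acc += w * series[j]
        smoothed.append(acc)
    return smoothed
```

One curve is produced per log key event label, so a system with many labels yields a family of smoothed curves on a common time axis.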
Step A9: marking according to the periodicity of the log KPI curves;
A periodicity verification check is performed on the log KPI curve of each event relation, and the Gaussian-smoothed log KPI curves are marked according to their periodicity differences; these labels are called log KPI curve period labels;
The periodicity verification check of step D1 includes the following steps:
Step Z01: extracting the spectrum intensity graph of the log KPI curve using the Fourier transform;
Step Z02: extracting the point with the highest amplitude and calculating the corresponding period, namely the period to be checked;
Step Z03: setting a hypothesized period, namely the expected period; performing correlation strength detection on the period to be checked if and only if its length is within 95-105% of the expected period, and accepting it as a period meeting the requirement if the spectrum strength is sufficient.
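The periodicity check of steps Z01-Z03 can be sketched as follows. This is a minimal sketch under stated assumptions: a naive discrete Fourier transform replaces an optimized FFT, the "correlation strength detection" is approximated by a simple amplitude threshold, and min_amplitude and both function names are hypothetical.

```python
import cmath

def dominant_period(series):
    """Return (period, amplitude) of the strongest non-zero frequency."""
    n = len(series)
    mean = sum(series) / n
    centered = [v - mean for v in series]  # remove the constant component
    best_k, best_amp = 0, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, v in enumerate(centered))
        if abs(coeff) > best_amp:
            best_k, best_amp = k, abs(coeff)
    period = n / best_k if best_k else None
    return period, best_amp

def check_period(series, expected_period, min_amplitude):
    """Accept the period to be checked only if it lies within 95-105% of
    the expected period and its spectral amplitude is strong enough."""
    period, amp = dominant_period(series)
    if period is None:
        return None
    in_range = 0.95 * expected_period <= period <= 1.05 * expected_period
    return period if in_range and amp >= min_amplitude else None
```

A curve for which check_period returns a value would receive a period label; a curve for which it returns None would be treated as aperiodic with respect to that expected period.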
Step A10: marking according to similarity classification of the log KPI curves;
The pairwise similarity between the log KPI curves is calculated with the NCC algorithm and filled into a symmetric similarity matrix, in which the row and column indices are the serial numbers of the log KPI curves, the number of rows and columns equals the number of log KPI curves, and each value is the similarity between the two corresponding log KPI curves;
A spectral clustering algorithm clusters the log KPI curves according to the similarity matrix, and the curves of different clusters are marked with different log KPI curve labels, yielding the mapping relation (implicit service relation) of the log key event labels;
A classification method for spectral clustering is described at https://zhuanlan.zhihu.com/p/29849122.
Step A11 the KPI curve obtained in step A10 was pre-processed as in example 4.
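The pairwise similarity matrix of step A10 can be sketched as follows. This is a minimal sketch, assuming the zero-lag form of normalized cross-correlation without mean removal and equal-length curves; the diagonal is set to 1 by convention, and the function names are hypothetical.

```python
import math

def ncc(x, y):
    """Zero-lag normalized cross-correlation of two equal-length series."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den if den else 0.0

def similarity_matrix(curves):
    """Symmetric matrix whose entry [i][j] is the NCC of curve i and curve j."""
    n = len(curves)
    matrix = [[1.0] * n for _ in range(n)]  # a curve is identical to itself
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = ncc(curves[i], curves[j])
    return matrix
```

The resulting matrix could then be handed to a spectral clustering routine that accepts a precomputed affinity, for example scikit-learn's SpectralClustering with affinity="precomputed", after mapping any negative similarities to non-negative values.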
Example 2
The method for marking waveband characteristics of the log KPI curve obtained based on the embodiment 1 comprises the following steps:
Step A1: extracting the data point sets of all log KPI curves in each minute into the same curve set L, and dividing the curve set L into several log KPI curve data sets $M_i$ with a time width of s minutes, where i is the segment number;
Step A2: using the dbscan algorithm, calculating the Euclidean distances between the segments according to the attributes of each log KPI curve data set, and clustering the i segments of log KPI curve data sets to obtain k clusters and the abnormal items, wherein each cluster is a grouped data set containing j segments of log KPI curve data sets $F_j$;
Step A3: calculating the arithmetic mean $\sum F_j / j$ of the j segments of log KPI curve data sets in each grouped data set as the fundamental wave of the group;
Step A4: using the NCC algorithm, calculating the waveform similarity between each log KPI curve data set $F_j$ of each grouped data set and the fundamental wave, sorting the similarities from largest to smallest, and, among the log KPI curve data sets $F_j$ ranked in the top 95% by waveform similarity, taking the minimum waveform similarity as the group boundary line $B_k$ of the group;
Step A5: using the NCC algorithm, calculating the waveform similarity $NCC_{M_i-J_k}$ between each log KPI curve data set $M_i$ and the fundamental wave of each group, and judging whether each data set belongs to a group by taking that group's boundary line as the reference; a data set that belongs to several groups at once is ranked by the classification score $Q$ and assigned to the group with the smallest $Q$, which yields the grouping information of each log KPI curve data set:
$Q = \left( \dfrac{1 - NCC_{M_i-J_k}}{1 - B_k} \right)^2$
The larger $NCC_{M_i-J_k}$ is, the smaller $Q$ is and the more similar $M_i$ is to cluster k. When the current data set $M_i$ has the same similarity $NCC_{M_i-J_k}$ to different clusters, a smaller $B_k$ indicates that this similarity ranks higher in the waveform similarity ordering of cluster k. The formula therefore gives the likelihood of $M_i$ belonging to each candidate cluster, and hence the cluster it most likely belongs to.
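Steps A3-A5 can be sketched as follows. This is a minimal sketch, assuming equal-length segments, the zero-lag form of NCC without mean removal, and that a segment "belongs" to a group when its similarity to the group's fundamental wave is at least the group boundary line; all function names are hypothetical.

```python
import math

def ncc(x, y):
    """Zero-lag normalized cross-correlation of two equal-length series."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den if den else 0.0

def fundamental_wave(segments):
    """Point-wise arithmetic mean of the segments of one group (step A3)."""
    return [sum(vals) / len(segments) for vals in zip(*segments)]

def group_boundary(segments, wave, keep=0.95):
    """Smallest similarity among the top 95% most similar segments (step A4)."""
    sims = sorted((ncc(s, wave) for s in segments), reverse=True)
    kept = sims[:max(1, math.ceil(keep * len(sims)))]
    return kept[-1]

def assign_group(segment, waves, boundaries):
    """Index of the qualifying group with the smallest score Q, or None (step A5)."""
    best, best_q = None, None
    for k, (wave, b) in enumerate(zip(waves, boundaries)):
        sim = ncc(segment, wave)
        if sim < b:
            continue  # below the group boundary line: not a member
        q = ((1 - sim) / (1 - b)) ** 2 if b < 1.0 else 0.0
        if best_q is None or q < best_q:
            best, best_q = k, q
    return best
```

Segments for which assign_group returns None match no group and can be handled like the abnormal items produced by the clustering step.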
Step A6: extracting the timestamps of the log KPI curve data sets assigned to the different groups to obtain a timestamp list for each group;
Step A7: differencing the timestamp list of each group item by item, that is, subtracting the start timestamp of the current item from the start timestamp of the next item, to obtain an event trigger interval list;
An event trigger interval is the time interval between two adjacent log KPI curve data sets in a grouped data set;
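The differencing of steps A6-A7 can be sketched as follows. This is a minimal sketch, assuming the start timestamps of one group are numeric values (for example minute indices) and that they are sorted before differencing; the function name is hypothetical.

```python
def trigger_intervals(timestamps):
    """Intervals between consecutive start timestamps of one group."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]
```

Applying the function to the timestamp list of every group gives one event trigger interval list per group, which is the input to the similarity comparison that follows.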
Step A8: merging the event trigger intervals of each cluster into a time interval KPI set, and calculating the similarity between the time interval KPI sets of the clusters with NCC; if the time interval KPI sets of different clusters are similar, the waveforms of those clusters are similar over the total time width;
Step A9: expanding the similarities of the time interval KPI sets among the clusters obtained in step A8 into a similarity matrix; as shown in Table 1, a to d are the serial numbers of the clusters, the number of rows and columns of the similarity matrix equals the number of clusters, each value is the similarity of the time interval KPI sets of two clusters, and the matrix is symmetric;
Figure 140796DEST_PATH_IMAGE021
Step A10: sorting the similarities of the time interval KPI sets among the clusters by value, fitting the similarity values to a smooth curve, and obtaining the boundary of the similarity of the time interval KPI sets among the clusters, namely the knee point, by the knee point method;
Step A11: replacing the similarity values in the similarity matrix that are greater than the knee point with 1 and those smaller than the knee point with 0, as shown in Table 2;
Figure 597185DEST_PATH_IMAGE022
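Steps A10-A11 can be sketched as follows. This is a minimal sketch under stated assumptions: the knee is taken as the point of the descending-sorted curve that lies farthest from the straight line joining its first and last points, which is one common knee heuristic and not necessarily the one used in the embodiment; no smoothing fit is applied; values equal to the knee are mapped to 1, a case the text leaves open.

```python
def knee_value(values):
    """Value at the knee of the descending-sorted similarity curve."""
    ordered = sorted(values, reverse=True)
    n = len(ordered)
    if n < 3:
        return ordered[-1]
    y0, y1, x1 = ordered[0], ordered[-1], n - 1

    def distance(i):
        # unnormalized perpendicular distance from point i to the chord
        return abs((y1 - y0) * i - x1 * ordered[i] + x1 * y0)

    return ordered[max(range(n), key=distance)]

def binarize(matrix, threshold):
    """Replace entries at or above the threshold with 1 and the rest with 0."""
    return [[1 if v >= threshold else 0 for v in row] for row in matrix]
```

The binarized matrix is what the grouping of adjacent clusters operates on in the next step.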
Step A12: marking adjacent clusters whose similarity is 1 in the similarity matrix obtained in step A11 as the same similar group, and counting the number of clusters in each similar group;
Step A13: calculating the total time interval of the similar group with the most clusters as the width of the sliding window;
setting the total time interval as the width of a sliding window, and dividing the log KPI curve into a plurality of segments by using the window, wherein the time width of each segment covers the similarity group with the maximum time length obtained in the substep S12. The sliding window is used for scanning the log KPI curve, the continuously appeared clusters can be quickly divided into a window and then quickly clustered to the same waveform category, the calculated amount is reduced, the wave bands of the log KPI curve can be integrally classified, and the possibility of missing knowledge is reduced.
The above NCC (normalized cross correlation) algorithm is defined as:
Figure 352652DEST_PATH_IMAGE023
In the formula, $x_t$ is the background waveform and $y_{t+h}$ is the transformed waveform. The value of NCC lies between -1 and 1, where -1 means the waveforms before and after transformation are opposite, 0 means the two waveforms are orthogonal, and 1 means they are identical. NCC describes only the macroscopic similarity of the two waveforms and is independent of their amplitude and energy attenuation.
Step A14: first, according to the sliding window obtained in step A13, dividing each log KPI curve obtained in step F10 into several log KPI curve window segments whose time width is the total time interval, and then, following the dividing method of step A1, dividing each window segment into i segments of log KPI curve data sets $M_i$ with a time width of 1 minute, each segment being a band;
Next, the NCC algorithm is used to calculate, one by one, the similarity between the fundamental waves obtained in step A3 and the bands in each window of each log KPI curve, obtaining $NCC_{M_i-J_k}$, and the similarities are sorted from largest to smallest; among the bands ranked in the top 95% by waveform similarity, the minimum waveform similarity is taken as the group boundary line $B_k$ of the group. Taking the boundary line of each group as the reference, it is judged whether each log KPI curve data set $M_i$ belongs to the group; a data set $M_i$ that belongs to several groups at once is ranked by the classification score $Q$ and assigned to the group with the smallest $Q$. This forms the label chain of fundamental wave labels shown in FIG. 2 and yields the mode waveforms of the different KPIs, which are called KPI curve code pattern rearrangement tables,
Figure 95589DEST_PATH_IMAGE026
The label information obtained after step A14 contains all information of all bands and consists of two parts, the band labels and the waveform labels: a band label carries the fundamental wave type, and a waveform label is of two kinds, the service label and the period label.
In this way, each time the window slides along a log KPI curve, one band chain is obtained; all band chains are equal in length and differ only in the ordering of their band labels. In this embodiment, the curve characteristics of the log KPI curves of different, related monitoring indexes are converted into label chain ordering characteristics. Because of this relation, although the amplitudes of the log KPI curves differ, their periods and rhythms, that is, the arrangements of their labels, are similar, so a large number of related KPI curves can be unified into standard and consistent label chains.
Step A15: placing the different KPI curve code pattern rearrangement tables on a unified time dimension to obtain the KPI curve code pattern rearrangement association table.
Different log KPI curves may have a causal relationship if they carry the same log KPI curve service label, and the probability is higher for aperiodic log KPI curves than for periodic ones.
Different log KPI curves may have a causal relationship if the same fundamental wave label appears in adjacent time segments, and the probability is higher the more often this repeats.
After all label chains are arranged along the time dimension, the sequence mining algorithm SPADE or GSP can be used to discover causal relationships between label chains that occur at different times: if two events always occur in pairs, they are considered related, and if one event always occurs before the other, the two are considered to have a causal relationship, with the earlier event as the cause and the later event as the effect. This helps experts supplement the knowledge system used for fault determination in the system and discover previously unknown associations between monitoring indexes, so that new early-warning control relations and regulation thresholds can be established from the newly discovered associations during operation, improving the stability of each monitored object in the same system.
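The "always occurs before" criterion can be illustrated with a simple precedence count. This is a deliberately simplified stand-in, not an implementation of SPADE or GSP: it only counts how often one label first appears before another within each time window, and min_support and both function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def precedence_counts(windows):
    """Count, per ordered pair (a, b), the windows where a first appears before b."""
    counts = Counter()
    for chain in windows:
        first_seen = {}
        for pos, label in enumerate(chain):
            first_seen.setdefault(label, pos)
        # dict order equals first-occurrence order, so a precedes b here
        for a, b in combinations(first_seen, 2):
            counts[(a, b)] += 1
    return counts

def candidate_causes(windows, min_support=0.8):
    """Pairs (a, b) where a precedes b in at least min_support of all windows."""
    counts = precedence_counts(windows)
    total = len(windows)
    return sorted(pair for pair, c in counts.items() if c / total >= min_support)
```

Pairs returned by candidate_causes are only candidates for expert review; a full sequence mining algorithm would additionally handle longer patterns and time gaps.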

Claims (10)

1. A method for generating KPI curves based on log event relations comprises the following steps:
step F1, setting a training sentence subset consisting of training sentences, obtaining a fault log by the industrial control equipment in the same industrial control system based on monitoring indexes, forming a sentence pair to be processed by the corpora in the fault log and each training sentence respectively, calculating the similarity, and deleting the corpora with the similarity lower than a threshold value one;
step F2., performing word segmentation on the residual corpus in the step F1, generating a word segmentation queue consisting of a plurality of characteristic words, and labeling part of speech of the plurality of characteristic words to obtain a part of speech queue of the corpus;
step F3., if the part-of-speech queue contains a plurality of special feature words corresponding to special parts-of-speech, obtaining the boundary and category of the named entity from the plurality of special feature words by using the named entity recognition model, updating the parts-of-speech of the special feature words in the part-of-speech queue to the boundary and category of the named entity, and obtaining an updated part-of-speech queue, wherein the special parts-of-speech includes: number word, time word;
step F4., classifying the remaining corpora according to the label of the remaining corpora of F3, counting the occurrence frequency of part-of-speech queues of each category, sorting in descending order, selecting part-of-speech queues with the sorting greater than a threshold two, and counting various parts-of-speech queues of each category: the occurrence frequency of verbs and nouns is subjected to descending order, two parts of speech queue sets with the top rank are sequentially screened out from the two orders according to an ordering threshold, the corpus corresponding to the intersection of the two parts of speech queue sets is extracted, and a true training set is constructed;
step F5., screening a participle queue with part-of-speech tagging combination of [ n, v, n ] from the corpus of the real training set, wherein n represents the part-of-speech of a noun, v represents the part-of-speech of a verb, and extracting a first participle and a second participle with parts-of-speech of a noun or a proper noun from the participle queue as an event I and an event II respectively to form an event tuple;
step F6, based on the existing fault event relation table, using Snowball algorithm to find the event association rule of the event tuple, and finding the association event group in the event tuple according to the event association rule, namely generating a log key event relation table;
step F7, repeatedly using the step F6 based on the log key event relation table until convergence;
step F8., using each event relation generated in step F7 as a log key event label to mark a fault log, using the frequency of each log key event label appearing per minute as a monitoring index, establishing each log KPI curve, and using a gaussian kernel to smooth each log KPI curve.
2. The method according to claim 1, wherein the calculating of the similarity in step F1 includes the steps of: respectively segmenting the sentences in the sentence pairs based on a pre-constructed corpus, wherein the pre-constructed corpus comprises an industry corpus and a common corpus;
and converting each characteristic word of the sentence after word segmentation into a word vector, respectively calculating the similarity of each sentence pair by using cosine similarity, and deleting the corpus if the similarity is lower than a threshold value one.
3. The method of claim 2, wherein step F8 is further followed by:
extracting a frequency spectrum intensity graph of a log KPI curve by using Fourier transform;
z02, extracting the point with the highest vibration amplitude and calculating the corresponding period, namely the period to be checked;
and Z03, setting an assumed period, namely a waiting period, detecting the correlation strength of the period to be detected if and only if the length of the period to be detected is within the range of 95-105% of the expected period, determining the period to be detected as a period meeting the requirement if the spectrum strength is sufficient, and marking a filtered log KPI curve according to the periodic difference of the log KPI curves, namely a log KPI curve period label.
4. The method of claim 3, wherein step Z03 is further followed by:
the similarity matrix is filled with the similarities, the serial numbers of rows and columns in the matrix are the numbers of the log KPI curves, and the number of rows and columns of the similarity matrix is the number of the log KPI curves;
and Z05, outputting different clusters according to the similarity matrix by using a spectral clustering algorithm, and marking different log KPI curve labels, called KPI curve service labels, for the different clusters.
5. The method according to claim 1, wherein step F6 includes:
c1, matching a queue containing the events in the fault event relation table in the event tuple by using the existing fault event relation table, and generating a template; the format of the template is five-tuple form, which is < left >, event 1 type, < middle >, event 2 type, < right > respectively; len is a length which can be set arbitrarily, < left > is a vector representation of len words on the left side of the event 1, < middle > is a vector representation of words between the event 1 and the event 2, and < right > is a vector representation of len words on the right side of the event;
step C2: clustering the generated templates, grouping the templates whose similarity is greater than threshold three into one class, generating a new template by the averaging method, and adding the new template to the rule base that stores the templates; from step C2, the template format can be written as $P = \langle L, E_1, M, E_2, R \rangle$, where $E_1$ and $E_2$ denote the event 1 type and event 2 type of template $P$, $L$ is the vector representation of the three words to the left of $E_1$, $M$ is the vector representation of the words between $E_1$ and $E_2$, and $R$ is the vector representation of the three words to the right of $E_2$; the similarity between template 1, $P_1 = \langle L_1, E_1, M_1, E_2, R_1 \rangle$, and template 2, $P_2 = \langle L_2, E_1', M_2, E_2', R_2 \rangle$, is calculated as follows: if the condition $E_1 = E_1'$ and $E_2 = E_2'$ is satisfied, that is, the event 1 type $E_1$ of template $P_1$ is identical to the event 1 type $E_1'$ of template $P_2$, and the event 2 type $E_2$ of template $P_1$ is identical to the event 2 type $E_2'$ of template $P_2$, then the similarity of $P_1$ and $P_2$ can be calculated as $\mu_1 L_1 \cdot L_2 + \mu_2 M_1 \cdot M_2 + \mu_3 R_1 \cdot R_2$, where $\mu_1$, $\mu_2$ and $\mu_3$ are weights; because $M_1 \cdot M_2$ greatly influences the similarity between templates, $\mu_2$ can be set larger than $\mu_1$ and $\mu_3$; if the condition is not satisfied, the similarity of $P_1$ and $P_2$ is recorded as 0;
step C3: calculating the similarity between each event tuple template obtained in step C1 and each template in the rule base, discarding the templates whose similarity is smaller than threshold three, and adding the events in the templates whose similarity is greater than threshold three to the log key event relation table, which replaces the fault event relation table.
6. A method for characterising a KPI curve obtained by the method of claim 4, including the steps of:
step A1, merging data points of all minutes in all log KPI curves, dividing the data points into a plurality of band segments with time width of s minutes, clustering the band segments into a plurality of clusters according to non-time dimensions of the band segments, extracting fundamental waves of all the clusters, comparing similarity between band data of all the clusters and the fundamental waves, finding out grouping boundary lines of all the clusters, and grouping the band data of all the clusters;
a2, extracting the time stamps of all sections of log KPI curve data sets which are divided into different groups to obtain a time stamp list of each group;
step A3, performing step-by-step subtraction on the timestamp lists of each group, namely subtracting the starting timestamp of the next item in each timestamp list from the starting timestamp of the current item to obtain an event trigger interval list;
step A4, combining the event trigger intervals of each cluster into a time interval KPI set, and calculating the similarity between the time interval KPI sets of each cluster according to NCC;
step A5, expanding the similarity of the time interval KPI sets among the clusters obtained in the step A4 into a similarity matrix;
a6, sequentially ordering the similarity of the time interval KPI sets among the clusters according to the magnitude of the numerical values, fitting the numerical values of the similarity into a smooth line, and obtaining a boundary of the similarity of the time interval KPI sets among the clusters according to a knee point method;
step A7., marking adjacent clusters with numerical values larger than the inflection point in the similarity matrix as the same similar group, and counting the cluster number of each similar group;
step A8., calculating the total time interval of the group with the most clusters in the similarity group as the width of the sliding window;
step A9, dividing each log KPI curve, according to the sliding window obtained in step A8, into several log KPI curve window segments whose time width equals the total time interval, and dividing each window segment, by the dividing method of step A1, into i log KPI curve data sets with a time width of 1 minute
Figure 240855DEST_PATH_IMAGE012
each data set being a band;
comparing each fundamental wave obtained in step A1 one by one with each band in each window of each log KPI curve, sorting the similarities in descending order, determining the grouping boundary lines from the sorted values, and grouping the bands to form a label chain of fundamental wave labels, thereby obtaining the mode waveforms of the different KPIs, the mode waveforms being called KPI curve code pattern rearrangement tables;
and step A10, placing the different KPI curve code pattern rearrangement tables side by side along a common time dimension to obtain a KPI curve code pattern rearrangement association table.
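Steps A3 to A6 can be illustrated with a minimal sketch in Python. It is not the patented implementation. The function names, the sample timestamps and the sample similarity values are hypothetical. NCC is taken here as zero-lag, zero-mean normalized cross-correlation with sequences truncated to the shorter length, and the knee is found with a simple chord-distance heuristic without the smoothing step; the claim does not fix any of these details, so all of them are assumptions.

```python
import math


def trigger_intervals(start_timestamps):
    """Step A3: next start timestamp minus current start timestamp."""
    return [b - a for a, b in zip(start_timestamps, start_timestamps[1:])]


def ncc(x, y):
    """Zero-lag, zero-mean normalized cross-correlation in [-1, 1].

    Assumption: the two sequences are truncated to the shorter length.
    """
    n = min(len(x), len(y))
    if n == 0:
        return 0.0
    x, y = x[:n], y[:n]
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    denom = math.sqrt(sum(v * v for v in dx) * sum(v * v for v in dy))
    if denom == 0.0:
        return 0.0  # constant sequence: correlation undefined, treated as 0
    return sum(a * b for a, b in zip(dx, dy)) / denom


def similarity_matrix(interval_sets):
    """Steps A4 and A5: pairwise NCC between the interval sets of all clusters."""
    return [[ncc(a, b) for b in interval_sets] for a in interval_sets]


def knee_value(similarities):
    """Step A6 (heuristic, assumption): sort in descending order and return the
    value that lies farthest from the chord joining the first and last points."""
    values = sorted(similarities, reverse=True)
    n = len(values)
    if n == 0:
        raise ValueError("no similarities given")
    if n < 3:
        return values[-1]
    first, last = values[0], values[-1]

    def offset(i):
        chord = first + (last - first) * i / (n - 1)
        return abs(values[i] - chord)

    return values[max(range(n), key=offset)]


# Hypothetical start timestamps (seconds) of three event clusters.
clusters = [
    [0, 10, 30, 40, 60],
    [5, 15, 35, 45, 65],
    [0, 30, 40, 50, 80],
]
intervals = [trigger_intervals(ts) for ts in clusters]
sim = similarity_matrix(intervals)
knee = knee_value([0.95, 0.9, 0.88, 0.3, 0.2, 0.1])  # hypothetical similarities
```

In this toy data the first two clusters fire with the same rhythm, only shifted by five seconds, so their interval lists are identical and their NCC is 1, while the third cluster is uncorrelated with the first.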
7. The method of claim 6, wherein step A1 comprises the steps of:
step J1, extracting the per-minute data point sets of all log KPI curves into a single curve set L, and dividing the curve set L into several log KPI curve data sets M_i, each with a time width of s minutes, where i is the segment number;
step J2, using the DBSCAN algorithm, calculating the Euclidean distance between the data sets from the attributes of each log KPI curve data set, and clustering the i log KPI curve data sets into k clusters and outliers, wherein each cluster is a group data set and each group data set contains j log KPI curve data sets F_j;
step J3, calculating the arithmetic mean ΣF_j / j of the j log KPI curve data sets in each group data set as the fundamental wave of the group;
step J4, using the NCC algorithm, calculating the waveform similarity between each log KPI curve data set F_j in each group data set and the fundamental wave, sorting the similarities in descending order, selecting the log KPI curve data sets F_j whose waveform similarity ranks in the top 95%, and taking the minimum waveform similarity among them as the grouping boundary line B_k of the group;
step J5, using the NCC algorithm, calculating the waveform similarity between each log KPI curve data set M_i and the fundamental wave of each group
Figure 401316DEST_PATH_IMAGE013
judging, with the grouping boundary line of each group as the reference, whether each log KPI curve data set belongs to that group; for a log KPI curve data set that belongs to several groups at once, ranking those groups by the classification score Q and assigning the log KPI curve data set M_i to the group with the smallest classification score Q, thereby obtaining the grouping information of each log KPI curve data set,
Figure 336911DEST_PATH_IMAGE014
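Steps J3 to J5 can be sketched as follows, assuming the DBSCAN clustering of step J2 has already produced the groups. This is an illustrative sketch only. The helper names and sample segments are hypothetical. The "top 95%" count is rounded down, which is an assumption. The classification score Q is available only as a formula image in this text, so the sketch stops at the membership test and does not resolve segments that fall into several groups.

```python
import math


def ncc(x, y):
    """Zero-lag, zero-mean normalized cross-correlation (assumed form of NCC)."""
    n = min(len(x), len(y))
    if n == 0:
        return 0.0
    x, y = x[:n], y[:n]
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    denom = math.sqrt(sum(v * v for v in dx) * sum(v * v for v in dy))
    return 0.0 if denom == 0.0 else sum(a * b for a, b in zip(dx, dy)) / denom


def fundamental_wave(segments):
    """Step J3: point-wise arithmetic mean of the segments F_j of one group."""
    count = len(segments)
    return [sum(column) / count for column in zip(*segments)]


def grouping_boundary(segments, wave, keep=0.95):
    """Step J4: smallest similarity among the top `keep` fraction of segments.

    Assumption: the number of kept segments is rounded down, with at least one.
    """
    sims = sorted((ncc(seg, wave) for seg in segments), reverse=True)
    kept = max(1, math.floor(len(sims) * keep))
    return sims[kept - 1]


def candidate_groups(segment, waves, boundaries):
    """Step J5, membership test only: groups whose boundary B_k the segment reaches."""
    return [k for k, wave in waves.items() if ncc(segment, wave) >= boundaries[k]]


# Hypothetical groups, as if produced by the clustering of step J2.
groups = {
    "rise": [[0, 1, 2, 3], [0, 1, 2, 4], [0, 2, 2, 3]],
    "fall": [[3, 2, 1, 0], [4, 2, 1, 0], [3, 2, 2, 0]],
}
waves = {k: fundamental_wave(segs) for k, segs in groups.items()}
boundaries = {k: grouping_boundary(groups[k], waves[k]) for k in groups}

rising = candidate_groups([1, 2, 3, 4], waves, boundaries)
falling = candidate_groups([4, 3, 2, 1], waves, boundaries)
```

Because NCC removes the mean and normalizes the amplitude, the segment `[1, 2, 3, 4]` matches the "rise" group even though its level differs from every member of that group.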
8. The method of claim 6, wherein step A7 is replaced with: replacing each similarity value in the similarity matrix that is greater than the inflection point with 1, and each similarity value that is lower than the inflection point with 0;
and marking adjacent clusters whose similarity in the resulting similarity matrix is 1 as the same similar group, and counting the number of clusters in each similar group.
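The binarization of claim 8 can be sketched as below. It is a sketch under stated assumptions, not the claimed implementation. "Adjacent clusters" is read here as clusters connected through entries equal to 1, so similar groups are the connected components of the binary matrix. A value exactly equal to the inflection point is mapped to 0, which the claim leaves open. The matrix values and function names are hypothetical.

```python
def binarize(sim, knee):
    """Replace similarities above the inflection point with 1, all others with 0."""
    return [[1 if value > knee else 0 for value in row] for row in sim]


def similar_groups(binary):
    """Group cluster indices that are linked, directly or transitively, by a 1.

    Assumption: this connected-component reading of "adjacent clusters".
    """
    n = len(binary)
    seen = set()
    groups = []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, group = [start], []
        while stack:
            i = stack.pop()
            group.append(i)
            for j in range(n):
                if j not in seen and binary[i][j] == 1:
                    seen.add(j)
                    stack.append(j)
        groups.append(sorted(group))
    return groups


# Hypothetical similarity matrix for four clusters and a hypothetical knee.
sim = [
    [1.00, 0.90, 0.10, 0.20],
    [0.90, 1.00, 0.85, 0.10],
    [0.10, 0.85, 1.00, 0.30],
    [0.20, 0.10, 0.30, 1.00],
]
binary = binarize(sim, knee=0.5)
groups = similar_groups(binary)
largest = max(groups, key=len)  # step A8 uses the group with the most clusters
```

Clusters 0 and 2 are not directly similar, yet both are similar to cluster 1, so under this reading all three land in one similar group.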
9. The method of claim 6, wherein the step of grouping the bands of the KPI curve window segments in step A9 is: using the NCC algorithm, calculating one by one the similarity between each fundamental wave obtained in step A1 and each band in each window of each log KPI curve to obtain
Figure 451498DEST_PATH_IMAGE015
sorting the similarities in descending order; among the bands whose waveform similarity ranks in the top 95%, taking the minimum waveform similarity as the grouping boundary line B_k of the group; judging, with the grouping boundary line of each group as the reference, whether each log KPI curve data set
Figure 435634DEST_PATH_IMAGE012
belongs to that group; for a log KPI curve data set that belongs to several groups at once
Figure 952066DEST_PATH_IMAGE012
ranking those groups by the classification score
Figure 792983DEST_PATH_IMAGE016
and assigning the log KPI curve data set M_i to the group with the smallest classification score
Figure 660445DEST_PATH_IMAGE016
thereby forming a label chain of fundamental wave labels and obtaining the mode waveforms of the different KPIs, the mode waveforms being called KPI curve code pattern rearrangement tables,
Figure 245010DEST_PATH_IMAGE017
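The labelling of bands in step A9 and claim 9 can be sketched as follows. It is an illustration under assumptions, not the claimed method. Widths are given in samples rather than minutes. A band that reaches several grouping boundaries is given the label of the fundamental wave with the highest NCC, which is a stand-in for the classification score (that score appears only as a formula image here). A band that reaches no boundary is labelled `None`. All names and sample data are hypothetical.

```python
import math


def ncc(x, y):
    """Zero-lag, zero-mean normalized cross-correlation (assumed form of NCC)."""
    n = min(len(x), len(y))
    if n == 0:
        return 0.0
    x, y = x[:n], y[:n]
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    denom = math.sqrt(sum(v * v for v in dx) * sum(v * v for v in dy))
    return 0.0 if denom == 0.0 else sum(a * b for a, b in zip(dx, dy)) / denom


def split(values, width):
    """Cut a sequence into consecutive chunks of `width` samples; a short tail is dropped."""
    return [values[i:i + width] for i in range(0, len(values) - width + 1, width)]


def label_chains(curve, window, band, waves, boundaries):
    """Return one label chain per window: the fundamental wave label of each band."""
    chains = []
    for segment in split(curve, window):
        chain = []
        for piece in split(segment, band):
            scored = [(ncc(piece, wave), label) for label, wave in waves.items()]
            admitted = [(s, label) for s, label in scored if s >= boundaries[label]]
            # Stand-in for the classification score: keep the most similar wave.
            chain.append(max(admitted)[1] if admitted else None)
        chains.append(chain)
    return chains


# Hypothetical fundamental waves, boundaries and one KPI curve.
waves = {"up": [0, 1, 2, 3], "down": [3, 2, 1, 0]}
boundaries = {"up": 0.9, "down": 0.9}
curve = [0, 1, 2, 3, 3, 2, 1, 0, 5, 5, 5, 5, 1, 2, 3, 4]
chains = label_chains(curve, window=8, band=4, waves=waves, boundaries=boundaries)
```

The flat band `[5, 5, 5, 5]` correlates with neither wave and therefore stays unlabelled in this sketch.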
10. The method of claim 6, wherein, after all label chains are arranged along the time dimension, causal relationships between different label chains occurring at different times are discovered using a sequence mining algorithm, SPADE or GSP.
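The idea behind claim 10 can be illustrated with a much simpler stand-in than SPADE or GSP. The sketch below only counts the support of ordered label pairs, that is, how many chains contain label a somewhere before label b. It implements neither algorithm. The function names, the chains and the support threshold are hypothetical.

```python
def pair_support(chains):
    """Fraction of chains in which label a occurs somewhere before label b."""
    counts = {}
    for chain in chains:
        pairs = set()
        for i, a in enumerate(chain):
            for b in chain[i + 1:]:
                pairs.add((a, b))
        for pair in pairs:  # each chain contributes at most once per pair
            counts[pair] = counts.get(pair, 0) + 1
    total = len(chains)
    return {pair: c / total for pair, c in counts.items()} if total else {}


def frequent_pairs(chains, min_support):
    """Ordered pairs whose support reaches the threshold, in sorted order."""
    support = pair_support(chains)
    return sorted(pair for pair, s in support.items() if s >= min_support)


# Hypothetical label chains, one per time-aligned window.
chains = [
    ["A", "B", "C"],
    ["A", "C"],
    ["B", "A", "C"],
]
support = pair_support(chains)
frequent = frequent_pairs(chains, min_support=0.6)
```

Pairs found this way are only candidates for causal relationships: a frequent ordering shows that one label tends to precede another, and a full sequence mining algorithm would extend such pairs to longer patterns.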
CN202210292597.6A 2022-03-18 2022-03-24 Method for generating KPI curve and marking wave band characteristics based on log event relation Active CN114398898B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210292597.6A CN114398898B (en) 2022-03-24 2022-03-24 Method for generating KPI curve and marking wave band characteristics based on log event relation
PCT/CN2023/082359 WO2023174431A1 (en) 2022-03-18 2023-03-17 Kpi curve data processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210292597.6A CN114398898B (en) 2022-03-24 2022-03-24 Method for generating KPI curve and marking wave band characteristics based on log event relation

Publications (2)

Publication Number Publication Date
CN114398898A true CN114398898A (en) 2022-04-26
CN114398898B CN114398898B (en) 2022-06-24

Family

ID=81234703

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210292597.6A Active CN114398898B (en) 2022-03-18 2022-03-24 Method for generating KPI curve and marking wave band characteristics based on log event relation

Country Status (1)

Country Link
CN (1) CN114398898B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130081056A1 (en) * 2011-09-23 2013-03-28 Avaya Inc. System and method for aligning messages to an event based on semantic similarity
CN105573977A (en) * 2015-10-23 2016-05-11 苏州大学 Method and system for identifying Chinese event sequential relationship
CN110210019A (en) * 2019-05-21 2019-09-06 四川大学 A kind of event argument abstracting method based on recurrent neural network
WO2019172848A1 (en) * 2018-03-06 2019-09-12 Agency For Science, Technology And Research Method and apparatus for predicting occurrence of an event to facilitate asset maintenance
CN111177505A (en) * 2019-12-31 2020-05-19 中国移动通信集团江苏有限公司 Training method, recommendation method and device of index anomaly detection model
CN111738308A (en) * 2020-06-03 2020-10-02 浙江中烟工业有限责任公司 Dynamic threshold detection method for monitoring index based on clustering and semi-supervised learning
CN112966079A (en) * 2021-03-02 2021-06-15 中国电子科技集团公司第二十八研究所 Event portrait oriented text analysis method for dialog system
CN113312447A (en) * 2021-03-10 2021-08-27 天津大学 Semi-supervised log anomaly detection method based on probability label estimation
CN113326244A (en) * 2021-05-28 2021-08-31 中国科学技术大学 Abnormity detection method based on log event graph and incidence relation mining
CN113723452A (en) * 2021-07-19 2021-11-30 山西三友和智慧信息技术股份有限公司 Large-scale anomaly detection system based on KPI clustering
CN114202009A (en) * 2021-09-27 2022-03-18 南开大学 Medical equipment performance index abnormity detection method and device based on PU learning

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023174431A1 (en) * 2022-03-18 2023-09-21 三峡智控科技有限公司 Kpi curve data processing method
CN116405551A (en) * 2023-04-14 2023-07-07 廊坊风川科技有限公司 Social platform-based data pushing method and system and cloud platform
CN116405551B (en) * 2023-04-14 2024-03-29 深圳市优友网络科技有限公司 Social platform-based data pushing method and system and cloud platform

Also Published As

Publication number Publication date
CN114398898B (en) 2022-06-24

Similar Documents

Publication Publication Date Title
CN109885692B (en) Knowledge data storage method, apparatus, computer device and storage medium
US11449673B2 (en) ESG-based company evaluation device and an operation method thereof
CN114398898B (en) Method for generating KPI curve and marking wave band characteristics based on log event relation
CN110196906B (en) Deep learning text similarity detection method oriented to financial industry
CN114398891B (en) Method for generating KPI curve and marking wave band characteristics based on log keywords
CN105117426B (en) A kind of intellectual coded searching method of customs
CN102411563A (en) Method, device and system for identifying target words
CN114386538B (en) Method for marking wave band characteristics of KPI (Key performance indicator) curve of monitoring index
CN111506637B (en) Multi-dimensional anomaly detection method and device based on KPI (Key Performance indicator) and storage medium
CN111860981B (en) Enterprise national industry category prediction method and system based on LSTM deep learning
CN106528527A (en) Identification method and identification system for out of vocabularies
US11093864B1 (en) Distributable feature analysis and tree model training system
CN116737967B (en) Knowledge graph construction and perfecting system and method based on natural language
CN110866169B (en) Learning-based Internet of things entity message analysis method
CN112800232A (en) Big data based case automatic classification and optimization method and training set correction method
Hussain et al. Design and analysis of news category predictor
CN112905793A (en) Case recommendation method and system based on Bilstm + Attention text classification
CN114880584B (en) Generator set fault analysis method based on community discovery
CN110941713B (en) Self-optimizing financial information block classification method based on topic model
CN111159328A (en) Information knowledge fusion system and method
CN113610112B (en) Auxiliary decision-making method for aircraft assembly quality defects
CN115994531A (en) Multi-dimensional text comprehensive identification method
WO2023174431A1 (en) Kpi curve data processing method
CN113901223B (en) Method, device, computer equipment and storage medium for generating enterprise classification model
CN116579344B (en) Case main body extraction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant