CN114398898B - Method for generating KPI curve and marking wave band characteristics based on log event relation - Google Patents

Info

Publication number
CN114398898B
Authority
CN
China
Prior art keywords
log
kpi
similarity
event
curve
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210292597.6A
Other languages
Chinese (zh)
Other versions
CN114398898A (en)
Inventor
戴曦
尹立超
徐旭朝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Three Gorges Zhikong Technology Co ltd
Original Assignee
Three Gorges Zhikong Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Three Gorges Zhikong Technology Co ltd filed Critical Three Gorges Zhikong Technology Co ltd
Priority to CN202210292597.6A priority Critical patent/CN114398898B/en
Publication of CN114398898A publication Critical patent/CN114398898A/en
Application granted granted Critical
Publication of CN114398898B publication Critical patent/CN114398898B/en
Priority to PCT/CN2023/082359 priority patent/WO2023174431A1/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for generating a KPI curve based on a log event relation and marking waveband features, which comprises the steps of firstly generating a log KPI curve according to the relation of events in a log, then dividing the KPI curve into a plurality of wavebands with equal lengths, clustering the wavebands into a plurality of clusters according to the non-time dimension of the wavebands, extracting the fundamental wave of each cluster, comparing the similarity of each waveband data of each cluster and the fundamental wave, finding out the grouping boundary line of each cluster, grouping each waveband data of each cluster, extracting the total time length of continuous similar wavebands in each cluster, and taking the maximum value of the total time length as the width of a sliding window. The window is used for segmenting the KPI curve, so that the wave bands in each segmented window are easy to cluster and classify, the whole KPI curve is favorably and rapidly divided into wave band chains consisting of different types of wave bands, then the KPI curve of an individual monitoring index is subjected to periodic detection and type detection marking, the individual KPI curve is segmented by using the window, and the wave bands in the fundamental wave KPI curve are used for grouping and labeling.

Description

Method for generating KPI curve and marking wave band characteristics based on log event relation
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a method for generating a KPI curve and marking wave band characteristics based on log event relation.
Background
Anomaly detection (also known as outlier detection) is a detection process that finds objects whose behavior differs from what is expected; such objects are called anomalies or outliers. Anomaly detection methods generally include statistical models, distance-based models, linear-transformation models, nonlinear-transformation models, machine-learning models, and the like.
KPIs (key performance indicators) are monitoring metrics (e.g., delay and throughput in a network) for objects such as services and systems. They are stored as a sequence ordered by time of occurrence, i.e., what is commonly called a time series. Anomaly detection on a time series checks, through analysis of historical data, whether the current data deviates significantly from the normal condition. KPI anomaly detection is very important: by monitoring KPI data in real time, anomalies in the KPI data are discovered and handled promptly, which ensures the normal operation of the application.
Methods for performing real-time anomaly detection by setting a threshold value for KPI data are quite common, but methods for performing real-time anomaly detection for system logs have not been reported publicly.
In pursuit of effectiveness, traditional machine learning mostly adopts supervised learning, which improves the accuracy of model output through massive labeled samples. In practice, however, anomaly labels are difficult to obtain in batches, so a large number of business experts are needed to label KPI curves manually, and the labels often require repeated adjustment and correction, which is time-consuming and labor-intensive. Millions or even tens of millions of KPIs may need to be monitored at the same time, so no single algorithm can be found in actual anomaly detection practice that meets all of these requirements and solves the above challenges simultaneously. Clustering and other common unsupervised learning techniques are mainly used for scenarios such as feature discovery and data exploration; because labels are lacking, the results can be mapped to business patterns only after interpretation by a data scientist and cannot be acted on directly. Weakly supervised approaches introduce unsupervised and supervised methods in stages and improve accuracy through iterative recursion; they are too academic to put into practice, and, in order to fuse the specific methods, vector representations must be adopted to unify the representation among different methods, which makes the results difficult for application personnel to understand.
The larger the data volume and the more complex the business scenario, the more complex the way such methods must be introduced and the more varied the cost and manpower required. This cycle directly limits the spread of machine learning across industries and concentrates it in industries with higher returns. Traditional industries therefore give up active adoption and defend passively, trailing the average level of the industry as a whole and relying on the migration of business scenarios. Specifically, if a method proves particularly effective in another industry, they borrow it when they have spare capacity, observe the effect, and adopt it if feasible. One such passively defending industry is the industrial application scenario.
Disclosure of Invention
The first purpose of the invention is to provide a method for generating KPI curves and marking wave band characteristics based on log event relations, which processes text logs generated by monitoring indexes in an industrial control system, combines highly correlated events into a same group, and generates log KPI curves periodically correlated with the KPI curves of the monitored indexes.
The technical scheme of the invention is as follows: a method for generating KPI curves based on log event relations comprises the following steps:
step F1, setting a training sentence subset consisting of training sentences, obtaining a fault log by the industrial control equipment in the same industrial control system based on monitoring indexes, forming a sentence pair to be processed by the corpora in the fault log and each training sentence respectively, calculating the similarity, and deleting the corpora with the similarity lower than a threshold value one;
step F2. performing word segmentation on the corpus remaining after step F1, generating a word segmentation queue consisting of a plurality of feature words, and labeling the part of speech of each feature word to obtain a part-of-speech queue of the corpus;
step F3. if the part-of-speech queue contains a plurality of special feature words corresponding to special parts of speech, obtaining the boundary and category of the named entity from the plurality of special feature words by using a named entity recognition model, and updating the parts of speech of the special feature words in the part-of-speech queue to the boundary and category of the named entity to obtain an updated part-of-speech queue, wherein the special parts of speech include numerals and time words;
step F4. classifying the remaining corpus according to the labels assigned in step F3, counting the occurrence frequency of each category of part-of-speech queue and sorting in descending order, selecting the part-of-speech queues whose rank is greater than threshold value two, counting the occurrence frequency of each verb and noun in each category of part-of-speech queue and sorting in descending order, screening from the sorted sequences, according to a ranking threshold, the two top-ranked sets of part-of-speech queues by verb frequency and by noun frequency, extracting the corpus corresponding to the intersection of the two sets, and constructing a real training set;
step F5. screening, from the corpus of the real training set, the word segmentation queues whose part-of-speech tag combination is [n, v, n], where n denotes a noun and v denotes a verb, and extracting the first and second segmented words whose parts of speech are noun and proper noun as event 1 and event 2 respectively to form an event tuple;
step F6. based on the existing fault event relationship table, using the Snowball algorithm to find the event association rules of the event tuples, and finding the associated event groups in the event tuples according to the event association rules, i.e., generating a log key event relationship table;
step F7. repeating step F6 based on the log key event relationship table until convergence;
step F8. using each event relation generated in step F7 as a log key event label to mark the fault log, using the frequency with which each log key event label appears per minute as a monitoring index, establishing each log KPI curve, and smoothing each log KPI curve with a Gaussian kernel.
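Step F8 can be illustrated with a minimal sketch. It assumes event timestamps given in seconds and a truncated Gaussian kernel whose weights are renormalized at the edges; the function names, `sigma`, and `radius` are illustrative choices and are not specified by the source.

```python
import math
from collections import Counter

def per_minute_counts(timestamps, start, minutes):
    """Count how many labeled events fall into each one-minute bin."""
    counts = Counter(int((t - start) // 60) for t in timestamps)
    return [counts.get(m, 0) for m in range(minutes)]

def gaussian_smooth(series, sigma=1.0, radius=3):
    """Smooth a series with a truncated Gaussian kernel (assumed parameters)."""
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-0.5 * (k / sigma) ** 2) for k in offsets]
    smoothed = []
    for i in range(len(series)):
        num = den = 0.0
        for k, w in zip(offsets, kernel):
            j = i + k
            if 0 <= j < len(series):  # ignore taps that fall outside the series
                num += w * series[j]
                den += w
        smoothed.append(num / den)
    return smoothed

# Hypothetical occurrence times (seconds) of one log key event label.
events = [5, 20, 61, 62, 70, 185]
raw = per_minute_counts(events, start=0, minutes=4)
smooth = gaussian_smooth(raw)
```

Here `raw` is the per-minute frequency series and `smooth` is the corresponding smoothed log KPI curve.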
Advantageously, the same industrial control system is composed of industrial control devices that have a direct or indirect material supply relationship, electric energy transfer relationship, thermal energy transfer relationship, mechanical energy transfer relationship, magnetic field transfer relationship, energy conversion relationship or signal control relationship. The industrial control devices in the same industrial control system obtain fault logs based on monitoring indexes, and because the monitoring indexes are correlated, the fault logs are correlated as well. Each record of a monitoring index in the logs differs partially in text, so direct clustering would require a large amount of manual indexing and screening work. Log texts that describe the behavior or state of a device have similar sentence structures and similar part-of-speech queue characteristics; steps F1 to F4 screen out the texts with similar part-of-speech queues and eliminate the log texts that do not record the behavior or state of a device. Nouns in such text often have a specific logical association with one another, and according to this association, highly related event relationships can be combined into the same group to generate a log KPI curve that is periodically correlated with the KPI curve of the monitored index.
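The tuple extraction of step F5 can be sketched as a scan for the part-of-speech pattern [n, v, n]. This is a minimal illustration that assumes the log sentence has already been segmented and tagged; the sample sentence and the tag `p` are hypothetical.

```python
def extract_event_tuples(tagged):
    """Return (event 1, event 2) for every [n, v, n] window in a tagged sentence.

    tagged is a list of (word, part_of_speech) pairs.
    """
    tuples = []
    for i in range(len(tagged) - 2):
        (w1, p1), (_, p2), (w3, p3) = tagged[i:i + 3]
        if (p1, p2, p3) == ("n", "v", "n"):
            tuples.append((w1, w3))
    return tuples

# Hypothetical segmented and tagged log sentence.
sentence = [("pump", "n"), ("triggered", "v"), ("alarm", "n"), ("again", "p")]
pairs = extract_event_tuples(sentence)
```

Each extracted pair is a candidate event tuple for the relation discovery in step F6.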
Further, the calculating of the similarity in step F1 includes the steps of: respectively segmenting the sentences in the sentence pairs based on a pre-constructed corpus, wherein the pre-constructed corpus comprises an industry corpus and a common corpus;
and converting each characteristic word of the sentence after word segmentation into a word vector, respectively calculating the similarity of each sentence pair by using cosine similarity, and deleting the corpus if the similarity is lower than a threshold value one.
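The similarity filter of step F1 can be sketched as follows. The sketch assumes that a sentence vector is the average of its word vectors and that a corpus sentence is kept when its best similarity to any training sentence reaches the threshold; the source does not fix either choice, and the two-dimensional embeddings are toy values.

```python
import math

def sentence_vector(words, embeddings):
    """Average the word vectors of the known words (an assumed pooling choice)."""
    known = [embeddings[w] for w in words if w in embeddings]
    dim = len(next(iter(embeddings.values())))
    if not known:
        return [0.0] * dim
    return [sum(v[d] for v in known) / len(known) for d in range(dim)]

def cosine(a, b):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def filter_corpus(corpus, training, embeddings, threshold):
    """Keep sentences whose best similarity to a training sentence >= threshold."""
    train_vecs = [sentence_vector(t, embeddings) for t in training]
    kept = []
    for sent in corpus:
        vec = sentence_vector(sent, embeddings)
        if max(cosine(vec, tv) for tv in train_vecs) >= threshold:
            kept.append(sent)
    return kept

# Toy embeddings and sentences, purely illustrative.
emb = {"pump": [1.0, 0.0], "fault": [0.9, 0.1], "valve": [0.8, 0.2],
       "login": [0.0, 1.0], "user": [0.1, 0.9]}
training = [["pump", "fault"]]
corpus = [["valve", "fault"], ["user", "login"]]
kept = filter_corpus(corpus, training, emb, threshold=0.8)
```

With these toy values only the device-related sentence survives the filter.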
Further, step F8 is followed by:
step Z01. extracting a spectrum intensity graph of the log KPI curve by using the Fourier transform;
step Z02. extracting the point with the highest amplitude and calculating the corresponding period, namely the period to be checked;
step Z03. setting an assumed period, namely the expected period; checking the correlation strength of the period to be checked if and only if its length is within 95% to 105% of the expected period; determining the period to be checked to be a period meeting the requirement if the spectrum intensity is sufficient; and marking the filtered log KPI curves according to their periodicity differences, the mark being called the log KPI curve period label.
The period check marks each waveform as periodic or non-periodic. A periodic mark indicates that regular, repeated events exist; in terms of business knowledge, such information usually corresponds to service information such as state detection and rotating parts. A non-periodic mark, by contrast, indicates event-type business. Both are service labels used in other steps and are not involved in other operations. Similarity between periodic KPIs probably arises for various incidental reasons without any business relation, whereas similar non-periodic KPIs are more likely to have a direct or indirect relation.
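The period check can be sketched with a plain discrete Fourier transform. The sketch assumes that "spectrum intensity is sufficient" means the peak amplitude reaches a caller-supplied `min_amplitude`; that threshold and the function names are illustrative assumptions.

```python
import cmath
import math

def dft_amplitudes(series):
    """Return (frequency index k, amplitude) for k = 1 .. n // 2."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]  # remove the constant component
    amps = []
    for k in range(1, n // 2 + 1):
        s = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(centered))
        amps.append((k, abs(s) / n))
    return amps

def check_period(series, expected_period, min_amplitude):
    """Return (period to be checked, True if it passes the 95%-105% test)."""
    k, amp = max(dft_amplitudes(series), key=lambda pair: pair[1])
    candidate = len(series) / k
    in_range = 0.95 * expected_period <= candidate <= 1.05 * expected_period
    return candidate, in_range and amp >= min_amplitude

# Synthetic curve with a 10-minute period over 60 minutes.
curve = [math.sin(2 * math.pi * t / 10) for t in range(60)]
period, is_periodic = check_period(curve, expected_period=10, min_amplitude=0.1)
```

A curve that fails the test would receive the non-periodic label.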
Further, step Z03 is followed by:
step Z04. calculating the pairwise similarity of the log KPI curves by using the NCC algorithm, expanding the similarities into a diagonal similarity matrix, and filling the similarities into the similarity matrix, wherein the row and column indices of the matrix are the numbers of the log KPI curves, and the number of rows and columns of the similarity matrix is the number of log KPI curves;
step Z05. outputting different clusters from the similarity matrix by using a spectral clustering algorithm, and marking the different clusters with different log KPI curve labels, called KPI curve service labels.
Advantageously, the KPI curves are clustered and classified according to overall similarity of the KPI curves to form clusters with similar waveforms.
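The similarity matrix of step Z04 can be sketched as below. The sketch assumes NCC means the zero-lag, mean-removed normalized cross-correlation of two equal-length curves; the source does not spell out the variant, and the sample curves are hypothetical.

```python
import math

def ncc(a, b):
    """Zero-lag normalized cross-correlation of two equal-length series."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    da = [x - ma for x in a]
    db = [y - mb for y in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    return sum(x * y for x, y in zip(da, db)) / denom if denom else 0.0

def similarity_matrix(curves):
    """Square matrix whose entry (i, j) is the NCC of curve i and curve j."""
    n = len(curves)
    sim = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sim[i][j] = sim[j][i] = ncc(curves[i], curves[j])
    return sim

# Three hypothetical log KPI curves of equal length.
curves = [[0, 1, 0, 1], [0, 2, 0, 2], [1, 0, 1, 0]]
sim = similarity_matrix(curves)
```

A spectral clustering implementation that accepts a precomputed affinity matrix could take `sim` as input after the values are rescaled to be non-negative.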
Further, step F6 includes:
step C1. using the existing fault event relation table, matching the queues in the event tuples that contain events in the fault event relation table, and generating templates; a template is a five-tuple consisting of <left>, event 1 type, <middle>, event 2 type and <right>; len is a length that can be set arbitrarily, <left> is a vector representation of the len words to the left of event 1, <middle> is a vector representation of the words between event 1 and event 2, and <right> is a vector representation of the len words to the right of event 2;
step C2. clustering the generated templates: templates whose similarity is greater than threshold value three are clustered into one class, a new template is generated from each class by averaging, and the new template is added to a rule base that stores the templates. Following step C1, a template can be written as $P = \langle \vec{l}, E_1, \vec{m}, E_2, \vec{r} \rangle$, where $E_1$ and $E_2$ denote the event 1 type and the event 2 type of template $P$, $\vec{l}$ is the vector representation of the three words to the left of $E_1$, $\vec{m}$ is the vector representation of the words between $E_1$ and $E_2$, and $\vec{r}$ is the vector representation of the three words to the right of $E_2$. For the similarity between templates, take template 1, $P_1 = \langle \vec{l}_1, E_1, \vec{m}_1, E_2, \vec{r}_1 \rangle$, and template 2, $P_2 = \langle \vec{l}_2, E_1', \vec{m}_2, E_2', \vec{r}_2 \rangle$. If the condition $E_1 = E_1' \wedge E_2 = E_2'$ is satisfied, i.e., the event 1 type $E_1$ of template $P_1$ is the same as the event 1 type $E_1'$ of template $P_2$, and the event 2 type $E_2$ of template $P_1$ is the same as the event 2 type $E_2'$ of template $P_2$, then the similarity between template $P_1$ and template $P_2$ can be calculated as
$$\mathrm{Sim}(P_1, P_2) = \mu_1\, \vec{l}_1 \cdot \vec{l}_2 + \mu_2\, \vec{m}_1 \cdot \vec{m}_2 + \mu_3\, \vec{r}_1 \cdot \vec{r}_2$$
where $\mu_1$, $\mu_2$ and $\mu_3$ are weights. Because $\vec{m}_1 \cdot \vec{m}_2$ greatly influences the result of the similarity calculation between templates, the weights can be set so that $\mu_2 > \mu_1 > \mu_3$. If the condition $E_1 = E_1' \wedge E_2 = E_2'$ is not satisfied, the similarity between template $P_1$ and template $P_2$ is 0;
step C3. calculating, one by one, the similarity between the event tuple templates obtained in step C1 and the templates in the rule base; discarding the templates whose similarity is smaller than threshold value three; and adding the events in the templates whose similarity is greater than threshold value three to the log key event relation table, which replaces the fault event relation table.
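The template matching of steps C2 and C3 can be sketched as follows. The sketch assumes the Snowball-style form of the similarity, a weighted sum of dot products of the left, middle and right context vectors, with illustrative weights that favor the middle context; the class name, field names, and sample vectors are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Template:
    left: List[float]    # vector for the words left of event 1
    e1_type: str         # event 1 type
    middle: List[float]  # vector for the words between the two events
    e2_type: str         # event 2 type
    right: List[float]   # vector for the words right of event 2

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def template_similarity(p1, p2, mu=(0.3, 0.5, 0.2)):
    """Weighted context similarity; 0 when the event types do not match."""
    if p1.e1_type != p2.e1_type or p1.e2_type != p2.e2_type:
        return 0.0
    return (mu[0] * dot(p1.left, p2.left)
            + mu[1] * dot(p1.middle, p2.middle)
            + mu[2] * dot(p1.right, p2.right))

# Hypothetical templates with unit-length context vectors.
t1 = Template([1.0, 0.0], "DEVICE", [0.0, 1.0], "ALARM", [1.0, 0.0])
t2 = Template([1.0, 0.0], "DEVICE", [0.0, 1.0], "ALARM", [0.0, 1.0])
t3 = Template([1.0, 0.0], "DEVICE", [0.0, 1.0], "STATE", [1.0, 0.0])
score_same_types = template_similarity(t1, t2)
score_other_types = template_similarity(t1, t3)
```

A template whose score against the rule base falls below threshold value three would be discarded in step C3.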
The invention also aims to provide a method for marking waveband characteristics by the KPI curve, which comprises the steps of dividing the KPI curve into a plurality of wavebands with equal length, clustering into a plurality of clusters according to the non-time dimension of the wavebands, extracting the fundamental wave of each cluster, comparing the similarity between each waveband data of each cluster and the fundamental wave, finding out the grouping boundary line of each cluster, grouping each waveband data of each cluster, extracting the total time length of continuous similar wavebands in each cluster, and taking the maximum value of the total time length as the width of a sliding window. The window is used for partitioning the log KPI curve, so that the wave bands in each partitioned window are easy to cluster and classify, the whole log KPI curve is favorably and rapidly divided into wave band chains consisting of different types of wave bands, then the log KPI curve of an individual monitoring index is subjected to periodic detection and type detection marking, the individual log KPI curve is partitioned by using the window, and the wave bands in the log KPI curve are subjected to grouping and labeling by using fundamental waves.
The method for marking the wave band characteristics of the KPI curve obtained by the method comprises the following steps:
step A1, merging data points of all minutes in all log KPI curves, dividing the data points into a plurality of band segments with time width of s minutes, clustering the band segments into a plurality of clusters according to non-time dimensions of the band segments, extracting fundamental waves of all the clusters, comparing similarity between band data of all the clusters and the fundamental waves, finding out grouping boundary lines of all the clusters, and grouping the band data of all the clusters;
step A2. extracting the time stamps of all sections of log KPI curve data sets which are divided into different groups to obtain a time stamp list of each group;
step A3, performing step-by-step subtraction on the timestamp lists of each group, namely subtracting the starting timestamp of the next item in each timestamp list from the starting timestamp of the current item to obtain an event trigger interval list;
step A4, combining the event trigger intervals of each cluster into a time interval KPI set, and calculating the similarity between the time interval KPI sets of each cluster according to NCC;
step A5, expanding the similarity of the time interval KPI sets among the clusters obtained in the step A4 into a similarity matrix;
step A6. sorting the similarities of the time interval KPI sets among the clusters by numerical value, fitting the similarity values into a smooth line, and obtaining the boundary of the similarity of the time interval KPI sets among the clusters by the knee point method;
step A7. marking adjacent clusters whose values in the similarity matrix are larger than the inflection point as the same similar group, and counting the number of clusters in each similar group;
step A8., calculating the total time interval of the group with the most clusters in the similarity group as the width of the sliding window;
step A9. dividing each log KPI curve, with the sliding window obtained in step A8, into a plurality of log KPI curve window segments whose time width is the total time interval, and dividing each window segment, by the dividing method of step A1, into i segments of log KPI curve data sets $M'_i$ with a time width of 1 minute, each segment being a band;
comparing, one by one, each fundamental wave obtained in step A1 with each band in each window of each log KPI curve, sorting the similarities from large to small, finding the grouping boundary lines from the sorted order, grouping the bands to form a label chain composed of fundamental wave labels, and obtaining the mode waveforms of the different KPIs, which are called KPI curve code pattern rearrangement tables;
step A10. placing the different KPI curve pattern rearrangement tables together along the time dimension to obtain a KPI curve pattern rearrangement association table.
Advantageously, the label information obtained after the log KPI curve is processed contains all the information of all bands and comprises two parts, band labels and waveform labels: the band labels are the fundamental wave types and the time arrangement of the fundamental wave labels, and the waveform labels comprise service labels and period labels.
Different KPI curves may have causal relationships if they carry the same KPI curve service label, and the probability is higher for non-periodic KPI curves than for periodic KPI curves.
Different KPI curves may have causal relationships if the same KPI curve pattern fundamental signature is present in adjacent time segments, with a higher probability for more repetitions.
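Steps A3 and A6 can be sketched as below. The interval computation follows the step-by-step subtraction directly; for the knee point, the sketch assumes the common "farthest point from the chord" heuristic on the values sorted in descending order, which is only one possible reading of the knee point method named in the source. The sample numbers are hypothetical.

```python
def trigger_intervals(start_timestamps):
    """Differences between consecutive start timestamps of one group."""
    ts = sorted(start_timestamps)
    return [b - a for a, b in zip(ts, ts[1:])]

def knee_point(values):
    """Return (index, value) of the knee in the descending-sorted values.

    Assumed heuristic: the point farthest from the straight line that joins
    the first and last points of the sorted curve.
    """
    ys = sorted(values, reverse=True)
    n = len(ys)
    if n < 3:
        return 0, ys[0]
    x1, y0, y1 = n - 1, ys[0], ys[-1]

    def distance(i):
        # Unnormalized point-to-line distance; enough for ranking.
        return abs((y1 - y0) * i - x1 * ys[i] + x1 * y0)

    best = max(range(n), key=distance)
    return best, ys[best]

# Hypothetical start timestamps (seconds) and inter-cluster similarities.
intervals = trigger_intervals([0, 600, 1200, 1900])
knee = knee_point([0.95, 0.93, 0.9, 0.3, 0.25, 0.2])
```

The knee value then serves as the boundary used in step A7.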
Further, step A1 includes the following steps:
step J1. extracting the per-minute data point sets of all log KPI curves into the same curve set $L$, and dividing the curve set $L$ into a plurality of log KPI curve data sets $M_i$ with a time width of s minutes, where $i$ is the segment number;
step J2. using the DBSCAN algorithm, calculating the Euclidean distance between the segments from the attributes of each segment of log KPI curve data set, and clustering the i segments of log KPI curve data sets to obtain k clusters and abnormal items, wherein each cluster is a grouped data set and each grouped data set comprises j segments of log KPI curve data sets $F_j$;
step J3. calculating the arithmetic mean $\sum F_j / j$ of the j segments of log KPI curve data sets in each grouped data set as the fundamental wave of the group;
step J4. using the NCC algorithm, calculating the waveform similarity between each segment of log KPI curve data set $F_j$ of each grouped data set and the fundamental wave, sorting from large to small, and, among the log KPI curve data sets $F_j$ whose waveform similarity ranks in the top 95%, taking the minimum waveform similarity as the grouping boundary line $B_k$ of the group;
step J5. using the NCC algorithm, calculating the waveform similarity $NCC_{M_i-J_k}$ between each segment of log KPI curve data set $M_i$ and the fundamental wave of each group, judging with the grouping boundary line of each group as the reference whether each segment belongs to that group, ranking a segment that belongs to several groups simultaneously by the classification score $Q$, and assigning the log KPI curve data set $M_i$ to the group with the minimum classification score $Q$ to obtain the grouping information of each log KPI curve data set,
$$Q = \left(\frac{1 - NCC_{M_i-J_k}}{1 - B_k}\right)^2$$
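Steps J3 to J5 can be sketched as below. The sketch assumes the clusters from step J2 are already available, uses a zero-lag, mean-removed NCC (an assumed variant), and guards against a boundary equal to 1; the group names and segment values are hypothetical.

```python
import math

def ncc(a, b):
    """Zero-lag normalized cross-correlation of two equal-length series."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    da = [x - ma for x in a]
    db = [y - mb for y in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    return sum(x * y for x, y in zip(da, db)) / denom if denom else 0.0

def fundamental_wave(segments):
    """Step J3: point-wise arithmetic mean of the segments of one group."""
    return [sum(col) / len(segments) for col in zip(*segments)]

def group_boundary(segments, wave, keep=0.95):
    """Step J4: smallest similarity among the top 95% most similar segments."""
    sims = sorted((ncc(s, wave) for s in segments), reverse=True)
    return sims[max(1, math.ceil(keep * len(sims))) - 1]

def classification_score(similarity, boundary):
    """Step J5: Q = ((1 - NCC) / (1 - B))^2."""
    if boundary >= 1.0:  # guard against division by zero
        return 0.0 if similarity >= 1.0 else math.inf
    return ((1 - similarity) / (1 - boundary)) ** 2

def assign(segment, groups):
    """Pick the group with minimum Q among groups whose boundary is met."""
    scores = {}
    for name, (wave, boundary) in groups.items():
        s = ncc(segment, wave)
        if s >= boundary:
            scores[name] = classification_score(s, boundary)
    return min(scores, key=scores.get) if scores else None

# Hypothetical clusters as produced by step J2.
clusters = {"A": [[0, 1, 2, 1], [0, 1, 2, 2], [0, 2, 2, 1]],
            "B": [[2, 1, 0, 1], [2, 1, 0, 0]]}
groups = {}
for name, segs in clusters.items():
    wave = fundamental_wave(segs)
    groups[name] = (wave, group_boundary(segs, wave))
label = assign([0, 1, 2, 1], groups)
```

A segment that meets no boundary is left unassigned (`None`), which corresponds to an abnormal item in this sketch.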
Further, step A7 is replaced with: replacing each similarity value in the similarity matrix that is larger than the inflection point with 1, and each similarity value that is lower than the inflection point with 0;
and marking adjacent clusters whose similarity in the resulting similarity matrix is 1 as the same similar group, and counting the number of clusters in each similar group.
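The binarization and grouping can be sketched as below. The sketch assumes that "adjacent clusters" means clusters with consecutive indices in the similarity matrix, which the source does not state explicitly; the matrix values and the knee value are hypothetical.

```python
def binarize(sim, knee):
    """Replace similarities above the knee with 1 and all others with 0."""
    return [[1 if v > knee else 0 for v in row] for row in sim]

def adjacent_groups(binary):
    """Chain consecutive clusters i, i + 1 while binary[i][i + 1] == 1."""
    groups = [[0]] if binary else []
    for i in range(1, len(binary)):
        if binary[i - 1][i] == 1:
            groups[-1].append(i)  # same similar group as the previous cluster
        else:
            groups.append([i])    # start a new similar group
    return groups

# Hypothetical similarity matrix between four clusters.
sim = [[1.0, 0.9, 0.2, 0.1],
       [0.9, 1.0, 0.3, 0.2],
       [0.2, 0.3, 1.0, 0.8],
       [0.1, 0.2, 0.8, 1.0]]
groups = adjacent_groups(binarize(sim, knee=0.5))
largest = max(groups, key=len)
```

The total time interval of the largest group would then give the sliding window width of step A8.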
Further, the step of dividing the KPI curve window segment into bands in step A9 is: using the NCC algorithm, calculating one by one the similarity between the fundamental waves obtained in step A1 and the bands in each window of each log KPI curve to obtain $NCC_{M'_i-J_k}$, sorting from large to small, and, among the bands whose waveform similarity ranks in the top 95%, taking the minimum waveform similarity as the grouping boundary line $B'_k$ of the group; judging, with the grouping boundary line of each group as the reference, whether each segment of log KPI curve data set $M'_i$ belongs to that group; ranking a segment $M'_i$ that belongs to several groups simultaneously by the classification score $Q'$; and assigning $M'_i$ to the group with the minimum classification score $Q'$, thereby forming a label chain composed of fundamental wave labels and obtaining the mode waveforms of the different KPIs, which are called KPI curve code pattern rearrangement tables,
$$Q' = \left(\frac{1 - NCC_{M'_i-J_k}}{1 - B'_k}\right)^2$$
further, after all tag chains are arranged according to the time dimension, causal relationships among different tag chains occurring at different times are discovered based on a sequence mining algorithm SPADE or GSP.
Specific nouns in the fault log text generated by the industrial control equipment of the same industrial control system influence one another causally: paired nouns appear synchronously because of the same cause. Similar noun queues can therefore be classified into one class, which is the event relationship obtained in step F8, and counting the frequency of an event relationship yields a log KPI curve. The log KPI curve accompanies the index KPI curve obtained when the industrial control equipment monitors the analog quantity of a physical parameter. Because the index KPI curve can be classified and clustered into a band chain with label ordering characteristics, the log KPI curve has the same band chain characteristics. The band chain characteristics of index KPI curves generated by the same cause for different physical parameters are similar, and so are those of log KPI curves generated by the same cause for different event relationships.
To find the band chains, a sliding window of suitable width is slid along the log KPI curve, and a log KPI curve unit segment is cut out by the window. A plurality of equal-length bands are extracted from the unit segment, and each band is labeled according to its similarity to the characteristic fundamental waves, so that the unit segment becomes a band chain with label ordering characteristics. Each slide of the window along the log KPI curve thus yields one band chain; all band chains are equal in length and differ only in the ordering of the band classification labels. After all band chains obtained by the sliding window are arranged along the time dimension, the causal relationships in the time dimension between band chains with different ordering characteristics, and hence the causal relationships between event relationships, can be obtained based on the sequence mining algorithm SPADE, expert evaluation and knowledge graph fusion. This helps experts supplement the knowledge system for fault identification in the system and discover previously unknown associations between monitoring indexes, so that new early-warning control relationships and new regulation thresholds can be established from the newly discovered associations between monitoring indexes in operation, improving the stability of each monitored object in the same system.
The technical problem solved by the invention is similar to that of prior art CN110726898B. The feature compression code that CN110726898B obtains by inputting a waveform into an autoencoder network is equivalent to extracting the band chain from the KPI curve, or to deriving event tuples from the fault log, in the present invention. Inputting the compression code into a classification model to obtain the fault waveform type is equivalent to obtaining, in the present invention, the causal relationships in the time dimension between band chains with different characteristics based on the sequence mining algorithm SPADE, expert evaluation and knowledge graph fusion; or to entering event tuples into an existing fault event relationship table (the classification model) and classifying them into associated event groups based on Snowball.
Drawings
FIG. 1 is a log KPI curve generated from fault logs generated by industrial control equipment in the same industrial control system;
FIG. 2 is a label chain of formed fundamental wave labels;
FIG. 3 shows the clustered categories of the log KPI curves generated from the fault log text.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are obviously only some, not all, of the embodiments of the present invention, and all other embodiments obtained by those skilled in the art without inventive work fall within the scope of the present invention. In the following embodiments, "label chain" and "band chain" have the same meaning, as do "KPI curve unit segment" and "KPI curve window segment". An industrial control system here consists of industrial control devices linked directly or indirectly by a material supply, electric energy transfer, heat energy transfer, mechanical energy transfer, magnetic field transfer, energy conversion, or signal control relationship. The industrial control devices in the same industrial control system produce fault logs based on monitoring indexes, and because the monitoring indexes are correlated, the fault logs are correlated as well.
Example 1
A method for generating KPIs based on log keyword clustering comprises the following steps:
Step R1. Collect the fault logs that the industrial control devices in the same power station industrial control system network produce based on monitoring indexes, construct event tuples from the fault logs, and process the fault logs with the Snowball algorithm to construct event relations.
The event tuples are constructed as follows:
F1. Set up a training sentence subset consisting of training sentences. Extract corpora from the fault log, pair each corpus with each training sentence to form sentence pairs to be processed, and segment the sentences of each pair into words based on a pre-constructed corpus that comprises an industry corpus and a common corpus;
F2. Convert each feature word of the segmented sentences into a word vector, compute the similarity of each sentence pair using cosine similarity, and delete the corpus if the similarity is below a threshold, which is set to 0.9;
Steps F1-F2 pick out from the fault logs the sentences whose grammatical and semantic structure serves reference, behavior recording, and state description. In the general grammar of fault logs in an industrial control system, the descriptive structure of a sentence is rarely ambiguous, which helps remove erroneous logs from the fault logs and keep the industrial record logs;
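The filtering of steps F1-F2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the embodiment itself: sentences are assumed to be already segmented into word lists, a simple bag-of-words count vector stands in for the word vectors, and a corpus is kept when its similarity to at least one training sentence reaches the threshold (the text does not say how the sentence pairs are aggregated). The names `bag_of_words` and `filter_corpus` are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def bag_of_words(tokens, vocabulary):
    """Assumed stand-in for word vectors: word counts over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def filter_corpus(corpus, training_sentences, threshold=0.9):
    """Keep segmented log sentences that resemble at least one training sentence.

    Assumption: the best similarity over all sentence pairs decides.
    """
    vocabulary = sorted({w for s in corpus + training_sentences for w in s})
    train_vectors = [bag_of_words(s, vocabulary) for s in training_sentences]
    kept = []
    for sentence in corpus:
        vector = bag_of_words(sentence, vocabulary)
        best = max(cosine_similarity(vector, t) for t in train_vectors)
        if best >= threshold:
            kept.append(sentence)
    return kept
```

With the threshold of 0.9 from step F2, a log sentence that shares no words with any training sentence scores 0 and is deleted.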
The corpus is segmented with the cut function of jieba:
def cut(sentence, cut_all=False, HMM=True)
Here, sentence is the sentence sample to be segmented; cut_all selects the segmentation mode: jieba offers a full mode and an accurate mode, selected by True and False respectively, and the default is False, i.e. the accurate mode; HMM controls whether the hidden Markov model, one of the theoretical models used for word segmentation, is used, and it is enabled by default.
F3. Segment the corpus remaining after step F2 into words to form a word segmentation queue consisting of several feature words, and tag the feature words with their parts of speech to obtain the part-of-speech queue of the corpus;
The part-of-speech category code of each input word is returned using jieba. Yangjun describes the usage steps and the part-of-speech classification table of jieba.
F4. If the part-of-speech queue contains several special feature words corresponding to special parts of speech, obtain the boundary and category of the named entity from those special feature words using a named entity recognition model, and update the parts of speech of the special feature words in the part-of-speech queue to the boundary and category of the named entity, giving an updated part-of-speech queue;
The special parts of speech include numerals and time words. In the application scenario of this embodiment, only numerical values and times are prone to inaccurate part-of-speech classification. For example, in FIG. 3, segmenting the corpus "16:10:23 (set I) signal pulse allow" yields the part-of-speech queue "{16: m, ':': x, 10: m, ':': x, 23: m, '(': x, set I: n, ')': x, signal: n, pulse: n, allow: v}", where m denotes a numeral, x a non-word character string, n a noun, and v a verb. The part-of-speech queue obtained after step F4 is "{16:17:00: t, '(': x, set I: n, ')': x, signal: n, occurrence: v, another channel: n, reception: v}". This prevents the hard-to-identify time word from being tagged as a numeral, so that queues containing time words and queues containing numerals can be told apart by their part-of-speech queues.
The named entity recognition model recognizes named nominal items in the corpus to be processed. In the narrow sense, it identifies four types of named entities: person names, place names, organization names, and other proper nouns. Recognition generally has two parts: (1) identifying the entity boundaries; (2) determining the entity category (person name, place name, organization name, or other). Named entities can be identified in various ways, such as rule-based methods, feature-template-based methods, and neural-network-based methods, and the named entity recognition model may be built on any of them.
For example, when the named entity recognition model (CRF) annotates the sentence "I came to Taojia Village" character by character, the correct annotation is: I/O come/O to/O Tao/B Jia/M Village/E, where O means that the current character is not part of a geographical named entity, and B, M, and E mean that the current character is the beginning, middle, and end of a geographical named entity, respectively. When a linear-chain CRF is used to solve this problem, (O, O, O, B, M, E) is one of its candidate tagging sequences.
F5. Classify the remaining corpus according to the labels assigned in step F4, count the occurrence frequency of the part-of-speech queues of each category, and count the occurrence frequency of each verb and each noun in the part-of-speech queues of each category;
F6. Sort the part-of-speech queues of each category in descending order by the occurrence frequency of verbs and by the occurrence frequency of nouns, select from each of the two rankings the top-ranked set of part-of-speech queues according to a ranking threshold, and extract the corpus corresponding to the intersection of the two sets to construct the real training set;
F7. From the corpus of the real training set, select the word segmentation queues whose part-of-speech tag combination is [n, v, n], and extract from each queue the first and second words whose part of speech is noun or proper noun as event 1 and event 2, respectively, to form an event tuple;
F8. Find the event association rules of the event tuples using the Snowball algorithm, and find the associated event groups among the event tuples according to the event association rules:
C1. Using the existing fault event relation table, match the queues in the event tuples that contain events from the table, and generate templates. A template is a five-tuple: <left>, event 1 type, <middle>, event 2 type, <right>, where len is a freely chosen length, <left> is the vector representation of the len words to the left of event 1, <middle> is the vector representation of the words between event 1 and event 2, and <right> is the vector representation of the len words to the right of event 2;
C2. Cluster the generated templates: templates whose similarity is greater than 0.7 are clustered into one class, a new template is generated from each class by the averaging method, and the new template is added to the rule base that stores the templates. The template format can be written as
$$P = (\vec{l},\ E_1,\ \vec{m},\ E_2,\ \vec{r})$$
where $E_1$ and $E_2$ denote the event 1 type and the event 2 type of template $P$, $\vec{l}$ is the vector representation of the 3 words to the left of $E_1$, $\vec{m}$ is the vector representation of the words between $E_1$ and $E_2$, and $\vec{r}$ is the vector representation of the 3 words to the right of $E_2$. The similarity between template 1,
$$P_1 = (\vec{l}_1,\ E_1,\ \vec{m}_1,\ E_2,\ \vec{r}_1)$$
and template 2,
$$P_2 = (\vec{l}_2,\ E_1',\ \vec{m}_2,\ E_2',\ \vec{r}_2)$$
is computed as follows. If the condition
$$E_1 = E_1' \quad \text{and} \quad E_2 = E_2'$$
is satisfied, i.e. the event 1 type $E_1$ of template $P_1$ is the same as the event 1 type $E_1'$ of template $P_2$ and the event 2 type $E_2$ of template $P_1$ is the same as the event 2 type $E_2'$ of template $P_2$, then the similarity of $P_1$ and $P_2$ is
$$\mathrm{Match}(P_1, P_2) = \mu_1\,\vec{l}_1 \cdot \vec{l}_2 + \mu_2\,\vec{m}_1 \cdot \vec{m}_2 + \mu_3\,\vec{r}_1 \cdot \vec{r}_2$$
where $\mu_1$, $\mu_2$, $\mu_3$ are weights. Because $\vec{m}$ has a large influence on the similarity between templates, one may set $\mu_2 > \mu_1 > \mu_3$. If the condition is not satisfied, the similarity of $P_1$ and $P_2$ is recorded as 0;
The averaging method averages the vectors of the templates in the same class to generate a new template; for details, refer to published descriptions of the Snowball algorithm for relation extraction.
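The template comparison and averaging of step C2 can be sketched as below. This is only an illustration under assumptions: the `Template` class, its field names, and the weight values (0.3, 0.5, 0.2) are hypothetical (chosen only so that the middle weight is the largest and the right weight the smallest), and the context vectors are assumed to be unit-normalized so that the similarity stays within [0, 1].

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Template:
    left: List[float]      # vector of the words left of event 1
    e1_type: str           # event 1 type
    middle: List[float]    # vector of the words between the two events
    e2_type: str           # event 2 type
    right: List[float]     # vector of the words right of event 2


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def template_similarity(p1, p2, mu=(0.3, 0.5, 0.2)):
    """Weighted sum of context dot products; 0 when the event types differ."""
    if p1.e1_type != p2.e1_type or p1.e2_type != p2.e2_type:
        return 0.0
    mu1, mu2, mu3 = mu  # assumed values with mu2 > mu1 > mu3
    return (mu1 * dot(p1.left, p2.left)
            + mu2 * dot(p1.middle, p2.middle)
            + mu3 * dot(p1.right, p2.right))


def average_templates(templates):
    """Averaging method: component-wise mean of the vectors of one class."""
    def mean(vectors):
        return [sum(column) / len(vectors) for column in zip(*vectors)]

    first = templates[0]
    return Template(mean([t.left for t in templates]), first.e1_type,
                    mean([t.middle for t in templates]), first.e2_type,
                    mean([t.right for t in templates]))
```

Templates whose pairwise similarity exceeds 0.7 would be placed in one class and then passed to `average_templates` to produce the template stored in the rule base.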
Step C3. Compute the similarity between the templates of the event tuples obtained in step C1 and the templates in the rule base one by one. Templates whose similarity is below the threshold of 0.7 are discarded, and the events in templates whose similarity is above the threshold of 0.7 are added to the log key event relation table, which replaces the fault event relation table;
Step C4. Repeat steps C1-C3 until step C3 no longer discards any template;
Step R2. Mark the fault logs using each event relation generated in step C4 as a log key event label.
As shown in FIG. 1, the number of occurrences per minute of each log key event label is taken as a monitoring index to build a log KPI curve for each label, and each log KPI curve is smoothed with a Gaussian kernel;
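Building and smoothing one log KPI curve can be sketched as follows. This is a minimal illustration, assuming that the occurrences of one log key event label are given as integer minute indices; the kernel width `sigma`, the kernel `radius`, and the edge handling (clamping to the first or last sample) are assumptions, since the text only states that a Gaussian kernel is used.

```python
import math
from collections import Counter


def build_kpi_curve(event_minutes, start_minute, end_minute):
    """Occurrences per minute of one label over [start_minute, end_minute)."""
    counts = Counter(event_minutes)
    return [counts.get(minute, 0) for minute in range(start_minute, end_minute)]


def gaussian_kernel(sigma, radius):
    """Discrete Gaussian weights normalized to sum to 1."""
    weights = [math.exp(-(k * k) / (2 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(curve, sigma=1.0, radius=3):
    """Gaussian-kernel smoothing; indices are clamped at the curve edges."""
    kernel = gaussian_kernel(sigma, radius)
    n = len(curve)
    smoothed = []
    for i in range(n):
        acc = 0.0
        for k, w in zip(range(-radius, radius + 1), kernel):
            j = min(max(i + k, 0), n - 1)
            acc += w * curve[j]
        smoothed.append(acc)
    return smoothed
```

Because the kernel weights sum to 1, a constant curve is left unchanged, while an isolated spike is spread over its neighboring minutes.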
Step A9. Label the log KPI curves according to their periodicity classification;
A periodicity check is performed on the log KPI curve of each event relation, and the Gaussian-smoothed log KPI curves are labeled according to the differences in their periodicity; these labels are called log KPI curve period labels;
The periodicity check of step D1 comprises the following steps:
Z01. Extract the spectral intensity graph of the log KPI curve using the Fourier transform;
Z02. Extract the point with the highest amplitude and compute the corresponding period, which is the period to be checked;
Z03. Set a hypothesized period, i.e. the expected period. The correlation strength of the period to be checked is tested if and only if its length lies within 95-105% of the expected period, and if the spectral intensity is sufficient, the period to be checked is accepted as a period that meets the requirement.
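Steps Z01-Z03 can be sketched with a plain discrete Fourier transform. The sketch is written under assumptions: the curve is sampled once per minute, the mean is removed before the transform, and "sufficient spectral intensity" is interpreted as the dominant peak holding at least a share `min_strength_ratio` of the total spectral amplitude. That criterion and its default value are hypothetical, since the text does not define the strength test.

```python
import cmath
import math


def spectrum(curve):
    """Amplitude of each positive DFT frequency of the mean-removed curve."""
    n = len(curve)
    mean = sum(curve) / n
    centered = [v - mean for v in curve]
    amplitudes = []
    for k in range(1, n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(centered))
        amplitudes.append((k, abs(coeff)))
    return amplitudes


def check_period(curve, expected_period, min_strength_ratio=0.5):
    """Return the checked period in samples, or None if the check fails."""
    amplitudes = spectrum(curve)
    k_best, a_best = max(amplitudes, key=lambda pair: pair[1])
    candidate = len(curve) / k_best  # period to be checked
    if not (0.95 * expected_period <= candidate <= 1.05 * expected_period):
        return None
    total = sum(a for _, a in amplitudes)
    if total == 0 or a_best / total < min_strength_ratio:
        return None  # assumed strength criterion not met
    return candidate
```

A curve with a clean 12-minute rhythm passes the check for an expected period of 12 and fails it for an expected period of 20.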
Step A10. Label the log KPI curves according to their similarity classification;
The pairwise similarity of the log KPI curves is computed with the NCC algorithm and filled into a similarity matrix that is symmetric about its diagonal. The row and column indices of the matrix are the serial numbers of the log KPI curves, the number of rows and columns equals the number of log KPI curves, and each entry is the similarity between the two corresponding log KPI curves;
A spectral clustering algorithm is applied to the similarity matrix, and each resulting cluster is marked with a distinct log KPI curve label, which gives the mapping relation (implicit service relation) between the log key event labels;
The spectral clustering method is described at https://zhuanlan.zhihu.com/p/29849122.
Step A11 the KPI curve obtained in step A10 was pre-processed as in example 4.
Example 2
The method for marking band features on the log KPI curves obtained in Example 1 comprises the following steps:
Step A1. Extract the per-minute data point sets of all log KPI curves into one curve set L, and divide the curve set L into several log KPI curve data sets $M_i$ with a time width of s minutes, where i is the segment number;
Step A2. Using the DBSCAN algorithm, compute the Euclidean distances between the data sets from the attributes of each log KPI curve data set, and cluster the i log KPI curve data sets into k clusters plus outliers. Each cluster is a grouped data set containing j log KPI curve data sets $F_j$;
Step A3. Compute the arithmetic mean $\sum F_j / j$ of the j log KPI curve data sets in each grouped data set and take it as the fundamental wave of the group;
Step A4. Using the NCC algorithm, compute the waveform similarity between each log KPI curve data set $F_j$ of each grouped data set and the fundamental wave, and sort the similarities from large to small. Among the data sets $F_j$ whose waveform similarity ranks in the top 95%, the minimum waveform similarity is taken as the grouping boundary line $B_k$ of the group;
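Steps A3 and A4 can be sketched as below. The sketch assumes that every data set in a group is a list of equal length and that the waveform similarities to the fundamental wave have already been computed; how the "top 95%" is rounded to a whole number of data sets is an assumption (rounded down, with at least one data set kept). The function names are hypothetical.

```python
def fundamental_wave(segments):
    """Step A3: point-wise arithmetic mean of the data sets in one group."""
    return [sum(column) / len(segments) for column in zip(*segments)]


def grouping_boundary(similarities, keep_fraction=0.95):
    """Step A4: smallest similarity among the top `keep_fraction` of the ranking.

    Assumption: the number of kept data sets is rounded down, minimum 1.
    """
    ordered = sorted(similarities, reverse=True)
    keep = max(1, int(len(ordered) * keep_fraction + 1e-9))
    return ordered[keep - 1]
```

For a group of 20 data sets, the boundary line is the 19th largest similarity, so the single least similar data set falls outside the boundary.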
Step A5. Using the NCC algorithm, compute the waveform similarity $NCC_{M_i-J_k}$ between each log KPI curve data set $M_i$ and the fundamental wave $J_k$ of each group. Taking the grouping boundary line of each group as the reference, judge whether each log KPI curve data set belongs to the group. A data set that belongs to several groups simultaneously is ranked by the classification score $Q$ and assigned to the group with the smallest $Q$, which gives the grouping information of every log KPI curve data set:
$$Q = \left(\frac{1 - NCC_{M_i-J_k}}{1 - B_k}\right)^2$$
The larger $NCC_{M_i-J_k}$ is, the smaller $Q$ is and the more similar $M_i$ is to cluster k. When the current log KPI curve data set $M_i$ has the same similarity $NCC_{M_i-J_k}$ to different clusters, a smaller $B_k$ means that the similarity of $M_i$ to cluster k ranks higher within that cluster's waveform similarity ranking. The formula thus measures how likely $M_i$ is to belong to each candidate cluster, and hence which cluster it most probably belongs to.
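The assignment rule of step A5 can be sketched as follows, using Q = ((1 - NCC) / (1 - B_k))^2 from the text. The dictionaries keyed by group name and the function names are hypothetical, a data set is treated as belonging to a group when its similarity is at least the boundary line (the text does not say how equality is handled), and every boundary line is assumed to be smaller than 1.

```python
def classification_score(ncc, boundary):
    """Q = ((1 - NCC) / (1 - B_k)) ** 2; smaller means a better fit."""
    return ((1.0 - ncc) / (1.0 - boundary)) ** 2


def assign_group(ncc_by_group, boundary_by_group):
    """Pick the group with the smallest Q among the groups whose boundary is met.

    Returns None when the data set reaches no group's boundary line.
    """
    candidates = {
        group: classification_score(ncc, boundary_by_group[group])
        for group, ncc in ncc_by_group.items()
        if ncc >= boundary_by_group[group]
    }
    if not candidates:
        return None
    return min(candidates, key=candidates.get)
```

For example, with the same similarity of 0.9 to groups a and b and boundary lines of 0.8 and 0.5, the scores are 0.25 and 0.04, so the data set is assigned to group b, matching the remark that a smaller $B_k$ gives a higher rank within the cluster.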
Step A6. Extract the timestamps of all log KPI curve data sets assigned to the different groups to obtain a timestamp list for each group;
Step A7. Difference each group's timestamp list item by item, i.e. subtract the start timestamp of the current item from the start timestamp of the next item in the list, to obtain an event trigger interval list;
An event trigger interval is the time interval between two adjacent log KPI curve data sets in a grouped data set;
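The item-by-item differencing of step A7 amounts to the following sketch. Timestamps are assumed to be numeric start times (for example minutes since the start of the log), and the list is sorted first, which is an assumption about the input order.

```python
def trigger_intervals(start_timestamps):
    """Step A7: differences between consecutive start timestamps of one group."""
    ordered = sorted(start_timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]
```

A group whose data sets start at minutes 0, 5, 7, and 15 thus has the event trigger interval list [5, 2, 8]; a group with a single data set has an empty list.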
Step A8. Merge the event trigger intervals of each cluster into a time interval KPI set, and compute the similarity between the time interval KPI sets of the clusters using NCC. If the time interval KPI sets of different clusters are similar, the waveforms of those clusters are similar over the total time width;
Step A9. Expand the similarities between the time interval KPI sets of the clusters obtained in step A8 into a similarity matrix. As shown in Table 1, a to d are the serial numbers of the clusters, the number of rows and columns of the similarity matrix equals the number of clusters, each entry is the similarity between the time interval KPI sets of two clusters, and the matrix is symmetric about its diagonal;
[Table 1: similarity matrix of the time interval KPI sets between clusters a to d]
Step A10. Sort the similarities between the time interval KPI sets of the clusters by value, fit the similarity values to a smooth line, and obtain the dividing line of the similarities using the knee point method;
Step A11. Replace every value in the similarity matrix that is greater than the knee point with 1 and every value smaller than the knee point with 0, as shown in Table 2;
[Table 2: similarity matrix after binarization at the knee point]
Step A12. In the similarity matrix obtained in step A11, mark adjacent clusters whose similarity is 1 as the same similar group, and count the number of clusters in each similar group;
Step A13. Compute the total time interval of the similar group that contains the most clusters and use it as the width of the sliding window;
The total time interval is set as the sliding window width, and the window divides the log KPI curve into several segments, each of whose time width covers the similar group with the longest duration obtained in substep S12. Scanning the log KPI curve with the sliding window allows consecutively occurring clusters to be gathered quickly into one window and then clustered quickly into the same waveform category, which reduces the amount of computation, allows the bands of the log KPI curve to be classified as a whole, and reduces the chance of missing knowledge.
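Steps A10 and A11 can be sketched as below. The text does not specify which knee point method is used, so this sketch substitutes one common heuristic: after sorting the values in descending order, the knee is the value farthest from the straight line joining the first and last points. No curve fitting is performed, and values equal to the knee are mapped to 1, which is an assumption because the text only covers "greater than" and "smaller than".

```python
def knee_point(similarities):
    """Value farthest from the chord of the descending-sorted similarity curve."""
    ordered = sorted(similarities, reverse=True)
    n = len(ordered)
    if n < 3:
        return ordered[-1]
    y0, y1 = ordered[0], ordered[-1]
    best_index, best_distance = 0, -1.0
    for i, y in enumerate(ordered):
        # Proportional to the distance from (i, y) to the chord (0, y0)-(n-1, y1).
        distance = abs((y1 - y0) * i - (n - 1) * (y - y0))
        if distance > best_distance:
            best_index, best_distance = i, distance
    return ordered[best_index]


def binarize(matrix, threshold):
    """Step A11: 1 for values at or above the knee (assumed), 0 below it."""
    return [[1 if value >= threshold else 0 for value in row] for row in matrix]
```

Rows and columns of the binarized matrix that share a 1 then indicate clusters that may be marked as the same similar group in step A12.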
The above-mentioned NCC (normalized cross correlation) algorithm is defined as:
$$NCC(h) = \frac{\sum_t x_t\, y_{t+h}}{\sqrt{\sum_t x_t^2 \,\sum_t y_{t+h}^2}}$$
where $x_t$ is the background waveform and $y_{t+h}$ is the transformed waveform. The value of NCC lies between -1 and 1: -1 means that the waveforms before and after the transformation are opposite, 0 means that the two waveforms are orthogonal, and 1 means that they are identical. NCC describes only the macroscopic similarity of the two waveforms and is independent of waveform amplitude and energy attenuation.
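A sketch of the NCC computation and of the pairwise similarity matrix built from it follows. It assumes the common normalized form NCC(h) = sum(x_t * y_(t+h)) / sqrt(sum(x_t^2) * sum(y_(t+h)^2)), which is consistent with the stated range of -1 to 1 and with the independence from amplitude; the `lag` parameter, the restriction of the sums to the overlapping samples, and the value 1 on the matrix diagonal are assumptions.

```python
import math


def ncc(x, y, lag=0):
    """Normalized cross-correlation of x and y shifted by `lag` samples."""
    pairs = [(x[t], y[t + lag]) for t in range(len(x)) if 0 <= t + lag < len(y)]
    numerator = sum(a * b for a, b in pairs)
    denominator = math.sqrt(sum(a * a for a, _ in pairs)
                            * sum(b * b for _, b in pairs))
    return numerator / denominator if denominator else 0.0


def similarity_matrix(curves):
    """Symmetric matrix of pairwise NCC values; the diagonal is set to 1."""
    n = len(curves)
    matrix = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = ncc(curves[i], curves[j])
    return matrix
```

Two waveforms that differ only by a positive scale factor give an NCC of 1, which illustrates why the measure ignores amplitude.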
Step A14. First, according to the sliding window obtained in step A13, divide each log KPI curve obtained in step F10 into several log KPI curve window segments whose time width is the total time interval, and, following the dividing method of step A1, divide each window segment into i log KPI curve data sets $M_i$ with a time width of 1 minute; each segment is a band;
B. Using the NCC algorithm, compute one by one the similarity $NCC_{M_i-J_k}$ between each fundamental wave obtained in step A3 and each band in each window of each log KPI curve, and sort the similarities from large to small. Among the bands whose waveform similarity ranks in the top 95%, the minimum waveform similarity is taken as the grouping boundary line $B_k$ of the group. Taking the grouping boundary line of each group as the reference, judge whether each log KPI curve data set $M_i$ belongs to the group. A data set $M_i$ that belongs to several groups simultaneously is ranked by the classification score $Q$ defined in step A5 and assigned to the group with the smallest $Q$. This forms the label chain of fundamental wave labels shown in FIG. 2 and yields the mode waveforms of the different KPIs, called the KPI curve code pattern rearrangement table.
The label information obtained after step A14 covers every band and consists of two parts, band labels and waveform labels: the band label gives the fundamental wave type, and the waveform labels are of two kinds, service labels and period labels.
In this way, each slide of the window along a log KPI curve yields one band chain; all band chains have the same length and differ only in the order of their band classification labels. This embodiment converts the curve features of the log KPI curves of related monitoring indexes into label-chain ordering features. Because of the relationship, the log KPI curves differ in amplitude but have similar periods and rhythms, i.e. similar label arrangements, so a large number of related KPI curves can be unified into standard, consistent label chains.
Step A15. Place the different KPI curve code pattern rearrangement tables on one unified time dimension to obtain the KPI curve code pattern rearrangement association table.
If different log KPI curves share the same log KPI curve service label, a causal relationship may exist between them, and the probability is higher for aperiodic log KPI curves than for periodic ones.
Different log KPI curves may also have a causal relationship if the same fundamental wave label appears in adjacent time segments; the more often this repeats, the higher the probability.
After all label chains are arranged along the time dimension, a sequence mining algorithm such as SPADE or GSP can be used to discover causal relationships between different label chains that occur at different times. If two events always occur in pairs, they are considered related, and if one event always occurs before the other, the two are considered to have a causal relationship, with the earlier event as the cause and the later one as the effect. This helps experts supplement the knowledge system for fault determination in the system and discover previously unknown associations between monitoring indexes, so that new early-warning control relationships and regulation thresholds can be established from the newly discovered associations during operation, improving the stability of every monitored object in the same system.
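The "always occurs before" criterion can be illustrated with a simple precedence count. This is a deliberately simplified stand-in and not an implementation of SPADE or GSP: each input sequence is assumed to be a time-ordered list of label identifiers from one observation period, and a pair (a, b) is reported when a precedes b in at least `min_support` of the sequences while the reverse order stays below that level. All names and the default threshold are hypothetical.

```python
from collections import Counter


def precedence_support(label_chains):
    """Fraction of sequences in which label a occurs somewhere before label b."""
    counts = Counter()
    for chain in label_chains:
        seen_before = set()
        pairs = set()
        for label in chain:
            pairs.update((earlier, label) for earlier in seen_before
                         if earlier != label)
            seen_before.add(label)
        counts.update(pairs)  # each pair counts once per sequence
    total = len(label_chains)
    return {pair: count / total for pair, count in counts.items()}


def candidate_rules(label_chains, min_support=0.8):
    """Ordered pairs that are frequent in one direction only."""
    support = precedence_support(label_chains)
    return {pair: s for pair, s in support.items()
            if s >= min_support and support.get(pair[::-1], 0.0) < min_support}
```

The reported pairs are only candidates: as the text notes, expert evaluation and knowledge graph fusion are still needed before a precedence pattern is treated as a causal relationship.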

Claims (9)

1. A method for generating KPI curves based on log event relations comprises the following steps:
step F1, setting a training sentence subset consisting of training sentences, obtaining a fault log by the industrial control equipment in the same industrial control system based on monitoring indexes, forming a sentence pair to be processed by the corpora in the fault log and each training sentence respectively, calculating the similarity, and deleting the corpora with the similarity lower than a threshold value one;
step F2., performing word segmentation on the residual corpus in the step F1, generating a word segmentation queue consisting of a plurality of characteristic words, and labeling part of speech of the plurality of characteristic words to obtain a part of speech queue of the corpus;
step F3., if the part-of-speech queue contains a plurality of special feature words corresponding to special parts-of-speech, obtaining the boundary and category of the named entity from the plurality of special feature words by using the named entity recognition model, updating the parts-of-speech of the special feature words in the part-of-speech queue to the boundary and category of the named entity, and obtaining an updated part-of-speech queue, wherein the special parts-of-speech includes: number word, time word;
step F4. classifying the residual corpus according to the labels of step F3, counting the occurrence frequency of the part-of-speech queues of each category, sorting them in descending order, and selecting the part-of-speech queues ranked above a threshold value two; counting the occurrence frequency of each verb and each noun in the part-of-speech queues of each category and sorting in descending order; selecting, according to a ranking threshold, the top-ranked set of part-of-speech queues from the verb ranking and from the noun ranking; and extracting the corpus corresponding to the intersection of the two part-of-speech queue sets to construct a real training set;
step F5., screening a participle queue with part-of-speech tagging combination of [ n, v, n ] from the corpus of the real training set, wherein n represents the part-of-speech of a noun, v represents the part-of-speech of a verb, and extracting a first participle and a second participle with parts-of-speech of a noun or a proper noun from the participle queue as an event I and an event II respectively to form an event tuple;
step F6, based on the existing fault event relation table, using Snowball algorithm to find the event association rule of the event tuple, and finding the association event group in the event tuple according to the event association rule, namely generating a log key event relation table;
step F7, repeatedly using the step F6 based on the log key event relation table until convergence;
step F8., taking each event relation generated in step F7 as a log key event label mark fault log, taking the times of occurrence of each log key event label per minute as a monitoring index, establishing each log KPI curve, and using a Gaussian kernel to smoothly process each log KPI curve;
step A1, merging data points of all minutes in all log KPI curves, dividing the data points into a plurality of band segments with time width of s minutes, clustering the band segments into a plurality of clusters according to non-time dimensions of the band segments, extracting fundamental waves of all the clusters, comparing similarity between band data of all the clusters and the fundamental waves, finding out grouping boundary lines of all the clusters, and grouping the band data of all the clusters;
step A2. extracting the timestamps of all log KPI curve data sets assigned to the different groups to obtain a timestamp list of each group;
step A3, performing step-by-step subtraction on the timestamp lists of each group, namely subtracting the starting timestamp of the next item in each timestamp list from the starting timestamp of the current item to obtain an event trigger interval list;
step A4, combining the event trigger intervals of each cluster into a time interval KPI set, and calculating the similarity between the time interval KPI sets of each cluster according to NCC;
step A5, expanding the similarity of the time interval KPI sets among the clusters obtained in the step A4 into a similarity matrix;
step A6. sorting the similarities of the time interval KPI sets among the clusters by value, fitting the similarity values to a smooth line, and obtaining the dividing line of the similarities of the time interval KPI sets among the clusters according to a knee point method;
step A7., marking adjacent clusters with numerical values larger than the inflection point in the similarity matrix as the same similar group, and counting the cluster number of each similar group;
step A8., calculating the total time interval of the group with the most clusters in the similarity group as the width of the sliding window;
step A9. dividing each log KPI curve, according to the sliding window obtained in step A8, into several log KPI curve window segments whose time width is the total time interval, and dividing the log KPI curve window segments, according to the dividing method of step A1, into i log KPI curve data sets $M_i$ with a time width of 1 minute, each segment being a band;
comparing the similarity of each fundamental wave obtained in step A1 with each band in each window of each log KPI curve one by one, sorting the similarities from large to small, finding the grouping boundary lines according to the sorting, and grouping the bands to form a label chain of fundamental wave labels, thereby obtaining the mode waveforms of different KPIs, which are called KPI curve code pattern rearrangement tables;
step A10. placing the different KPI curve code pattern rearrangement tables on one unified time dimension to obtain a KPI curve code pattern rearrangement association table.
2. The method according to claim 1, wherein the calculating of the similarity in step F1 includes the steps of: respectively segmenting the sentences in the sentence pairs based on a pre-constructed corpus, wherein the pre-constructed corpus comprises an industry corpus and a common corpus;
and converting each characteristic word of the sentence after word segmentation into a word vector, respectively calculating the similarity of each sentence pair by using cosine similarity, and deleting the corpus if the similarity is lower than a threshold value one.
3. The method of claim 2, further comprising, after steps F8 and A1:
Z01. extracting the spectral intensity graph of a log KPI curve using the Fourier transform;
Z02. extracting the point with the highest amplitude and calculating the corresponding period, namely the period to be checked;
Z03. setting a hypothesized period, namely the expected period, testing the correlation strength of the period to be checked if and only if its length is within the range of 95-105% of the expected period, determining the period to be checked as a period meeting the requirement if the spectral intensity is sufficient, and marking the filtered log KPI curves with labels according to the periodicity differences of the log KPI curves, the labels being called log KPI curve period labels.
4. The method of claim 3, wherein step Z03 is further followed by:
the similarity matrix is filled with the similarities, the serial numbers of rows and columns in the matrix are the numbers of the log KPI curves, and the number of rows and columns of the similarity matrix is the number of the log KPI curves;
Z05. outputting different clusters according to the similarity matrix using a spectral clustering algorithm, and marking the different clusters with different log KPI curve labels, called KPI curve service labels.
5. The method according to claim 1, wherein step F6 includes:
c1, matching a queue containing the events in the fault event relation table in the event tuple by using the existing fault event relation table, and generating a template; the format of the template is five-tuple form, which is < left >, event 1 type, < middle >, event 2 type, < right > respectively; len is an arbitrary set length, < left > is a vector representation of len words on the left side of the event 1, < middle > is a vector representation of words between the event 1 and the event 2, and < right > is a vector representation of len words on the right side of the event;
Step C2. clustering the generated templates: templates whose similarity is greater than threshold three are clustered into one class, a new template is generated for the class by averaging, and the new template is added to the rule base that stores the templates. A template is written as
$P = \langle \vec{l}, E_1, \vec{m}, E_2, \vec{r} \rangle$,
where $E_1$ and $E_2$ denote the event 1 type and the event 2 type of template $P$, $\vec{l}$ is the vector representation of the three words to the left of $E_1$, $\vec{m}$ is the vector representation of the words between $E_1$ and $E_2$, and $\vec{r}$ is the vector representation of the three words to the right of $E_2$. The similarity between template 1,
$P_1 = \langle \vec{l}_1, E_1, \vec{m}_1, E_2, \vec{r}_1 \rangle$,
and template 2,
$P_2 = \langle \vec{l}_2, E_1', \vec{m}_2, E_2', \vec{r}_2 \rangle$,
is calculated as follows. If the condition $E_1 = E_1'$ and $E_2 = E_2'$ is satisfied, that is, the event 1 type of $P_1$ is identical to the event 1 type of $P_2$ and the event 2 type of $P_1$ is identical to the event 2 type of $P_2$, then
$\mathrm{Sim}(P_1, P_2) = \mu_1 (\vec{l}_1 \cdot \vec{l}_2) + \mu_2 (\vec{m}_1 \cdot \vec{m}_2) + \mu_3 (\vec{r}_1 \cdot \vec{r}_2)$,
where $\mu_1$, $\mu_2$ and $\mu_3$ are weights; because $\vec{m}$ has a large influence on the similarity between a pair of templates, $\mu_2$ is set to be the largest of the three weights. If the condition is not satisfied, the similarity of $P_1$ and $P_2$ is recorded as 0;
Step C3. calculating, one by one, the similarity between each event tuple template obtained in step C1 and the templates in the rule base; discarding a template if its similarity is smaller than threshold three; and adding the events of templates whose similarity is greater than threshold three to the log key event relation table, which replaces the fault event relation table.
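The template comparison of steps C2 and C3 can be sketched as follows. This is a minimal illustration, not the claimed implementation: it assumes the three vector parts are compared by cosine similarity and combined as a weighted sum, and the `Template` class, the weight values (chosen only so that the middle weight is the largest) and the threshold value are hypothetical.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Template:
    left: list    # vector for the words to the left of event 1
    event1: str   # event 1 type
    middle: list  # vector for the words between the two events
    event2: str   # event 2 type
    right: list   # vector for the words to the right of event 2


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    nu = sqrt(sum(x * x for x in u))
    nv = sqrt(sum(x * x for x in v))
    if nu == 0 or nv == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def template_similarity(p1, p2, mu=(0.25, 0.5, 0.25)):
    """Weighted similarity of two templates; 0 when the event types differ."""
    if p1.event1 != p2.event1 or p1.event2 != p2.event2:
        return 0.0
    mu1, mu2, mu3 = mu  # assumed weights, middle part weighted most
    return (mu1 * cosine(p1.left, p2.left)
            + mu2 * cosine(p1.middle, p2.middle)
            + mu3 * cosine(p1.right, p2.right))


def matches_rule_base(candidate, rule_base, threshold):
    """Step C3: keep a candidate if some rule-base template is similar enough."""
    return any(template_similarity(candidate, rule) > threshold for rule in rule_base)


# Hypothetical toy templates with three-dimensional word vectors.
rule = Template([1, 0, 0], "disk_full", [0, 1, 0], "service_down", [0, 0, 1])
same_types = Template([1, 0, 0], "disk_full", [0, 1, 1], "service_down", [0, 0, 1])
other_types = Template([1, 0, 0], "disk_full", [0, 1, 0], "login_failed", [0, 0, 1])
print(template_similarity(rule, same_types))
print(matches_rule_base(other_types, [rule], threshold=0.8))
```

Returning 0 for mismatched event types means such a candidate can never exceed the threshold, so its events are not added to the log key event relation table.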
6. The method of claim 1, wherein step A1 comprises the steps of:
Step J1. extracting the per-minute data point sets of all the log KPI curves into one curve set L, and dividing the curve set L into a plurality of log KPI curve data sets $M_i$, each with a time width of s minutes, where i is the segment number;
Step J2. using the DBSCAN algorithm, calculating the Euclidean distance between the segment data sets according to the attributes of each log KPI curve data set, and clustering the i segments of log KPI curve data sets to obtain k clusters and abnormal items, where each cluster is a group data set and each group data set contains j segments of log KPI curve data sets $F_j$;
Step J3. calculating the arithmetic mean $\frac{1}{j}\sum F_j$ of the j segments of log KPI curve data sets in each group data set, and taking it as the fundamental wave of the group;
Step J4. using the NCC algorithm, calculating the waveform similarity between each log KPI curve data set $F_j$ in each group data set and the fundamental wave of the group, sorting the similarities from largest to smallest, and, among the log KPI curve data sets $F_j$ whose waveform similarity ranks in the top 95%, taking the minimum waveform similarity as the group boundary line $B_k$ of the group;
Step J5. using the NCC algorithm, calculating the waveform similarity between each log KPI curve data set $M_i$ and the fundamental wave of each group; judging, with the group boundary line of each group as the reference, whether each log KPI curve data set belongs to the group; and, for a log KPI curve data set that belongs to several groups at the same time, ranking those groups by the classification score Q and assigning the data set $M_i$ to the group with the smallest classification score Q, thereby obtaining the grouping information of each log KPI curve data set (the formula for Q appears only as an image in the original publication).
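Steps J3 and J4 can be sketched as follows. This is an illustration under assumptions: NCC is taken here as the zero-lag, mean-subtracted normalized cross-correlation of two equal-length segments, "top 95%" is read as the first 95% of segments after sorting (rounded up), and the function names and toy data are hypothetical.

```python
from math import sqrt


def ncc(a, b):
    """Zero-lag, mean-subtracted normalized cross-correlation of two equal-length segments."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    da = [x - ma for x in a]
    db = [y - mb for y in b]
    denom = sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    return sum(x * y for x, y in zip(da, db)) / denom if denom else 0.0


def fundamental_wave(segments):
    """Step J3: point-wise arithmetic mean of the segments in one group."""
    return [sum(column) / len(segments) for column in zip(*segments)]


def group_boundary(segments, wave, keep_percent=95):
    """Step J4: smallest similarity among the top `keep_percent`% most similar segments."""
    sims = sorted((ncc(seg, wave) for seg in segments), reverse=True)
    keep = max(1, (len(sims) * keep_percent + 99) // 100)  # integer ceiling
    return sims[keep - 1]


# Hypothetical group: 19 triangular segments and one inverted outlier.
group = [[0, 1, 2, 1, 0]] * 19 + [[2, 1, 0, 1, 2]]
wave = fundamental_wave(group)
boundary = group_boundary(group, wave)
print(wave, boundary)
```

With 20 segments the boundary is the 19th largest similarity, so the single inverted segment does not drag the group boundary down.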
7. The method of claim 1, wherein step A7 is replaced with: replacing each similarity value in the similarity matrix that is greater than the inflection point with 1, and each similarity value that is lower than the inflection point with 0;
and treating adjacent clusters whose similarity in the resulting similarity matrix is 1 as the same similar group, and counting the number of clusters in each similar group.
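The thresholding and grouping described in claim 7 can be sketched as follows. This is an illustration under assumptions: "adjacent" is read as "linked, directly or through other clusters, by an entry equal to 1", values exactly equal to the inflection point are mapped to 0 because the claim does not cover that case, and the function names and matrix values are hypothetical.

```python
def binarize(matrix, inflection):
    """Map values above the inflection point to 1 and all other values to 0."""
    return [[1 if value > inflection else 0 for value in row] for row in matrix]


def similar_groups(binary):
    """Return groups of cluster indices connected through 1-entries."""
    n = len(binary)
    seen, groups = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, group = [start], []
        while stack:  # depth-first walk over linked clusters
            i = stack.pop()
            group.append(i)
            for j in range(n):
                if j not in seen and (binary[i][j] or binary[j][i]):
                    seen.add(j)
                    stack.append(j)
        groups.append(sorted(group))
    return groups


# Hypothetical similarity matrix for four clusters.
similarity = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
groups = similar_groups(binarize(similarity, inflection=0.5))
sizes = [len(g) for g in groups]
print(groups, sizes)
```

The list `sizes` corresponds to the cluster count of each similar group.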
8. The method of claim 1, wherein segmenting the KPI curve window segments into bands in step A9 comprises: using the NCC algorithm, calculating one by one the waveform similarity between the fundamental waves obtained in step A1 and the band in each window of each log KPI curve, and sorting the similarities from largest to smallest; among the bands whose waveform similarity ranks in the top 95%, taking the minimum waveform similarity as the group boundary line $B_k$ of the group; judging, with the group boundary line of each group as the reference, whether each log KPI curve data set $M_i$ belongs to the group; for a log KPI curve data set that belongs to several groups at the same time, ranking those groups by the classification score Q and assigning the data set $M_i$ to the group with the smallest classification score; and thereby forming a label chain composed of fundamental wave labels and obtaining the pattern waveforms of the different KPIs, which are referred to as the KPI curve code pattern rearrangement table (its defining formula appears only as an image in the original publication).
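Building a label chain from windowed bands can be sketched as follows. This is an illustration only: because the formula for the classification score is not available in the text, the sketch swaps in a simpler tie-break and assigns a window that passes several group boundaries to the group with the highest NCC. The NCC variant (zero-lag, mean-subtracted), the labels, the boundary values and the `"?"` marker for unmatched windows are all assumptions.

```python
from math import sqrt


def ncc(a, b):
    """Zero-lag, mean-subtracted normalized cross-correlation of two equal-length segments."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    da = [x - ma for x in a]
    db = [y - mb for y in b]
    denom = sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    return sum(x * y for x, y in zip(da, db)) / denom if denom else 0.0


def label_chain(windows, fundamentals, boundaries, unknown="?"):
    """Label each window, in time order, with the fundamental wave it matches."""
    chain = []
    for window in windows:
        best_label, best_sim = unknown, None
        for label, wave in fundamentals.items():
            sim = ncc(window, wave)
            # A window belongs to a group only if it reaches that group's boundary.
            if sim >= boundaries[label] and (best_sim is None or sim > best_sim):
                best_label, best_sim = label, sim
        chain.append(best_label)
    return chain


# Hypothetical fundamental waves, boundaries and windows of one KPI curve.
fundamentals = {"A": [0, 1, 2, 1, 0], "B": [2, 1, 0, 1, 2]}
boundaries = {"A": 0.9, "B": 0.9}
windows = [[0, 2, 4, 2, 0], [3, 2, 1, 2, 3], [1, 1, 1, 1, 1]]
chain = label_chain(windows, fundamentals, boundaries)
print(chain)
```

Each KPI curve yields one such chain, and the chains of different KPIs can then be aligned on the time axis.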
9. The method of claim 1, wherein, after all the label chains are arranged along the time dimension, causal relationships between different label chains occurring at different times are discovered using a sequence mining algorithm, SPADE or GSP.
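The idea behind mining ordered relationships from time-aligned labels can be sketched as follows. This is not SPADE or GSP; it is a simplified stand-in that only counts the support of length-2 ordered patterns, that is, in how many sequences label a occurs before label b. How sequences are cut from the label chains, the label names and the support threshold are assumptions made for the example.

```python
from collections import Counter


def ordered_pair_support(sequences):
    """Count, for each ordered pair (a, b), the sequences in which a occurs before b."""
    support = Counter()
    for seq in sequences:
        pairs = set()  # count each pair at most once per sequence
        for i, a in enumerate(seq):
            for b in seq[i + 1:]:
                if a != b:
                    pairs.add((a, b))
        support.update(pairs)
    return support


def frequent_pairs(sequences, min_support):
    """Keep only the ordered pairs seen in at least `min_support` sequences."""
    support = ordered_pair_support(sequences)
    return {pair: count for pair, count in support.items() if count >= min_support}


# Hypothetical sequences: labels from different chains ordered by time,
# one sequence per observation period.
sequences = [["A", "B", "C"], ["A", "C"], ["B", "A", "C"]]
result = frequent_pairs(sequences, min_support=3)
print(result)
```

A pair that stays frequent, such as a label that consistently precedes another, is only a candidate for a causal relationship; a full sequence mining algorithm would extend such pairs to longer patterns.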
CN202210292597.6A 2022-03-18 2022-03-24 Method for generating KPI curve and marking wave band characteristics based on log event relation Active CN114398898B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210292597.6A CN114398898B (en) 2022-03-24 2022-03-24 Method for generating KPI curve and marking wave band characteristics based on log event relation
PCT/CN2023/082359 WO2023174431A1 (en) 2022-03-18 2023-03-17 Kpi curve data processing method

Publications (2)

Publication Number Publication Date
CN114398898A (en) 2022-04-26
CN114398898B (en) 2022-06-24

Family

ID=81234703

Country Status (1)

Country Link
CN (1) CN114398898B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023174431A1 (en) * 2022-03-18 2023-09-21 三峡智控科技有限公司 Kpi curve data processing method
CN116405551B (en) * 2023-04-14 2024-03-29 深圳市优友网络科技有限公司 Social platform-based data pushing method and system and cloud platform

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113326244A (en) * 2021-05-28 2021-08-31 中国科学技术大学 Abnormity detection method based on log event graph and incidence relation mining
CN114202009A (en) * 2021-09-27 2022-03-18 南开大学 Medical equipment performance index abnormity detection method and device based on PU learning

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11055334B2 (en) * 2011-09-23 2021-07-06 Avaya Inc. System and method for aligning messages to an event based on semantic similarity
CN105573977A (en) * 2015-10-23 2016-05-11 苏州大学 Method and system for identifying Chinese event sequential relationship
SG10201801831QA (en) * 2018-03-06 2019-10-30 Agency Science Tech & Res Method And Apparatus For Predicting Occurrence Of An Event To Facilitate Asset Maintenance
CN110210019A (en) * 2019-05-21 2019-09-06 四川大学 A kind of event argument abstracting method based on recurrent neural network
CN111177505A (en) * 2019-12-31 2020-05-19 中国移动通信集团江苏有限公司 Training method, recommendation method and device of index anomaly detection model
CN111738308A (en) * 2020-06-03 2020-10-02 浙江中烟工业有限责任公司 Dynamic threshold detection method for monitoring index based on clustering and semi-supervised learning
CN112966079B (en) * 2021-03-02 2022-09-30 中国电子科技集团公司第二十八研究所 Event portrait oriented text analysis method for dialog system
CN113312447B (en) * 2021-03-10 2022-07-12 天津大学 Semi-supervised log anomaly detection method based on probability label estimation
CN113723452B (en) * 2021-07-19 2024-05-28 山西三友和智慧信息技术股份有限公司 Large-scale anomaly detection system based on KPI clustering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant