CN107301118B - A kind of fault indices automatic marking method and system based on log - Google Patents
- Publication number: CN107301118B (application CN201710450900.XA)
- Authority: CN (China)
- Prior art keywords: fault, failure, log, indices, performance indicator
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/34—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
- G06F11/3447—Performance evaluation by modeling
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3003—Monitoring arrangements specially adapted to the computing system or computing system component being monitored
- G06F11/3006—Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
Abstract
The present invention relates to a log-based system and method for the automatic marking of fault indices. Failure/fault logs are filtered out of the system logs according to their event level; each failure/fault log is assigned a failure/fault category according to its message content; and the effective time window of the performance indicator data is determined for each failure/fault category. All performance indicator data corresponding to the effective time windows of each failure/fault category are then modeled to construct a fault indices model, and the model is used to mark automatically whether performance indicator data are fault indices. The invention reduces the time and effort spent on manually marking fault indices, saves human resources, lowers the workload, and helps administrators check system faults quickly and carry out fault diagnosis. It also allows estimating, from the characteristics of the indices, whether the system is in a certain failure/fault state during a given period, so that corresponding measures can be taken in time.
Description
Technical field
The present invention relates to the field of reliability in distributed environments, and in particular to a log-based method and system for the automatic marking of fault indices.
Background art
Fault diagnosis is a technique for discovering abnormal states in a system or its components. As software systems grow in scale and their logic becomes more complex, fault diagnosis becomes increasingly difficult. On the one hand, large-scale systems do not necessarily have fine-grained monitoring capability; on the other hand, because of fault-tolerance mechanisms, faults are sometimes not directly visible. Fault diagnosis techniques are used primarily to discover the deficiencies of a system.
Fault diagnosis technology is continually incorporating new computing techniques and mathematical methods, including artificial intelligence, machine learning, stochastic processes, Bayesian inference and graph theory. The main fault diagnosis techniques and their advantages and disadvantages are listed below.
Rule-based techniques perform fault diagnosis by expressing expert knowledge as a set of rules. The rules are extensible and interpretable, but the technique cannot diagnose unknown errors, and a large knowledge base is hard to maintain.
Model-based techniques define the system as a mathematical representation and test whether the observed behavior conforms to the model. They are suited to diagnosing application-level problems, but building the model requires a deep understanding of the system.
Statistical techniques diagnose faults from empirical data using correlation analysis, comparison and probability theory. They do not require a deep understanding of the system internals or a model, but non-steady-state (unexpected yet reasonable) faults are hard to diagnose, and such faults are very common in large-scale systems.
Machine learning techniques identify patterns by clustering, or use training data to determine whether the system state is healthy and to find the potential causes of a fault. They can learn system behavior automatically, but their accuracy drops quickly as the feature dimension grows.
Counting and threshold techniques can diagnose transient and intermittent errors. They depend heavily on parameter tuning; the parameters can be configured with rigorous mathematical formulas and analytical models.
Visualization techniques identify anomalies through the trends and patterns of visualized data. They support various hypotheses about the root cause of a problem, but cannot identify problems automatically.
The data sources of fault diagnosis currently include logs and indicator data. Logs are one of the main information sources: log-based fault diagnosis uses system logs, RAS logs and the like as the means of diagnosis, extracts fault features and fault rules from the logs, and then uses them for fault diagnosis and failure prediction. Indicator data are another information source: indicator-based fault diagnosis analyzes the various indicators of a running system, such as CPU utilization, memory usage, network bandwidth, memory access bandwidth, IPC (Instructions Per Clock, i.e. the number of instructions executed per CPU clock cycle) and cache miss rate, in order to discover, diagnose and predict faults and anomalies. In general, indicator-based methods find faults mainly through time-series analysis of the indicator data and by setting anomaly thresholds. However, which indicator data count as fault indices is at present mainly determined and marked manually.
In current fault diagnosis schemes based on manually marked indices, a person decides in advance which indicator ranges belong to the range of fault indices, or predefines the fault features of the indicators. In other words, the indicator data are classified (fault or normal) by manual marking, and the relevant diagnostic technique is then applied. Because manual marking of fault indices requires human participation, it is time-consuming and laborious, increases the user's workload, and is hard to automate.
The patent "A fault classification diagnosis method based on a dissimilarity index" works as follows. It first performs feature variable selection for each reference fault type, choosing the feature variables that best distinguish that fault from normal data. It then uses the feature variables to compare, pairwise, the distribution dissimilarity between the online fault data window and each reference fault data window; the fault type detected online is the reference fault type with the smallest dissimilarity index. The method diagnoses faults by matching the similarity of window data in their spatial distribution, which largely avoids the misclassification of overlapping data. Compared with the uncertainty of selecting feature variables per reference fault type in that patent, the present invention determines whether a collected log is a failure or fault log from its syslogseverity severity level, and then classifies the log by the text similarity of its message to determine its failure/fault category. It then determines the effective time window range of the fault indices for each fault category, trains on the fault indices of the same category (with the SVDD one-class classification algorithm), and constructs the fault indices model of that category.
The paper "Software fault detection research based on PU learning" shows by design and implementation that modeling with only positive and unlabeled data can reach a software fault detection rate close to that of supervised learning methods, that ensemble classifiers achieve a higher detection rate than single classifiers, and that the size of the unlabeled sample set also affects the detection rate. The paper first over-samples the positive examples in the data set with the synthetic minority over-sampling algorithm SMOTE to balance the class distribution. On this basis it constructs the positive set and the unlabeled set, and builds an ensemble of software fault decision tree classifiers with the POSC4.5 and Bagging algorithms. Compared with that paper, which is limited to modeling with positive and unlabeled data, the present invention collects data more comprehensively and marks the indicator data automatically in combination with the system logs.
When the prior art uses system indicator data for fault detection and automatic diagnosis and examines the various indicator data that different applications generate at run time, it is hard to determine which indicators correspond to a system failure or fault. This usually requires targeted analysis by manually inspecting system behavior and indicator time-series charts. Even if users can preset thresholds for abnormal or fault indices from their own experience, different indicators have different units: for example, CPU utilization and memory usage are percentages, while disk access bandwidth and network transmission bandwidth are measured in MB/s. Faced with a large number of system indicators, choosing the thresholds of fault indices is therefore also a difficult problem for users.
In view of the above problems, the present invention uses the failure/fault events in the system logs to mark fault indices automatically. After a period of training, indicator-based fault diagnosis can be carried out using only the set of fault indices obtained from training.
Summary of the invention
The purpose of the present invention is to mark fault indices automatically and thereby reduce manual marking. Specifically, it addresses how to extract the features of fault indices and how to mark fault indices automatically. The invention combines the severity level and the message information in the logs to find the performance indicators corresponding to failure/fault logs, models these performance indicators to construct fault indices models, and then marks fault indices automatically on the basis of those models.
Specifically, the invention discloses a log-based fault indices automatic marking system, comprising:
a log collection module, for collecting multiple system log entries from a distributed system or a standalone computer system;
an index collection module, for acquiring the performance indicator data in the distributed system or the standalone computer system;
a data processing module, for first filtering out failure/fault logs according to the event level of the system logs, then assigning a failure/fault category to each failure/fault log according to its message content, and finally determining the effective time window of the performance indicator data for each failure/fault category;
an off-line modeling module, for modeling all performance indicator data corresponding to the effective time windows of each failure/fault category and constructing a fault indices model;
a fault indices labeling module, for automatically marking, according to the fault indices model, whether the performance indicator data are fault indices.
In the log-based fault indices automatic marking system, the index collection module acquires the performance indicator data of the system at a fixed interval; the performance indicator data include CPU utilization, memory usage, disk read/write bandwidth, IPC, cache miss rate and so on.
In the log-based fault indices automatic marking system, the data processing module includes:
a time window division module, which queries the timestamps of the failure/fault logs having the same failure/fault category and, in combination with a preset time window threshold, determines the effective time window of each failure/fault category, and which obtains the performance indicator data within the effective time window range by searching the timestamps of the performance indicators in that window.
In the log-based fault indices automatic marking system, the off-line modeling module includes:
a summarizing module, for aggregating the performance indicator data corresponding to all effective time windows of each failure/fault category into a fault indices group;
a model construction module, for applying a one-class classification algorithm to the performance indicator data in each fault indices group to construct the fault indices model of that failure type.
In the log-based fault indices automatic marking system, the fault indices labeling module marks whether the performance indicator data are fault indices by computing the local outlier probability between the performance indicator data and the fault indices model.
The present invention also provides a log-based fault indices automatic marking method, comprising:
a log collection step: collecting multiple system log entries from a distributed system or a standalone computer system;
an index collection step: acquiring the performance indicator data in the distributed system or the standalone computer system;
a data processing step: first filtering out failure/fault logs according to the event level of the system logs, then assigning a failure/fault category to each failure/fault log according to its message content, and finally determining the effective time window of the performance indicator data for each failure/fault category;
an off-line modeling step: modeling all performance indicator data corresponding to the effective time windows of each failure/fault category and constructing a fault indices model;
a fault indices labeling step: automatically marking, according to the fault indices model, whether the performance indicator data are fault indices.
In the log-based fault indices automatic marking method, the index collection step acquires the performance indicator data of the system at a fixed interval; the performance indicator data include CPU utilization, memory usage, disk read/write bandwidth, IPC, cache miss rate and so on.
In the log-based fault indices automatic marking method, the data processing step includes a time window division step, which queries the timestamps of the failure/fault logs having the same failure/fault category and, in combination with a preset time window threshold, determines the effective time window of each failure/fault category, and which obtains the performance indicator data within the effective time window range by searching the timestamps of the performance indicators in that window.
In the log-based fault indices automatic marking method, the off-line modeling step includes:
a summarizing step: aggregating the performance indicator data corresponding to all effective time windows of each failure/fault category into a fault indices group;
a model construction step: applying a one-class classification algorithm to the performance indicator data in each fault indices group to construct the fault indices model of that failure type.
In the log-based fault indices automatic marking method, the fault indices labeling step marks whether the performance indicator data are fault indices by computing the local outlier probability between the performance indicator data and the fault indices model.
The technical advance of the present invention is that system performance indicators are marked automatically, on the basis of logs, as being fault indices or not. This reduces the time and effort of manually marking fault indices, saves time and human resources, reduces workload, and helps administrators check system faults quickly and carry out fault diagnosis. Moreover, whether the system is in a certain failure/fault state during a given period can be estimated directly from the characteristics of the indices, so that corresponding measures can be taken in time.
Brief description of the drawings
Fig. 1 is a flowchart of the log-based fault indices automatic marking method and system of the present invention;
Fig. 2 is a schematic diagram of the text similarity calculation between a failure/fault event and failure messages;
Fig. 3 is SVDD classification schematic diagram.
Detailed description of the embodiments
To make the above features and effects of the present invention clearer and easier to understand, embodiments are described in detail below with reference to the accompanying drawings.
The idea of the present invention is to collect system logs and system performance indicator data, and to find failure or fault logs according to the severity level of the system logs. Each failure/fault log is then classified using the message of the event it records, which yields the failure categories (failure types) or fault categories present in the system. Next, the effective time window range of the fault indices is determined for each failure category (failure type) or fault category. The fault indices of the same category are then used for training (for example, with the SVDD one-class classification algorithm) to construct the fault indices model of that category. Finally, a fault indices model library covering the failure categories is established, and the established library can afterwards be used directly to mark performance indicators automatically. For example, in a system that collects only performance indicators and lacks logs, the performance indicators can be compared with the fault indices models (the comparison can use the local outlier probability, but is not limited to this algorithm), and the performance indicators in a given period are then marked automatically as fault indices or not, in order to discover failures or faults of the system.
The Linux system log collected by the present invention comes from /var/log/messages, where the Linux operating system records common system and service error messages.
An example of the system log format is given below (the log format can be customized with log collection tools such as rsyslog):
The present invention is a log-based method and system for the automatic marking of fault indices. The method collects system logs and acquires indicator data from a distributed system or a standalone computer system, gathers the failure or fault logs among the system logs according to their syslogseverity severity level, and classifies them by the text similarity of their messages; the resulting classes serve as fault categories. Next, the effective time window range of the fault indices is determined for each fault category. The fault indices of the same category are then used for training (for example, with the SVDD one-class classification algorithm) to construct the fault indices model of that category, and the established fault indices models are used to mark performance indicators automatically.
As shown in Fig. 1, the fault indices automatic marking system consists of five modules: a log collection module, an index collection module, a data processing module, an off-line modeling module and a fault indices labeling module.
Log collection module: collects multiple system log entries.
Index collection module: acquires the performance indicator data (indicator data) of the system.
Data processing module: finds failure or fault logs according to the severity level (event level) of the Linux system logs; classifies each failure/fault log using the message of the event it records, so that each failure/fault log is assigned a failure category (failure type), and counts the failure or fault categories present in the system; then determines the effective time window range of the performance indicator data for each failure category (failure type) or fault category.
Off-line modeling module: models the performance indicator data corresponding to all effective time windows of the same failure type and constructs a fault indices model.
Fault indices labeling module: marks performance indicators automatically using the established fault indices model library.
The specific work of each module is described below. The data collection module comprises the log collection module and the index collection module described above. The log collection module collects Linux system logs. A system log records information about hardware, software and system problems, and can also be used to monitor the events occurring in the system. Through it, users can check the cause of an error or find the traces left by an attacker. The log collection module of the present invention can use the existing open-source tool Rsyslog or other log collection tools to collect system logs. Since a system log records the sequence of events that occurred in the system, administrators can grasp the system status at any time by checking it. For example, the format of a collected Linux system log (syslog) is shown in the table below:
Table 1 Description of the system log format
The index collection module acquires the indicator data of the system at a fixed interval (for example, 1 s). In this module, tools such as perf can be used to collect performance indicator data at the system level and at the microarchitecture level while the system is running. The indicator data include CPU utilization, memory usage, disk read/write bandwidth, IPC (Instructions Per Clock, i.e. the number of instructions executed per CPU clock cycle), cache miss rate and so on. The performance indicators may be those listed in Table 2 below, but are not limited to them.
Table 2 Detailed list of the collected indicator data
The data processing module includes a preprocessing module, which classifies the event severity level syslogseverity in the system log, in order of increasing severity, as "info, warning, error, fatal/failure, Failed" and so on.
Table 3 syslogseverity classification
It then finds the logs whose syslogseverityText field in the Linux system log contains "fatal/failure", "Failed" and the like, and marks these logs as system failure or fault logs.
Table 4 Examples of failure/fault logs in syslog
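The severity-based filtering described above can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: it assumes each log entry has already been parsed into a dictionary, and the key name `syslogseverity_text`, the set `FAILURE_LEVELS` and the function `filter_failure_logs` are hypothetical names chosen for this example.

```python
# Severity labels that mark an entry as a failure/fault log.
# Assumption: the labels follow the classification listed in the text.
FAILURE_LEVELS = {"fatal", "failure", "failed"}


def filter_failure_logs(entries):
    """Return only the entries whose severity text marks a failure or fault."""
    selected = []
    for entry in entries:
        # Compare case-insensitively, since the text lists both "fatal" and "Failed".
        severity = entry.get("syslogseverity_text", "").lower()
        if severity in FAILURE_LEVELS:
            selected.append(entry)
    return selected
```

Entries without a severity field are simply skipped, so the filter can run over partially parsed logs.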
After the system failure/fault logs have been found, each failure/fault log is classified using the message of the event it records, which yields the failure categories (failure types) present in the system. Any text similarity algorithm can be used for this classification. Here, the present invention cites a simple algorithm [1] that classifies failures by message text similarity. The procedure is as follows:
The failure type of the first failure/fault event is set to 1 and stored in the failure category table. The message of each subsequent failure/fault event is then compared, by text similarity, with the messages of the failure types already in the failure category table. The text similarity is computed as follows:
A. First, the message of each existing failure type in the failure category table is denoted $MT_i$:
$MT_i = \{ mt_{i1}, mt_{i2}, \ldots, mt_{im} \}, \quad 1 \le i \le k \qquad (1)$
where $mt_{i1}, mt_{i2}, \ldots, mt_{im}$ denote the words of the $i$-th message type among the existing failure types, $m$ is the total number of words in that message type, and $k$ is the total number of message types among the existing failure types.
B. Second, the words contained in the message of the failure/fault event are denoted $M_j$:
$M_j = \{ m_{j1}, m_{j2}, \ldots, m_{jn} \} \qquad (2)$
where $m_{j1}, m_{j2}, \ldots, m_{jn}$ denote the words of the message of the failure/fault event, and $n$ is the total number of words in that message.
C. Then the similarity $S_{ij}$ between $MT_i$ and $M_j$ is computed as the number of words of $M_j$ that also occur in $MT_i$:
$S_{ij} = \left| \{\, l : m_{jl} \in MT_i,\ 1 \le l \le n \,\} \right| \qquad (3)$
D. Finally, the similarity $S$ of the message of the failure/fault event is computed:
$S = S_{ij} / n \qquad (4)$
A similarity threshold γ is set. If the message similarity of the failure/fault event is greater than γ, the failure category of this event is an existing failure type in the failure category table, and it is set to the failure category of the message with the highest similarity value. If the similarity is less than γ, the message of this failure/fault event is added to the failure category table as a new failure type. The remaining events are processed in the same way, and the resulting classes serve as fault categories.
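The classification procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the cited algorithm's reference implementation: the tokenizer (splitting on colons and whitespace), the function names `tokenize`, `similarity` and `classify`, and the 0-based type indices are all choices made for this example. A similarity exactly equal to γ is treated as a new type, since the text only specifies the "greater than" and "less than" cases.

```python
import re


def tokenize(message):
    """Split a log message into words on colons and whitespace (an assumption)."""
    return [word for word in re.split(r"[:\s]+", message) if word]


def similarity(known_words, event_words):
    """Formula (4): share of the event's words that occur in the known message."""
    if not event_words:
        return 0.0
    known = set(known_words)
    matches = sum(1 for word in event_words if word in known)
    return matches / len(event_words)


def classify(message, category_table, gamma=0.6):
    """Return the index of the failure type assigned to `message`.

    `category_table` is a list of word lists, one per known failure type.
    It is extended in place when no known type is similar enough.
    """
    words = tokenize(message)
    best_index, best_score = None, 0.0
    for index, known_words in enumerate(category_table):
        score = similarity(known_words, words)
        if score > best_score:
            best_index, best_score = index, score
    if best_index is not None and best_score > gamma:
        return best_index
    # No sufficiently similar type: register the message as a new failure type.
    category_table.append(words)
    return len(category_table) - 1
```

Starting from an empty `category_table`, the first message always becomes type 0, which mirrors the rule that the first failure/fault event defines the first failure type.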
A specific example further illustrates the data processing module. As shown in Fig. 2, the message of the known failure/fault event is "fatal: qmgr_active_feed: 2A21B3FB015: rename from deferred to active: Read-only file system".
A. The failure type list contains two existing failure types, $MT_1$ and $MT_2$:
$MT_1$ = {fatal, qmgr_active_feed, C0089459A93, rename, from, deferred, to, active, Read-only, file, system} (m = 11);
$MT_2$ = {Device, /dev/sdd, FAILED, SMART, self-check., BACK, UP, DATA, NOW} (m = 9).
B. The set of words contained in the message of the failure/fault event is
$M_j$ = {fatal, qmgr_active_feed, 2A21B3FB015, rename, from, deferred, to, active, Read-only, file, system} (n = 11).
C. Because 10 words satisfy $m_{jl} \in MT_1$, $S_{1j} = 10$, so the similarity between $MT_1$ and $M_j$ is $S = S_{1j} / n = 10/11 \approx 0.91$.
D. The similarity threshold is set to γ = 0.6. Since the message similarity 0.91 of the failure/fault event is greater than the threshold 0.6, the failure category of this event is the existing failure type "1" in the failure category table.
The data processing module further includes a time window division module, which queries the effective time window of each failure type in the system log and then associates the start and end timestamps of the effective time window of a failure event type with the timestamps of the collected indicator data.
1. Querying the effective time window of a failure type: the timestamps of the failure/fault logs having the same failure/fault category are queried to determine the effective time window of each failure/fault category, and the timestamps of the performance indicators in that window are searched to obtain the performance indicator data within the effective time window range. For example, the timestamp of one failure event of the failure type of interest is determined first. The time dimension is then searched upward and downward for events of the same failure type; if any are found, the time window is extended upward or downward. Alternatively, the similarity of the log messages is searched upward and downward, and the timestamps of the log messages with similar content define the time window. A time window threshold Th_tw can be set here: if the time window searched upward and downward on the time dimension exceeds Th_tw, the timestamp query for this failure type ends, and the span from the failure event with the earliest timestamp to the failure event with the latest timestamp is taken as the effective time window of the failure type.
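The window search above can be sketched in Python as follows. This is one possible reading, stated as an assumption: the sketch treats Th_tw as the largest allowed gap between consecutive events of the same failure type, and stops extending the window in a direction as soon as that gap is exceeded. The function name `effective_window` and its parameters are hypothetical.

```python
from datetime import datetime, timedelta


def effective_window(event_times, seed, th_tw):
    """Grow a time window around `seed` over events of one failure type.

    `event_times` are the timestamps of all events of that type, `seed` is
    the timestamp of the event of interest, and `th_tw` is the assumed
    maximum gap (a timedelta) between consecutive events in the window.
    """
    earlier = sorted((t for t in event_times if t < seed), reverse=True)
    later = sorted(t for t in event_times if t > seed)
    start = end = seed
    # Search upward (toward earlier timestamps).
    for stamp in earlier:
        if start - stamp > th_tw:
            break
        start = stamp
    # Search downward (toward later timestamps).
    for stamp in later:
        if stamp - end > th_tw:
            break
        end = stamp
    return start, end
```

If Th_tw is instead meant as a cap on the total search distance from the seed event, the two comparisons would be made against `seed` rather than against the current window edge.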
As shown in Table 5, the three log messages of the three failure events (syslogseverity "fatal") whose failure type is 1 are quite similar. The effective time window of this failure type is queried as follows: if the upward search finds a similar event of failure type 1 within 10 min before the timestamp "2008-10-29 03:44:01" (here the threshold Th_tw is set to 10 min), and the downward search finds a similar event of failure type 1 within 10 min after the timestamp "2008-10-29 03:44:01" (again with Th_tw set to 10 min), the effective time window of this failure type can be determined to be 2008-10-29 02:10:29 to 2008-10-29 03:44:01. The effective time windows of the other failure types are obtained in the same way.
Table 5 Examples of the timestamps of failure/fault events in syslog
2. Next, the start and end timestamps of the effective time window of the failure type are associated with the timestamps of the collected indicator data to determine the time window range of the indicator data. Taking Table 6 below as an example, the previous step found the effective time window of failure type 1 to be 2008-10-29 02:10:29 to 2008-10-29 03:44:01. The start and end timestamps of this window are then used to search the timestamps of the collected performance indicators, which yields the performance indicator data within the effective time window range. If a new effective time window of this failure type is found later, the corresponding performance indicator data are obtained in the same way.
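The association step above amounts to selecting the indicator samples whose timestamps fall inside the effective time window. A minimal Python sketch follows; it assumes the samples are already sorted by timestamp and stored as `(timestamp, values)` pairs, and the function name `metrics_in_window` is hypothetical. The timestamps may be any mutually comparable type, such as epoch seconds or `datetime` objects.

```python
from bisect import bisect_left, bisect_right


def metrics_in_window(samples, start, end):
    """Return the samples with start <= timestamp <= end.

    `samples` is a list of (timestamp, values) pairs sorted by timestamp.
    """
    stamps = [stamp for stamp, _ in samples]
    first = bisect_left(stamps, start)   # first sample not before `start`
    last = bisect_right(stamps, end)     # one past the last sample not after `end`
    return samples[first:last]
```

Binary search keeps each window lookup logarithmic in the number of samples, which matters when indicators are sampled every second.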
Table 6 Example of associating the timestamps of a failure type's effective time window with the indicator data
Off-line modeling module is responsible for off-line training, constructs fault indices model, comprising: summarizing module, for every class to be lost
Performance indicator data summarization corresponding to all effective time windows of effect/fault category is fault indices group;Model construction mould
Block, for using a kind of target classification algorithm, constructing the failure class to the performance indicator data in each fault indices group
The fault indices model of type.
Specifically, the performance indicator data corresponding to all effective time windows of each
failure type are collected and aggregated into a fault indices group (failure_metrics_group). Then, for
the failure_metrics_group of each failure type, a one-class classification algorithm is applied to the
indicator data points contained in the group to construct the fault indices model of that failure type.
For example, the SVDD algorithm can be used to train an SVDD model of the failure_group as its fault
indices model, where the indicator data points inside the SVDD hypersphere are taken as the fault
indices of the failure type and the indicator data points outside the SVDD hypersphere are taken as
outlier index points.
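The aggregation and the inside/outside test can be illustrated with a deliberately simplified stand-in. The sketch below does not solve the SVDD optimization: it takes the centroid of the group as the center a and the largest distance to the centroid as the radius R, which only mimics the hypersphere idea. The names `build_groups`, `fit_sphere` and `inside`, and the sample vectors, are hypothetical.

```python
import math
from collections import defaultdict


def build_groups(window_samples):
    """Aggregate metric vectors of all windows per failure type.

    window_samples: iterable of (failure_type, [metric vectors]) pairs,
    one pair per effective time window.
    """
    groups = defaultdict(list)
    for failure_type, vectors in window_samples:
        groups[failure_type].extend(vectors)
    return dict(groups)


def fit_sphere(points):
    """Centroid-based stand-in for SVDD (NOT the real SVDD solver).

    Returns (center, radius): center is the mean vector, radius the
    largest distance from the center to any training point.
    """
    dim = len(points[0])
    center = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    radius = max(math.dist(p, center) for p in points)
    return center, radius


def inside(model, x):
    """True when x lies inside or on the sphere of the model."""
    center, radius = model
    return math.dist(x, center) <= radius


# Hypothetical (cpu_util, mem_util) vectors from two windows of one failure type.
window_samples = [
    ("type_1", [[0.90, 0.80], [0.95, 0.85]]),
    ("type_1", [[0.92, 0.78]]),
]
groups = build_groups(window_samples)
models = {ftype: fit_sphere(pts) for ftype, pts in groups.items()}
is_fault = inside(models["type_1"], [0.93, 0.81])
is_outlier = not inside(models["type_1"], [0.10, 0.20])
```

A real implementation would replace `fit_sphere` with an SVDD or one-class SVM solver; the surrounding grouping logic would stay the same.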
Principle and implementation of the SVDD algorithm. As shown in Fig. 2, the principle of SVDD (support
vector domain description) is similar to that of SVM, and it can be regarded as a one-class SVM. The
optimization objective of SVDD is to find the smallest sphere with center a and radius R. The basic
idea of the algorithm is that, since there is only one class, a smallest hypersphere is trained that
encloses all of the data (a hypersphere is the generalization of the sphere to spaces of more than 3
dimensions; the corresponding shape is a closed curve in 2-dimensional space and a sphere in
3-dimensional space). When a new data point is to be identified, it belongs to the class if it falls
inside the hypersphere and does not belong to it otherwise.
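For reference, the soft-margin SVDD problem as it is usually stated in the literature can be written as follows. The patent text does not spell this formulation out, so the slack variables \xi_i, the trade-off parameter C and the training points x_1, ..., x_n are notation assumed here; a and R are the center and radius named above.

```latex
\min_{R,\,a,\,\xi} \; R^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad \|x_i - a\|^2 \le R^2 + \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, \dots, n

% A new point z is accepted as belonging to the class when
\|z - a\|^2 \le R^2
```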
Based on the above method, a fault indices model can be constructed for every failure type that has
been found, and together these models form a fault indices model library. The library can be
continuously updated and modified as new log information is added.
Fault indices marking module. The previous step has established a fault indices model library for the
existing failure_group; in subsequent work, the performance indicators can be marked automatically by
using this library directly. For example, in a system that only collects performance indicators and
lacks logs, the system's performance indicator data can be compared with the fault indices models (the
local outlier probability can be used for this comparison, although the method is not limited to this
algorithm), and the performance indicators in a given time period are then automatically marked as
fault indices or not, so that failures or faults of the system can be discovered.
Specifically, the fault indices model library contains the SVDD models of multiple failure_group. When
new indicator data are obtained, the local outlier algorithm can be applied to them to calculate the
outlier probability of the new data with respect to the center point a of every SVDD model, and a
threshold outlier_probability_threshold is set for the outlier probability. If the outlier probability
with respect to the center point a of some of the SVDD models is less than
outlier_probability_threshold, the model with the smallest outlier probability is selected and the new
data are automatically marked as belonging to the fault indices of that SVDD model. If the outlier
probability with respect to the center point a of every SVDD model is not less than
outlier_probability_threshold, the indicator data do not belong to any SVDD model in the fault indices
model library.
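The selection rule just described can be sketched as a small function. The threshold value 0.5, the function name `label_new_data` and the model names are assumptions for illustration; the text names the threshold outlier_probability_threshold but does not give a value.

```python
OUTLIER_PROBABILITY_THRESHOLD = 0.5  # assumed value; not specified in the text


def label_new_data(outlier_probs, threshold=OUTLIER_PROBABILITY_THRESHOLD):
    """Pick the fault indices model that new indicator data belong to.

    outlier_probs: dict mapping model name -> outlier probability of the
    new data with respect to that model.
    Returns the name of the model with the smallest outlier probability
    among those below the threshold, or None when no model qualifies.
    """
    candidates = {name: p for name, p in outlier_probs.items() if p < threshold}
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


# Hypothetical outlier probabilities for two new data points.
matched = label_new_data({"type_1": 0.2, "type_2": 0.4, "type_3": 0.9})
unmatched = label_new_data({"type_1": 0.7, "type_2": 0.5})
```

The comparison is strict, so a probability equal to the threshold counts as "not less than" it and the data point stays unmarked.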
The principle and implementation of the local outlier probability (LOOP) algorithm are described here
as an example.
The LOOP algorithm is an unsupervised data mining method that was first applied in the field of outlier
detection. The magnitude of the LOOP value indicates how probable it is that a sample is an outlier.
First, a collected performance indicator of the system is denoted xi, and the set of all center points
of the SVDD models, all_set(a), is obtained. Second, the probabilistic set distance from xi to the
center points is calculated based on this set, and the distribution density of the indicators around xi
is estimated and defined as the probabilistic local outlier factor. Finally, the standard deviation of
the outlier factor is calculated, and the Gauss error function is used to calculate the local outlier
probability loop(xi). The value of loop(xi) lies in [0, 1]; the larger loop(xi) is, the more probable
it is that xi is an outlier index point.
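The three stages can be sketched with the local outlier probability as it is commonly defined in the literature, computed over a plain set of points with k nearest neighbours. How the patent combines the computation with the center set all_set(a) is not detailed, so this sketch does not model that part; the neighbourhood size `k`, the scaling factor `lam` and the sample points are assumptions.

```python
import math


def knn(points, i, k):
    """Indices of the k nearest neighbours of points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]


def loop_scores(points, k=3, lam=3.0):
    """Local outlier probability in [0, 1] for every point."""
    n = len(points)
    neigh = [knn(points, i, k) for i in range(n)]
    # Probabilistic set distance: lam * sqrt(mean squared distance to neighbours).
    pdist = [
        lam * math.sqrt(sum(math.dist(points[i], points[j]) ** 2 for j in neigh[i]) / k)
        for i in range(n)
    ]
    # Probabilistic local outlier factor: own distance vs. neighbours' distances.
    plof = []
    for i in range(n):
        mean_neigh = sum(pdist[j] for j in neigh[i]) / k
        plof.append(pdist[i] / mean_neigh - 1.0 if mean_neigh > 0 else 0.0)
    # Normalisation: lam times the root mean square of the factors.
    nplof = lam * math.sqrt(sum(v * v for v in plof) / n)
    if nplof == 0.0:
        return [0.0] * n
    # The Gauss error function maps the normalised factor to a probability.
    return [max(0.0, math.erf(v / (nplof * math.sqrt(2.0)))) for v in plof]


# Hypothetical 2-D indicator points: a tight cluster plus one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5), (10.0, 10.0)]
scores = loop_scores(points)
```

In this toy data set the distant point receives a clearly larger score than the clustered points, in line with the statement that a larger loop(xi) means a more probable outlier.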
The following is a method embodiment corresponding to the above system embodiment; this embodiment can
be implemented in cooperation with the above embodiment. The relevant technical details mentioned in
the above embodiment remain valid in this embodiment and, to reduce repetition, are not described again
here. Correspondingly, the relevant technical details mentioned in this embodiment also apply to the
above embodiment.
The present invention also provides a log-based automatic fault indices marking method, comprising:
a log collection step of collecting multiple system logs from a distributed system or a standalone
computer system;
an index collection step of collecting performance indicator data in the distributed system or the
standalone computer system;
a data processing step of first filtering out failure/fault logs according to the event level of the
system logs, then assigning a failure/fault category to each failure/fault log according to its
information content, and finally determining the effective time window of the performance indicator
data according to each failure/fault category;
an offline modeling step of modeling all the performance indicator data corresponding to the effective
time windows of each failure/fault category to construct a fault indices model;
a fault indices marking step of automatically marking, according to the fault indices model, whether
the performance indicator data are fault indices.
In the log-based automatic fault indices marking method, the index collection step collects the
performance indicator data of the system at regular time intervals, the performance indicator data
including CPU utilization, memory utilization, disk read/write bandwidth, IPC and cache miss rate.
In the log-based automatic fault indices marking method, the data processing step comprises a time
window division step of determining the effective time window of each failure/fault category by
querying the timestamps of failure/fault logs having a similar failure/fault category in combination
with a preset time window threshold, and obtaining the performance indicator data within the effective
time window by looking up the timestamps of the performance indicators in that window.
In the log-based automatic fault indices marking method, the offline modeling step comprises:
an aggregation step of aggregating the performance indicator data corresponding to all effective time
windows of each failure/fault category into a fault indices group;
a model construction step of applying a one-class classification algorithm to the performance indicator
data in each fault indices group to construct the fault indices model of that failure type.
In the log-based automatic fault indices marking method, the fault indices marking step marks whether
the performance indicator data are fault indices by calculating the local outlier probability between
the performance indicator data and the fault indices model.
Although the present invention is disclosed by the above embodiments, the specific embodiments are only
used to explain the invention and not to limit it. Any person skilled in the art may make changes and
refinements without departing from the spirit and scope of the invention, so the scope of protection of
the invention is defined by the claims.
Claims (8)
1. A log-based automatic fault indices marking system, characterized by comprising:
a log collection module, configured to collect multiple system logs from a distributed system or a
standalone computer system;
an index collection module, configured to collect performance indicator data in the distributed system
or the standalone computer system;
a data processing module, configured to first filter out failure/fault logs according to the event
level of the system logs, then assign a failure/fault category to each failure/fault log according to
the information content of that log, and finally determine the effective time window of the performance
indicator data according to each failure/fault category;
an offline modeling module, configured to model all the performance indicator data corresponding to the
effective time windows of each failure/fault category and to construct a fault indices model;
a fault indices marking module, configured to automatically mark, according to the fault indices model,
whether the performance indicator data are fault indices;
wherein the data processing module comprises:
a time window division module, configured to determine the effective time window of each failure/fault
category by querying the timestamps of failure/fault logs having a similar failure/fault category in
combination with a preset time window threshold, and to obtain the performance indicator data within
the effective time window by looking up the timestamps of the performance indicators in that window.
2. The log-based automatic fault indices marking system according to claim 1, characterized in that the
index collection module collects the performance indicator data of the system at regular time
intervals, the performance indicator data including CPU utilization, memory utilization, disk
read/write bandwidth, IPC and cache miss rate.
3. The log-based automatic fault indices marking system according to claim 1, characterized in that the
offline modeling module comprises:
a summarizing module, configured to aggregate the performance indicator data corresponding to all
effective time windows of each failure/fault category into a fault indices group;
a model construction module, configured to apply a one-class classification algorithm to the
performance indicator data in each fault indices group to construct the fault indices model of that
failure/fault category.
4. The log-based automatic fault indices marking system according to claim 1, characterized in that the
fault indices marking module marks whether the performance indicator data are fault indices by
calculating the local outlier probability between the performance indicator data and the fault indices
model.
5. A log-based automatic fault indices marking method, characterized by comprising:
a log collection step of collecting multiple system logs from a distributed system or a standalone
computer system;
an index collection step of collecting performance indicator data in the distributed system or the
standalone computer system;
a data processing step of first filtering out failure/fault logs according to the event level of the
system logs, then assigning a failure/fault category to each failure/fault log according to the
information content of that log, and finally determining the effective time window of the performance
indicator data according to each failure/fault category;
an offline modeling step of modeling all the performance indicator data corresponding to the effective
time windows of each failure/fault category to construct a fault indices model;
a fault indices marking step of automatically marking, according to the fault indices model, whether
the performance indicator data are fault indices;
wherein the data processing step comprises a time window division step of determining the effective
time window of each failure/fault category by querying the timestamps of failure/fault logs having a
similar failure/fault category in combination with a preset time window threshold, and obtaining the
performance indicator data within the effective time window by looking up the timestamps of the
performance indicators in that window.
6. The log-based automatic fault indices marking method according to claim 5, characterized in that the
index collection step collects the performance indicator data of the system at regular time intervals,
the performance indicator data including CPU utilization, memory utilization, disk read/write
bandwidth, IPC and cache miss rate.
7. The log-based automatic fault indices marking method according to claim 5, characterized in that the
offline modeling step comprises:
an aggregation step of aggregating the performance indicator data corresponding to all effective time
windows of each failure/fault category into a fault indices group;
a model construction step of applying a one-class classification algorithm to the performance indicator
data in each fault indices group to construct the fault indices model of that failure type.
8. The log-based automatic fault indices marking method according to claim 5, characterized in that the
fault indices marking step marks whether the performance indicator data are fault indices by
calculating the local outlier probability between the performance indicator data and the fault indices
model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710450900.XA CN107301118B (en) | 2017-06-15 | 2017-06-15 | A kind of fault indices automatic marking method and system based on log |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710450900.XA CN107301118B (en) | 2017-06-15 | 2017-06-15 | A kind of fault indices automatic marking method and system based on log |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107301118A CN107301118A (en) | 2017-10-27 |
CN107301118B true CN107301118B (en) | 2019-11-19 |
Family
ID=60134736
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710450900.XA Active CN107301118B (en) | 2017-06-15 | 2017-06-15 | A kind of fault indices automatic marking method and system based on log |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107301118B (en) |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108280021A (en) * | 2018-01-25 | 2018-07-13 | 郑州云海信息技术有限公司 | A kind of logging level analysis method based on machine learning |
CN110245012A (en) * | 2018-03-08 | 2019-09-17 | 中国移动通信集团广东有限公司 | A kind of loose type virtualization resource dispatching method and system |
CN108848512B (en) * | 2018-05-30 | 2021-04-30 | 江南大学 | SVDD wireless sensor network outlier data detection method |
CN111367747B (en) * | 2018-12-25 | 2023-07-04 | 中国移动通信集团浙江有限公司 | Index abnormal detection early warning device based on time annotation |
CN109918313B (en) * | 2019-03-29 | 2021-04-02 | 武汉大学 | GBDT decision tree-based SaaS software performance fault diagnosis method |
CN110691070B (en) * | 2019-09-07 | 2022-02-11 | 温州医科大学 | Network abnormity early warning method based on log analysis |
CN111309572B (en) * | 2020-02-13 | 2021-05-04 | 上海复深蓝软件股份有限公司 | Test analysis method and device, computer equipment and storage medium |
CN113535759A (en) * | 2020-04-14 | 2021-10-22 | 中国移动通信集团上海有限公司 | Data labeling method, device, equipment and medium |
CN116724296A (en) * | 2021-10-26 | 2023-09-08 | 微软技术许可有限责任公司 | Performing hardware fault detection based on multimodal feature fusion |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101888309A (en) * | 2010-06-30 | 2010-11-17 | 中国科学院计算技术研究所 | Online log analysis method |
CN103761173A (en) * | 2013-12-28 | 2014-04-30 | 华中科技大学 | Log based computer system fault diagnosis method and device |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9734005B2 (en) * | 2014-10-31 | 2017-08-15 | International Business Machines Corporation | Log analytics for problem diagnosis |
Non-Patent Citations (2)
Title |
---|
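Research on fault detection technology for distributed software systems based on statistical monitoring in cloud environments; Wang Tao; Chinese Journal of Computers (《计算机学报》); 20170228; Vol. 40 (No. 02); full text * |
Fault localization system for virtualized environments based on log analysis; Tian Fei et al.; Computer Systems & Applications (《计算机系统应用》); 20141231; Vol. 23 (No. 11); pp. 1-7 * |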
Also Published As
Publication number | Publication date |
---|---|
CN107301118A (en) | 2017-10-27 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||