CN107301118B - A kind of fault indices automatic marking method and system based on log - Google Patents

Info

Publication number
CN107301118B
Authority
CN
China
Prior art keywords
fault
failure
log
indices
performance indicator
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710450900.XA
Other languages
Chinese (zh)
Other versions
CN107301118A (en)
Inventor
任睿
殷岩
程杰超
詹剑锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3447 Performance evaluation by modeling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems

Abstract

The present invention relates to a log-based fault indices automatic marking system and method, comprising: filtering out failure/fault logs according to the event level of the system logs; assigning a failure/fault category to each failure/fault log according to its information content; determining the effective time window of the performance indicator data for each failure/fault category; modeling all performance indicator data within the effective time windows of each failure/fault category to build a fault indices model; and automatically marking, according to the fault indices model, whether performance indicator data are fault indices. The invention reduces the time and effort spent on manually marking fault indices, saves time and human resources, reduces workload, and helps administrators quickly check for system failures and carry out fault diagnosis. It can also estimate, from the characteristics of the indicators, whether the system is in a particular failure/fault state during a given period, so that corresponding measures can be taken in time.

Description

A log-based method and system for automatic marking of fault indices
Technical field
The present invention relates to the field of reliability in distributed environments, and in particular to a log-based method and system for automatic marking of fault indices.
Background art
Fault diagnosis is a technique for finding abnormal states in a system or its components. As software systems keep growing in scale and their logic becomes more complex, fault diagnosis becomes increasingly difficult. On the one hand, large-scale systems do not necessarily have fine-grained monitoring capabilities; on the other hand, because of fault-tolerance mechanisms, failures are sometimes not directly visible. Fault diagnosis techniques are used primarily to discover deficiencies in a system.
At present, fault diagnosis is steadily incorporating new computing techniques and mathematical methods, including artificial intelligence, machine learning, stochastic processes, Bayesian inference, and graph theory. The main fault diagnosis techniques and their advantages and disadvantages are as follows.
Rule-based techniques: these diagnose faults by expressing expert knowledge as a series of rules. The rules are extensible and interpretable by people, but the technique cannot diagnose unknown errors, and a large knowledge base is hard to maintain.
Model-based techniques: these define the system as a mathematical representation and verify whether the behavior observed in tests conforms to the model. They are suited to diagnosing application-level problems, but building the model requires a deep understanding of the system.
Statistical techniques: these diagnose faults from empirical data using theories such as association analysis, comparison, and probability. They do not require a deep understanding of the system internals or of a model, but unsteady-state (unexpected yet legitimate) failures are hard to diagnose, and such failures are very common in large-scale systems.
Machine learning techniques: these recognize patterns by clustering, or use training data to decide whether the system state is healthy and to find the potential cause of a fault. They can learn system behavior automatically, but their accuracy drops rapidly as the feature dimension grows.
Counting and threshold techniques: these can diagnose transient and intermittent errors. They depend largely on the calibration of parameters, which can be configured through rigorous mathematical formulas and analytical models.
Visualization techniques: these identify anomalies from the trends and patterns of visualized data. They can support various hypotheses about the root cause of a problem, but they cannot identify problems automatically.
The data sources currently used for fault diagnosis include the following. Log-based fault diagnosis: logs are one of the main information sources for fault diagnosis. Log-based fault diagnosis uses system logs, RAS logs, and the like as the means of diagnosis, extracting fault features and failure rules from the logs and then using them for fault diagnosis and failure prediction. Indicator-based fault diagnosis: indicator data are another information source. Indicator-based fault diagnosis analyzes the various indicators of a running system, such as CPU utilization, memory usage, network bandwidth, memory bandwidth, IPC (Instructions Per Clock, i.e., the number of instructions the CPU executes per clock cycle), and cache miss rate, in order to discover, diagnose, and predict failures and anomalies. In general, indicator-based methods find faults mainly through time-series analysis of the indicator data and by setting anomaly thresholds. However, which indicator data count as fault indices is currently determined and marked mainly by hand.
Current fault diagnosis schemes that rely on manually marked indicators require people to decide in advance which indicator ranges belong to fault indices, or to predefine the fault features of the indicators. That is, indicator data are classified (fault or normal) by manual marking, and the relevant diagnostic techniques are then applied. Because manual marking of fault indices requires human participation, it is time-consuming and laborious, increases the user's workload, and is difficult to automate.
The patent "A fault classification diagnosis method based on a non-similarity index" works as follows. First, feature variables are selected for each reference fault type, choosing the variables that best distinguish that fault from normal data. Then, using these feature variables, the distributional dissimilarity between the online fault data window and each reference fault data window is compared pairwise; the fault type detected online is the reference fault type with the smallest non-similarity index. This method performs fault diagnosis by matching the similarity of window data in their spatial distribution, which largely avoids misclassifying overlapping data. Compared with the uncertainty of selecting feature variables for reference fault types in the above patent, the present invention uses the syslogseverity level of the collected logs to determine whether a log is a failure or fault log, and then classifies it by the text similarity of its message to determine its failure/fault category. It then determines the effective time window of the fault indices for each failure class, trains on the fault indices of the same class (with the SVDD one-class classification algorithm), and builds the fault indices model for that class.
The paper "Research on software fault detection based on PU learning" shows that modeling with only positive examples and unlabeled data can achieve a software fault detection rate close to that of supervised learning methods, that ensemble classifiers achieve a higher detection rate than single classifiers, and that the size of the unlabeled sample set also affects the detection rate. It first applies the synthetic minority over-sampling algorithm SMOTE to over-sample the positive examples in the data set and balance its class distribution. On this basis it constructs the positive example set and the unlabeled set, and builds an ensemble of software fault decision tree classifiers using the POSC4.5 and Bagging algorithms. Compared with that paper, which is limited to modeling with positive examples and unlabeled data, the present invention collects data more comprehensively and marks indicator data automatically in combination with system logs.
When the prior art uses system indicator data for fault detection and automatic diagnosis, it is hard to determine which of the indicator data produced by different applications at run time correspond to system failures or faults. This usually requires targeted manual analysis of system behavior and indicator time-series charts. Even if users can set thresholds for abnormal or fault indicators based on their own experience, different indicators have different units: for example, CPU utilization and memory usage are percentages, while disk access bandwidth and network transmission bandwidth are measured in MB/s. Faced with numerous system indicators, choosing thresholds for fault indices is therefore also a problem for users.
In view of the above problems, the present invention uses the failure/fault events in system logs to mark fault indices automatically. After a period of training, fault diagnosis can be carried out based on indicators alone, using only the fault indices set obtained from training.
Summary of the invention
The purpose of the present invention is to mark fault indices automatically and thereby reduce manual marking. Specifically, it addresses how to extract the features of fault indices and mark them automatically. The invention combines the severity level and the information in logs to find the performance indicators corresponding to failure/fault logs, models these performance indicators to build a fault indices model, and then uses the model to mark fault indices automatically.
Specifically, the invention discloses a kind of fault indices automatic marking system based on log, including:
a log collection module, for collecting multiple system log entries from a distributed system or a standalone computer system;
an index collection module, for collecting performance indicator data in the distributed system or the standalone computer system;
a data processing module, for first filtering out failure/fault logs according to the event level of the system logs, then assigning a failure/fault category to each failure/fault log according to its information content, and finally determining the effective time window of the performance indicator data for each failure/fault category;
an off-line modeling module, for modeling all performance indicator data corresponding to the effective time windows of each failure/fault category, to build a fault indices model;
a fault indices labeling module, for automatically marking, according to the fault indices model, whether the performance indicator data are fault indices.
In the log-based fault indices automatic marking system, the index collection module collects performance indicator data in the system at fixed time intervals; the performance indicator data include CPU utilization, memory usage, disk read/write bandwidth, IPC, cache miss rate, and so on.
In the log-based fault indices automatic marking system, the data processing module includes:
a time window division module, which queries the timestamps of the failure/fault logs of the same failure/fault category and, combined with a preset time window threshold, determines the effective time window of each failure/fault category, and which looks up the timestamps of the performance indicators within the effective time window to obtain the performance indicator data within that window.
In the log-based fault indices automatic marking system, the off-line modeling module includes:
a summarizing module, for aggregating the performance indicator data corresponding to all effective time windows of each failure/fault category into a fault indices group;
a model construction module, for building the fault indices model of the failure type by applying a one-class classification algorithm to the performance indicator data in each fault indices group.
In the log-based fault indices automatic marking system, the fault indices labeling module marks whether the performance indicator data are fault indices by computing the local outlier probability between the performance indicator data and the fault indices model.
The present invention also provides a kind of fault indices automatic marking method based on log, including:
a log collection step of collecting multiple system log entries from a distributed system or a standalone computer system;
an index collection step of collecting performance indicator data in the distributed system or the standalone computer system;
a data processing step of first filtering out failure/fault logs according to the event level of the system logs, then assigning a failure/fault category to each failure/fault log according to its information content, and finally determining the effective time window of the performance indicator data for each failure/fault category;
an off-line modeling step of modeling all performance indicator data corresponding to the effective time windows of each failure/fault category, to build a fault indices model;
a fault indices annotation step of automatically marking, according to the fault indices model, whether the performance indicator data are fault indices.
In the log-based fault indices automatic marking method, the index collection step collects performance indicator data in the system at fixed time intervals; the performance indicator data include CPU utilization, memory usage, disk read/write bandwidth, IPC, cache miss rate, and so on.
In the log-based fault indices automatic marking method, the data processing step includes a time window division step, which queries the timestamps of the failure/fault logs of the same failure/fault category and, combined with a preset time window threshold, determines the effective time window of each failure/fault category, and which looks up the timestamps of the performance indicators within the effective time window to obtain the performance indicator data within that window.
In the log-based fault indices automatic marking method, the off-line modeling step includes:
an aggregation step of aggregating the performance indicator data corresponding to all effective time windows of each failure/fault category into a fault indices group;
a model construction step of building the fault indices model of the failure type by applying a one-class classification algorithm to the performance indicator data in each fault indices group.
In the log-based fault indices automatic marking method, the fault indices annotation step marks whether the performance indicator data are fault indices by computing the local outlier probability between the performance indicator data and the fault indices model.
The technical progress of the invention is that automatically marking, based on logs, whether system performance indicators are fault indices not only reduces the time and effort of manually marking fault indices, saving time and human resources and reducing workload, but also helps administrators quickly check for system failures and carry out fault diagnosis. Moreover, the characteristics of the indicators can be used to estimate directly whether the system is in a certain failure/fault state during a given period, so that corresponding measures can be taken in time.
Brief description of the drawings
Fig. 1 is a flow chart of the log-based fault indices automatic marking method and system of the present invention;
Fig. 2 is a schematic diagram of computing the text similarity between a failure/fault event and a failure message;
Fig. 3 is a schematic diagram of SVDD classification.
Detailed description of embodiments
To make the above features and effects of the present invention clearer and easier to understand, embodiments are described in detail below with reference to the accompanying drawings.
The idea of the present invention is to collect system logs and system performance indicator data and to find failure or fault logs according to the severity level of the system logs. The message of each event in the failure/fault logs is then used to classify the logs, yielding the fail categories (failure types) or fault categories present in the system. Next, the effective time window of the fault indices is determined for each fail category (failure type) or fault category. The fault indices of the same class are then used for training (for example, with the SVDD one-class classification algorithm) to build the fault indices model of that class. Finally, once a fault indices model library has been built for the fail categories, it can be used directly to mark performance indicators automatically. For example, in a system that collects only performance indicators and lacks logs, the system's performance indicators can be compared with the fault indices models (the comparison can use the local outlier probability, but is not limited to this algorithm), and the performance indicators in a given period are then marked automatically as fault indices or not, in order to discover system failures or faults.
The Linux system logs collected by the present invention come from /var/log/messages, which records common system and service error messages of the Linux operating system.
An example of the system log format is given below (the log format can be customized through log collection tools such as rsyslog):
The present invention is a log-based method and system for automatic marking of fault indices. The method collects system logs and indicator data from a distributed system or a standalone computer system, uses the syslogseverity level of the logs to pick out the failure or fault logs among the collected system logs, and classifies them by the text similarity of their messages; the resulting classes serve as the fault categories. Next, the effective time window of the fault indices is determined for each failure class. The fault indices of the same class are then used for training (for example, with the SVDD one-class classification algorithm) to build the fault indices model of that class, and the established fault indices models are used to mark performance indicators automatically.
As shown in Fig. 1, the fault indices automatic marking system consists of five modules: a log collection module, an index collection module, a data processing module, an off-line modeling module, and a fault indices labeling module.
Log collection module: collects multiple system log entries.
Index collection module: collects the performance indicator data (indicator data) in the system.
Data processing module: finds failure or fault logs according to the severity level (event level) of the Linux system logs; classifies the failure/fault logs by the message of each event so as to assign a fail category (failure type) to every failure/fault log and to count the fail categories or fault categories present in the system; and then determines the effective time window of the performance indicator data for each fail category (failure type) or fault category.
Off-line modeling module: models the performance indicator data corresponding to all effective time windows of the same failure type to build the fault indices model.
Fault indices labeling module: marks performance indicators automatically using the established fault indices model library.
The work of each module is described below. The data collection module comprises the log collection module and the index collection module described above. The log collection module collects Linux system logs. System logs record information about hardware, software, and system problems, and can also be used to monitor events occurring in the system. Users can use them to find the cause of an error, or to look for traces left by an attacker when the system is attacked. The log collection module of the present invention can use the existing open-source tool Rsyslog or other log collection tools to collect system logs. System logs record the sequence of events occurring in the system, and administrators can check them at any time to understand the system's condition. For example, the format of the collected Linux system log (syslog) is shown in the following table:
Table 1: Description of the system log format
The index collection module collects the indicator data in the system at fixed time intervals (1 s). It can use tools such as perf to collect the performance indicator data of the running system at both the system level and the microarchitecture level. The indicator data include CPU utilization, memory usage, disk read/write bandwidth, IPC (Instructions Per Clock, i.e., the number of instructions the CPU executes per clock cycle), cache miss rate, and other indicators. As shown in Table 2 below, the performance indicators can be those listed in the table, but are not limited to them.
Table 2: Detailed list of the collected indicator data
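Two of the indicators named above, IPC and cache miss rate, are ratios of raw hardware counters. The following is a minimal sketch of that arithmetic, assuming the counters for one sampling interval have already been read (for example, by a tool such as perf). The class, function, and field names are illustrative and do not come from the patent.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """Raw counters for one sampling interval (hypothetical field names)."""
    timestamp: float       # seconds since the epoch
    instructions: int      # instructions executed in the interval
    cycles: int            # CPU clock cycles in the interval
    cache_references: int  # cache accesses in the interval
    cache_misses: int      # cache accesses that missed


def derive_indicators(sample: CounterSample) -> dict:
    """Turn raw counters into the ratio-style indicators named in the text."""
    # IPC = instructions per clock cycle; guard against an empty interval.
    ipc = sample.instructions / sample.cycles if sample.cycles else 0.0
    # Cache miss rate = misses / accesses.
    miss_rate = (
        sample.cache_misses / sample.cache_references
        if sample.cache_references
        else 0.0
    )
    return {"timestamp": sample.timestamp, "ipc": ipc, "cache_miss_rate": miss_rate}
```

Keeping the timestamp next to the derived values matters for the later steps, which select indicator data by time window.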
The data processing module includes a preprocessing module, which classifies the event severity level syslogseverity in the system log, in order of increasing severity, as "info, warning, error, fatal/failure, Failed", and so on.
Table 3: syslogseverity classification
Logs whose syslogseverityText field in the Linux system log contains "fatal/failure, Failed" or similar values are found and recorded as system failure or fault logs.
Table 4: Examples of failure/fault logs in syslog
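The severity filter described above can be sketched as follows. This is only an illustration under stated assumptions: each log entry is taken to be a dictionary whose severity text sits under the hypothetical key `syslogseverity_text`, and the set of failure severities is limited to the values named in the text.

```python
# Severity texts treated as failure/fault; limited to the levels named in the text.
FAILURE_SEVERITIES = {"fatal", "failure", "failed"}


def is_failure_log(record: dict) -> bool:
    """Return True when the record's severity text marks a failure or fault."""
    severity = record.get("syslogseverity_text", "")
    # Compare case-insensitively, since the text mixes "fatal" and "Failed".
    return severity.strip().lower() in FAILURE_SEVERITIES


def filter_failure_logs(records: list) -> list:
    """Keep only the failure/fault records, preserving their order."""
    return [r for r in records if is_failure_log(r)]
```

Entries with lower severities such as info, warning, or error are dropped here and take no part in the later classification.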
After the system failure/fault logs are found, the message of each event in the failure/fault logs is used to classify the logs and find the fail categories (failure types) present in the system. A text similarity algorithm (the choice of algorithm is not limited) can be used for this classification. Here, the present invention cites a simple algorithm [1] that classifies failures by message text similarity. The detailed procedure is as follows:
1. The failure type of the first failure/fault event is set to 1 and stored in the fail category table. Then the text similarity between the message of the next failure/fault event and the messages of the failure types already in the fail category table is computed. The text similarity algorithm is described in detail below:
A. First, the message of a failure type already in the fail category table is denoted $MT_i$:
$$MT_i = \{mt_{i1}, mt_{i2}, \ldots, mt_{im}\} \quad (1 \le i \le k) \tag{1}$$
where $mt_{i1}, mt_{i2}, \ldots, mt_{im}$ are the words of the $i$-th message type among the messages of the existing failure types, $m$ is the total number of words in that message type, and $k$ is the total number of message types among the messages of the existing failure types.
B. Second, the words contained in the message of the failure/fault event are denoted $M_j$:
$$M_j = \{m_{j1}, m_{j2}, \ldots, m_{jn}\} \tag{2}$$
where $m_{j1}, m_{j2}, \ldots, m_{jn}$ are the words of the message of the failure/fault event and $n$ is the total number of words in that message.
C. Then, the similarity $S_{ij}$ between $MT_i$ and $M_j$ is computed as the number of words of $M_j$ that also appear in $MT_i$:
$$S_{ij} = \sum_{l=1}^{n} \mathbb{1}\left[m_{jl} \in MT_i\right] \tag{3}$$
D. Finally, the message similarity $S$ of the failure/fault event is calculated:
$$S = S_{ij} / n \tag{4}$$
A similarity threshold γ is set. If the message similarity of the failure/fault event is greater than γ, the fail category of this event is a failure type that already exists in the fail category table, and the event is assigned the fail category of the message with which it has the highest similarity. If the similarity is less than γ, the message of this failure/fault event is added to the fail category table as a new failure type. The remaining events are processed in the same way, and the resulting classes serve as the fault categories.
The data processing module is further illustrated by the following example. As shown in Fig. 2, the message of the failure/fault event is known to be "fatal: qmgr_active_feed: 2A21B3FB015: rename from deferred to active: Read-only file system".
A. As shown in Fig. 2, the failure type table already contains two failure types, $MT_1$ and $MT_2$, where
$MT_1$ = {fatal, qmgr_active_feed, C0089459A93, rename, from, deferred, to, active, Read-only, file, system} (m = 11);
$MT_2$ = {Device, /dev/sdd, FAILED, SMART, self-check., BACK, UP, DATA, NOW} (m = 9).
B. As shown in Fig. 2, the set of words contained in the message of the failure/fault event is
$M_j$ = {fatal, qmgr_active_feed, 2A21B3FB015, rename, from, deferred, to, active, Read-only, file, system} (n = 11).
C. As shown in Fig. 2, 10 words satisfy $m_{jl} \in MT_1$, so $S_{1j} = 10$, and the similarity between $MT_1$ and $M_j$ is $S = S_{1j}/n = 10/11 \approx 0.91$.
D. The similarity threshold is set to γ = 0.6. Since the message similarity 0.91 of the failure/fault event is greater than the threshold 0.6, the fail category of this failure/fault event is the existing failure type "1" in the fail category table.
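The classification procedure and the worked example above can be sketched in a few lines. This is an illustration rather than the algorithm of reference [1] itself: the tokenization rule (splitting on whitespace, commas, and colons), the case-insensitive word comparison, and the function names are assumptions, and the second message is reassembled from the word set $MT_2$ given above.

```python
import re


def tokenize(message: str) -> list:
    """Split a message into words on whitespace, commas, and colons (assumed rule)."""
    return [w for w in re.split(r"[\s,:]+", message) if w]


def similarity(event_words: list, category_words: list) -> float:
    """S = S_ij / n: share of the event's n words that occur in the category message."""
    if not event_words:
        return 0.0
    known = {w.lower() for w in category_words}
    matches = sum(1 for w in event_words if w.lower() in known)
    return matches / len(event_words)


def classify(message: str, category_table: list, gamma: float = 0.6) -> int:
    """Return the 1-based failure type of `message`, adding a new type when needed."""
    words = tokenize(message)
    best_type, best_score = None, 0.0
    for idx, category_words in enumerate(category_table, start=1):
        score = similarity(words, category_words)
        if score > best_score:
            best_type, best_score = idx, score
    if best_type is not None and best_score > gamma:
        return best_type
    # Not similar enough to any known type (a tie with gamma counts as new here).
    category_table.append(words)
    return len(category_table)


table = []
first = classify("fatal: qmgr_active_feed: C0089459A93: rename from deferred "
                 "to active: Read-only file system", table)
second = classify("Device: /dev/sdd, FAILED SMART self-check. BACK UP DATA NOW", table)
third = classify("fatal: qmgr_active_feed: 2A21B3FB015: rename from deferred "
                 "to active: Read-only file system", table)
```

With these inputs the third message shares 10 of its 11 words with the first, so it is assigned to the first failure type instead of creating a new one.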
The data processing module further includes a time window division module, which queries the effective time window of each failure type in the system log and then associates the start and end timestamps of the effective time window of a failure event type in the system log with the timestamps of the collected indicator data.
1. Query the effective time window of a failure type: by querying the timestamps of the failure/fault logs of the same failure/fault category, the effective time window of each failure/fault category is determined, and by looking up the timestamps of the performance indicators within the effective time window, the performance indicator data within that window are obtained. For example, first determine the timestamp of a failure event of the failure type of interest; then search upward and downward along the time dimension for events of the same failure type and, if any are found, extend the time window upward or downward. Alternatively, search upward and downward by the similarity of the log messages and use the timestamps of the similar log messages as the time window. A time window threshold Th_tw can be set here: if the upward or downward search along the time dimension exceeds the threshold Th_tw, the query for timestamps of that failure type ends, and the period between the failure event with the earliest timestamp and the failure event with the latest timestamp is taken as the effective time window of that failure type.
As shown in Table 5, the three log messages of the three failure events (syslogseverity fatal) whose failure type is 1 are similar. To query the effective time window of this failure type: if the upward search finds no further event of failure type 1 within 10 min before the timestamp "2008-10-29 02:10:29" (the threshold Th_tw is set to 10 min here), and the downward search finds no further event of failure type 1 within 10 min after the timestamp "2008-10-29 03:44:01", then the effective time window of this failure type is determined to be 2008-10-29 02:10:29 to 2008-10-29 03:44:01. The effective time windows of other failure types are determined in the same way.
Table 5: Example timestamps of failure/fault events in syslog
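The window search can be read in more than one way. The sketch below follows one reading only: starting from a chosen event, the window is extended to a neighbouring event of the same failure type as long as that neighbour lies within Th_tw of the current window edge. The function name, its parameters, and the timestamps in the usage lines are illustrative.

```python
from datetime import datetime, timedelta


def effective_window(timestamps, seed, th_tw):
    """Grow a time window around `seed` over the events of one failure type.

    timestamps: datetimes of all events of that failure type (any order).
    seed:       timestamp of the event the search starts from.
    th_tw:      timedelta; the largest gap the search may bridge.
    """
    ordered = sorted(timestamps)
    start = end = seed
    # Search upward (toward earlier events), nearest first.
    for ts in reversed([t for t in ordered if t < seed]):
        if start - ts > th_tw:
            break
        start = ts
    # Search downward (toward later events), nearest first.
    for ts in (t for t in ordered if t > seed):
        if ts - end > th_tw:
            break
        end = ts
    return start, end


events = [
    datetime(2020, 1, 1, 10, 0, 0),
    datetime(2020, 1, 1, 10, 5, 0),
    datetime(2020, 1, 1, 10, 12, 0),
    datetime(2020, 1, 1, 11, 30, 0),
]
window = effective_window(events, events[1], timedelta(minutes=10))
```

In this usage the event at 11:30 is more than 10 minutes past the previous one, so it falls outside the window and would start a separate window of its own.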
2. then, from starting, ending time stamp and the achievement data collected in the effective time window of failure type Timestamp it is associated, determine the time window range of achievement data.For following table 6, failure class is found by previous step The effective time window of type 1 is (2008-10-2902:10:29 to 2008-10-2903:44:01), then, when effective according to this Between at the beginning of window stamp and ending time stamp go to search the timestamp of the performance indicator acquired, and obtain the effective time window Performance indicator data in mouth range.If new effective time window that is subsequent and having found the failure type, still according to above-mentioned Method obtain the corresponding performance indicator data of the effective time window.
Table 6. Association between the timestamps of a failure type's effective time window and the timestamps of the indicator data
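The two steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: Th_tw is read here as the largest gap that the search bridges between two similar failure events, and the function names `effective_windows` and `metrics_in_window` as well as the `(timestamp, values)` record layout are hypothetical.

```python
from datetime import datetime, timedelta

def effective_windows(failure_times, th_tw):
    """Group the timestamps of one failure type into effective time windows.

    Assumption: a window keeps growing while the next similar failure event
    lies within th_tw of the window's current end; otherwise a new window starts.
    """
    windows = []
    for ts in sorted(failure_times):
        if windows and ts - windows[-1][1] <= th_tw:
            windows[-1][1] = ts        # extend the current window
        else:
            windows.append([ts, ts])   # open a new window at this event
    return [(start, end) for start, end in windows]

def metrics_in_window(metrics, window):
    """Return the (timestamp, values) samples that fall inside the window."""
    start, end = window
    return [(ts, values) for ts, values in metrics if start <= ts <= end]

def parse(text):
    """Parse a timestamp written as YYYY-MM-DD HH:MM:SS."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
```

Each returned window is a `(start, end)` pair that can be passed directly to `metrics_in_window`. Treating Th_tw as a cap on the total window length instead would be an equally plausible reading of the text and would only change the condition inside the loop.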
The off-line modeling module is responsible for off-line training and constructs the fault indices models. It comprises: a summarizing module, which aggregates the performance indicator data corresponding to all effective time windows of each failure/fault category into a fault indices group; and a model construction module, which applies a target classification algorithm to the performance indicator data in each fault indices group to construct the fault indices model of that failure type.
Specifically, the performance indicator data corresponding to all effective time windows of each failure type are gathered and aggregated into one fault indices group (failure_metrics_group). Then, for the failure_metrics_group of each failure type, a target classification algorithm is applied to the indicator data points in the group to construct the fault indices model of that failure type. For example, the SVDD algorithm can be used to train an SVDD model of the failure_group as its fault indices model: indicator data points inside the SVDD hypersphere are treated as fault indices of the failure type, and indicator data points outside the hypersphere are treated as outlier indicator points.
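The aggregation performed by the summarizing module can be illustrated with a short Python sketch. The input layout (one `(failure_type, points)` pair per effective time window) and the function name `build_failure_metrics_groups` are assumptions made for this example only.

```python
from collections import defaultdict

def build_failure_metrics_groups(window_metrics):
    """Merge per-window metric points into one group per failure type.

    window_metrics: iterable of (failure_type, points) pairs, where points is
    the list of metric vectors collected inside one effective time window.
    """
    groups = defaultdict(list)
    for failure_type, points in window_metrics:
        groups[failure_type].extend(points)
    return dict(groups)
```

Each value of the returned dictionary plays the role of a failure_metrics_group and is the training set for that failure type's model.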
Principle and implementation of the SVDD algorithm. As shown in Fig. 2, the principle of SVDD (support vector domain description) is similar to that of SVM, and it can be used as a one-class SVM. The optimization objective of SVDD is to find the smallest sphere with center a and radius R. The basic idea is that, since only one class exists, a minimal hypersphere is trained to enclose all of the data of that class (a hypersphere is the generalization of a sphere to spaces of more than 3 dimensions; it corresponds to a curve in 2-dimensional space and to a spherical surface in 3-dimensional space). When a new data point is to be identified, it belongs to the class if it falls inside the hypersphere and does not belong otherwise.
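For reference, the objective described above is commonly written in the literature as the following soft-margin problem. The slack variables \xi_i and the trade-off constant C do not appear in the text above and are included here as the usual textbook formulation, not as a statement of the formulation used in this patent.

```latex
\begin{aligned}
\min_{R,\,a,\,\xi}\quad & R^{2} + C \sum_{i=1}^{n} \xi_{i} \\
\text{s.t.}\quad & \lVert x_{i} - a \rVert^{2} \le R^{2} + \xi_{i}, \qquad \xi_{i} \ge 0, \qquad i = 1, \dots, n .
\end{aligned}
```

A new point z is accepted as a member of the class when \lVert z - a \rVert^{2} \le R^{2}, which is the inside-the-hypersphere test described in the paragraph above.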
With the above method, a fault indices model can be constructed for every failure type that has been found, and together these models form a fault indices model library. The library can be continuously updated and revised as new log information arrives.
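As a rough illustration of such a model library, the sketch below fits one enclosing ball per failure type and stores it under the failure type's key. It is a simplification and an assumption of this example: it uses the Badoiu-Clarkson iteration for an approximate minimum enclosing ball, which corresponds to SVDD with a linear kernel and no slack variables, rather than a full kernel SVDD solver. All function names are hypothetical.

```python
import math

def fit_enclosing_ball(points, iterations=1000):
    """Approximate the minimum enclosing ball of the points.

    Badoiu-Clarkson iteration: repeatedly move the center a shrinking step
    toward the point that is currently farthest away.
    """
    center = list(points[0])
    for i in range(1, iterations + 1):
        farthest = max(points, key=lambda p: math.dist(p, center))
        center = [c + (f - c) / (i + 1) for c, f in zip(center, farthest)]
    radius = max(math.dist(p, center) for p in points)
    return center, radius

def build_model_library(failure_metrics_groups):
    """Map each failure type to the (center, radius) model of its group."""
    return {ftype: fit_enclosing_ball(pts)
            for ftype, pts in failure_metrics_groups.items()}

def is_inside(model, point):
    """True when the point lies inside or on the model's hypersphere."""
    center, radius = model
    return math.dist(point, center) <= radius
```

Updating the library for newly found windows then amounts to refitting the entry of the affected failure type with the enlarged group.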
Fault indices labeling module. The previous step establishes a fault indices model library for the existing failure_group; in subsequent work this library can be used directly to label performance indicators automatically. For example, in a system that collects only performance indicators and has no logs, the system's performance indicator data can be compared with the fault indices models (the local outlier probability can be used for this comparison, although the comparison is not limited to that algorithm), and the performance indicators within a given time period are then automatically labeled as fault indices or not, in order to discover failures or faults of the system.
Specifically, the fault indices model library contains the SVDD models of multiple failure_group. When new indicator data are obtained, the local outlier probability algorithm can be applied to them to calculate the outlier probability of the new data with respect to the center point a of every SVDD model. A threshold outlier_probability_threshold is set for the outlier probability. If the outlier probability of the new data with respect to the center point a of some of the SVDD models is less than outlier_probability_threshold, the model with the smallest outlier probability is selected and the new data are automatically labeled as belonging to the SVDD model of that fault index. If the outlier probability with respect to the center point a of every SVDD model is not less than outlier_probability_threshold, the indicator data do not belong to any SVDD model in the fault indices model library.
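The selection rule in the preceding paragraph can be sketched as follows, assuming the outlier probabilities have already been computed and are supplied as a dictionary keyed by failure type. The function name `label_metric` and the dictionary layout are hypothetical; only outlier_probability_threshold comes from the text.

```python
def label_metric(outlier_probabilities, outlier_probability_threshold):
    """Pick the fault indices model that a new metric sample belongs to.

    outlier_probabilities: dict mapping failure type -> outlier probability
    of the sample with respect to that failure type's model.
    Returns the failure type with the smallest probability if that probability
    is below the threshold, otherwise None (the sample matches no model).
    """
    if not outlier_probabilities:
        return None
    best = min(outlier_probabilities, key=outlier_probabilities.get)
    if outlier_probabilities[best] < outlier_probability_threshold:
        return best
    return None
```

Returning `None` corresponds to the case in which the sample is not labeled as a fault index of any known failure type.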
For example, the principle and implementation of the local outlier probability (LOOP) algorithm are as follows.
LOOP is an unsupervised data mining method that was first applied in the field of outlier detection. The LOOP value indicates the probability that a sample is an outlier.
First, a collected performance indicator of the system is denoted xi, and the set of all SVDD model center points, all_set(a), is obtained. Second, the probabilistic set distance from xi to the center points is calculated over this set; the distribution density of the indicators around xi is then estimated and defined as the probabilistic local outlier factor. Finally, the standard deviation of the outlier factor is calculated, and the Gaussian error function is used to compute the local outlier probability loop(xi). The value range of loop(xi) is [0, 1]; the larger loop(xi) is, the more likely xi is an outlier indicator point.
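The calculation outlined above can be illustrated with a compact Python sketch of the generally published local outlier probability procedure (probabilistic set distance, probabilistic local outlier factor, normalization, error function). The neighborhood size `k`, the significance factor `lam`, and the use of a plain point list as the reference set are assumptions of this example; the text does not specify how the SVDD center points enter the reference set.

```python
import math

def _neighbours(points, i, k):
    """Indices of the k points nearest to points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]

def loop_scores(points, k=3, lam=3.0):
    """Return the local outlier probability of every point, each in [0, 1]."""
    n = len(points)
    neigh = [_neighbours(points, i, k) for i in range(n)]
    # Probabilistic set distance: lam times the root mean squared distance
    # from a point to its neighbourhood.
    pdist = [lam * math.sqrt(sum(math.dist(points[i], points[j]) ** 2
                                 for j in neigh[i]) / len(neigh[i]))
             for i in range(n)]
    # Probabilistic local outlier factor: own distance relative to the
    # average distance of the neighbours, shifted so that 0 means "typical".
    plof = []
    for i in range(n):
        expected = sum(pdist[j] for j in neigh[i]) / len(neigh[i])
        plof.append(pdist[i] / expected - 1.0 if expected > 0 else 0.0)
    # Normalization by the spread of the factor over the whole set.
    nplof = lam * math.sqrt(sum(v * v for v in plof) / n)
    if nplof == 0:
        return [0.0] * n
    # The Gaussian error function maps the normalized factor to a probability.
    return [max(0.0, math.erf(v / (nplof * math.sqrt(2)))) for v in plof]
```

A score close to 0 means the sample sits in a region as dense as its neighbors' regions, while a score approaching 1 marks it as a likely outlier; the scores can be compared directly with outlier_probability_threshold.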
The following is a method embodiment corresponding to the above system embodiment; the two embodiments can be implemented in cooperation with each other. The relevant technical details mentioned in the above embodiment remain valid in this embodiment and, to reduce repetition, are not repeated here. Correspondingly, the relevant technical details mentioned in this embodiment also apply to the above embodiment.
The present invention also provides a log-based fault indices automatic marking method, comprising:
a log collection step of collecting multiple system log entries from a distributed system or a standalone computer system;
an index collection step of acquiring performance indicator data in the distributed system or the standalone computer system;
a data processing step of first filtering out failure/fault logs according to the event level of the system logs, then assigning a failure/fault category to each failure/fault log according to its information content, and finally determining the effective time window of the performance indicator data according to each failure/fault category;
an off-line modeling step of modeling all performance indicator data corresponding to the effective time windows of each failure/fault category to construct fault indices models;
a fault indices labeling step of automatically labeling, according to the fault indices models, whether the performance indicator data are fault indices.
In the log-based fault indices automatic marking method, the index collection step acquires the performance indicator data of the system at fixed time intervals, the performance indicator data including CPU utilization, memory usage, disk read/write bandwidth, IPC, and cache miss rate.
In the log-based fault indices automatic marking method, the data processing step includes a time window division step, which determines the effective time window of each failure/fault category by querying the timestamps of the failure/fault logs with a similar failure/fault category in combination with a preset time window threshold, and obtains the performance indicator data within the effective time window by searching the timestamps of the performance indicators in that window.
In the log-based fault indices automatic marking method, the off-line modeling step includes:
an aggregation step of aggregating the performance indicator data corresponding to all effective time windows of each failure/fault category into a fault indices group;
a model construction step of applying a target classification algorithm to the performance indicator data in each fault indices group to construct the fault indices model of that failure type.
In the log-based fault indices automatic marking method, the fault indices labeling step labels whether the performance indicator data are fault indices by calculating the local outlier probability between the performance indicator data and the fault indices models.
Although the present invention is disclosed through the above embodiments, the specific embodiments serve only to explain the invention and are not intended to limit it. Any person skilled in the art may make changes and refinements without departing from the spirit and scope of the invention, so the scope of protection of the invention is defined by the claims.

Claims (8)

1. A log-based fault indices automatic marking system, characterized by comprising:
a log collection module for collecting multiple system log entries from a distributed system or a standalone computer system;
an index collection module for acquiring performance indicator data in the distributed system or the standalone computer system;
a data processing module for first filtering out failure/fault logs according to the event level of the system logs, then assigning a failure/fault category to each failure/fault log according to its information content, and finally determining the effective time window of the performance indicator data according to each failure/fault category;
an off-line modeling module for modeling all performance indicator data corresponding to the effective time windows of each failure/fault category to construct fault indices models;
a fault indices labeling module for automatically labeling, according to the fault indices models, whether the performance indicator data are fault indices;
wherein the data processing module includes:
a time window division module, which determines the effective time window of each failure/fault category by querying the timestamps of the failure/fault logs with a similar failure/fault category in combination with a preset time window threshold, and obtains the performance indicator data within the effective time window by searching the timestamps of the performance indicators in that window.
2. The log-based fault indices automatic marking system according to claim 1, characterized in that the index collection module acquires the performance indicator data of the system at fixed time intervals, the performance indicator data including CPU utilization, memory usage, disk read/write bandwidth, IPC, and cache miss rate.
3. The log-based fault indices automatic marking system according to claim 1, characterized in that the off-line modeling module includes:
a summarizing module for aggregating the performance indicator data corresponding to all effective time windows of each failure/fault category into a fault indices group;
a model construction module for applying a target classification algorithm to the performance indicator data in each fault indices group to construct the fault indices model of that failure/fault category.
4. The log-based fault indices automatic marking system according to claim 1, characterized in that the fault indices labeling module labels whether the performance indicator data are fault indices by calculating the local outlier probability between the performance indicator data and the fault indices models.
5. A log-based fault indices automatic marking method, characterized by comprising:
a log collection step of collecting multiple system log entries from a distributed system or a standalone computer system;
an index collection step of acquiring performance indicator data in the distributed system or the standalone computer system;
a data processing step of first filtering out failure/fault logs according to the event level of the system logs, then assigning a failure/fault category to each failure/fault log according to its information content, and finally determining the effective time window of the performance indicator data according to each failure/fault category;
an off-line modeling step of modeling all performance indicator data corresponding to the effective time windows of each failure/fault category to construct fault indices models;
a fault indices labeling step of automatically labeling, according to the fault indices models, whether the performance indicator data are fault indices;
wherein the data processing step includes a time window division step, which determines the effective time window of each failure/fault category by querying the timestamps of the failure/fault logs with a similar failure/fault category in combination with a preset time window threshold, and obtains the performance indicator data within the effective time window by searching the timestamps of the performance indicators in that window.
6. The log-based fault indices automatic marking method according to claim 5, characterized in that the index collection step acquires the performance indicator data of the system at fixed time intervals, the performance indicator data including CPU utilization, memory usage, disk read/write bandwidth, IPC, and cache miss rate.
7. The log-based fault indices automatic marking method according to claim 5, characterized in that the off-line modeling step includes:
an aggregation step of aggregating the performance indicator data corresponding to all effective time windows of each failure/fault category into a fault indices group;
a model construction step of applying a target classification algorithm to the performance indicator data in each fault indices group to construct the fault indices model of that failure type.
8. The log-based fault indices automatic marking method according to claim 5, characterized in that the fault indices labeling step labels whether the performance indicator data are fault indices by calculating the local outlier probability between the performance indicator data and the fault indices models.
CN201710450900.XA 2017-06-15 2017-06-15 A kind of fault indices automatic marking method and system based on log Active CN107301118B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710450900.XA CN107301118B (en) 2017-06-15 2017-06-15 A kind of fault indices automatic marking method and system based on log

Publications (2)

Publication Number Publication Date
CN107301118A CN107301118A (en) 2017-10-27
CN107301118B true CN107301118B (en) 2019-11-19

Family

ID=60134736

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710450900.XA Active CN107301118B (en) 2017-06-15 2017-06-15 A kind of fault indices automatic marking method and system based on log

Country Status (1)

Country Link
CN (1) CN107301118B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant