CN106708016A

CN106708016A - fault monitoring method and device

Info

Publication number: CN106708016A
Application number: CN201611199335.6A
Authority: CN
Inventors: 刘树仁; 冯超敏; 罗洪武; 文玲; 蔡长宁; 李长春; 张向阳; 朱启伟; 穆斌; 付占宝
Original assignee: Petrochina Co Ltd
Current assignee: Petrochina Co Ltd
Priority date: 2016-12-22
Filing date: 2016-12-22
Publication date: 2017-05-24
Anticipated expiration: 2036-12-22
Also published as: CN106708016B

Abstract

Embodiments of the present application provide a fault monitoring method and a fault monitoring device. The method comprises: acquiring status data of one or more target objects in a system; determining, according to the status data, the probability that each target object will fail; determining each target object whose probability of failure is greater than a preset threshold as an object to be monitored; and determining the cause of failure of the object to be monitored, and monitoring the object to be monitored according to that cause. The scheme uses distributed storage and applies several algorithms in combination under the MapReduce framework to analyze the status data, so that the failure probability of each target object in the system is predicted and the objects to be monitored are monitored. This solves the technical problems that existing fault monitoring methods cannot give early warning of potential faults in the system and have poor monitoring effect and low efficiency, and achieves the technical effect of effectively maintaining system safety.

Description

Fault monitoring method and device

Technical field

The present application relates to the technical field of oil exploration data processing, and more particularly to a fault monitoring method and device.

Background art

In the field of oil exploration data processing, the volume of data to be studied and processed is very large. High-performance computer clusters, workstations, and large-capacity, high-performance storage devices are therefore often used as the platform or system for seismic data processing and interpretation, in order to process oil exploration data.

When such a platform or system is used to process oil exploration data, the growing volume of data to be processed, the growing cluster scale, and the interleaved use of various application software make clusters, workstations, storage devices, and the like prone to faults of all kinds. These faults affect production tasks and in turn cause losses. How to monitor faults of the data processing platform or system and ensure its stability is therefore receiving increasing attention.

To keep the platform or system working safely and stably and to discover faults in the system in time, existing fault monitoring methods generally collect the status data of each device and compare it with a preset threshold to judge whether the device has failed. In practice, however, such a method can only discover devices that have already failed and raise alarms for them; it cannot effectively predict, give warning of, or maintain against faults that are about to occur.

Existing fault monitoring methods therefore have the technical problems that potential faults in the system cannot be predicted and that fault monitoring has poor accuracy and low efficiency.

No effective solution to the above problems has yet been proposed.

Summary of the invention

Embodiments of the present application provide a fault monitoring method and device, to solve the technical problems of existing fault monitoring methods that potential faults cannot be predicted and that the accuracy of monitoring system faults is low.

An embodiment of the present application provides a fault monitoring method, comprising:

acquiring status data of one or more target objects in a system;

determining, according to the status data of the one or more target objects, the probability that each of the one or more target objects will fail;

determining each target object whose probability of failure is greater than a preset threshold as an object to be monitored; and

determining the cause of failure of the object to be monitored, and monitoring the object to be monitored according to that cause.

In one embodiment, acquiring the status data of the one or more target objects in the system comprises:

dividing the plurality of target objects into a plurality of clusters according to interface type, wherein target objects in the same cluster use the same type of interface; and

acquiring the status data of the target objects in the same cluster using the same data acquisition mode.

In one embodiment, after the status data of the target objects in the same cluster is acquired using the same data acquisition mode, the method further comprises:

converting the status data of target objects in different clusters into status data of the same format.

In one embodiment, determining, according to the status data of the one or more target objects, the probability that each of the one or more target objects will fail comprises:

determining, according to the status data of the one or more target objects, one or more preset models that match the status data of the one or more target objects; and

determining, according to the one or more preset models, the probability that each of the one or more target objects will fail.

In one embodiment, determining the cause of failure of the object to be monitored comprises:

determining the cause of failure of the object to be monitored according to the status data of the object to be monitored and the preset model that matches that status data.

In one embodiment, monitoring the object to be monitored according to its cause of failure comprises:

performing, according to the cause of failure of the object to be monitored and the probability that the object to be monitored will fail, at least one of the following business processes: repairing, deleting, or replacing an object to be monitored that has already failed in the system; repairing, deleting, or replacing an object to be monitored that has not yet failed in the system; and sending an alarm for the object to be monitored in the system.

In one embodiment, after the business process is performed according to the cause of failure of the object to be monitored and the probability that the object to be monitored will fail, the method further comprises:

storing the result of the business process in a knowledge database as monitoring result data; and

correcting the preset model according to the monitoring result data.

A system problem uploaded by a user through a preset channel is received;

the system problem is used as the status data.

In one embodiment, the plurality of preset models are obtained with a preset algorithm under the MapReduce framework, wherein the preset algorithm includes a clustering algorithm and/or a Bayesian algorithm.

In one embodiment, the plurality of preset models being obtained with a preset algorithm under the MapReduce framework includes: on a distributed storage platform, the plurality of preset models are obtained with the preset algorithm under the MapReduce framework.

In one embodiment, after the status data of the one or more target objects in the system is acquired, the status data is stored in the knowledge database in the form of a distributed database.

Based on the same inventive concept, an embodiment of the present application further provides a fault monitoring device, comprising:

a status data acquisition module, configured to acquire status data of one or more target objects in a system;

a fault probability determination module, configured to determine, according to the status data of the one or more target objects, the probability that each of the one or more target objects will fail;

an object-to-be-monitored determination module, configured to determine each target object whose probability of failure is greater than a preset threshold as an object to be monitored; and

an object-to-be-monitored processing module, configured to determine the cause of failure of the object to be monitored and to monitor the object to be monitored according to that cause.

In one embodiment, the status data acquisition module comprises:

a cluster division unit, configured to divide the plurality of target objects into a plurality of clusters according to interface type, wherein target objects in the same cluster use the same type of interface; and

a data acquisition unit, configured to acquire the status data of the target objects in the same cluster using the same data acquisition mode.

In one embodiment, the fault probability determination module comprises:

a preset model determination unit, configured to determine, according to the status data of the one or more target objects, one or more preset models that match the status data of the one or more target objects; and

a fault probability determination unit, configured to determine, according to the one or more preset models, the probability that each of the one or more target objects will fail.

In one embodiment, the object-to-be-monitored processing module comprises:

a fault cause determination unit, configured to determine the cause of failure of the object to be monitored according to the status data of the object to be monitored and the preset model that matches that status data; and

a business processing unit, configured to perform, according to the cause of failure of the object to be monitored and the probability that the object to be monitored will fail, at least one of the following business processes: repairing, deleting, or replacing an object to be monitored that has already failed in the system; repairing, deleting, or replacing an object to be monitored that has not yet failed in the system; and sending an alarm for the object to be monitored in the system.

In the embodiments of the present application, on a distributed computing platform (the Hadoop platform) and within the MapReduce framework, a clustering algorithm and a Bayesian algorithm are applied in combination to analyze in depth the collected status data of each target object in the system and obtain the probability that each target object will fail. Target objects with a high probability of failure can then be monitored and handled so that faults are prevented. This solves the technical problem of existing fault monitoring methods that potential faults in the system cannot be predicted, and achieves the technical effect of giving early warning of both faults that have occurred and faults that have not yet occurred in the system, improving the accuracy of fault monitoring.

Brief description of the drawings

To describe the technical solutions of the embodiments of the present application or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some of the embodiments recorded in the present application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a process flowchart of the fault monitoring method according to an embodiment of the present application;

Fig. 2 is a schematic diagram of the implementation flow, under the MapReduce framework, of the naive Bayes algorithm in the fault monitoring method/device provided by the embodiments of the present application;

Fig. 3 is a schematic diagram of acquiring status data with the fault monitoring method/device provided by the embodiments of the present application;

Fig. 4 is a structural diagram of the fault monitoring device according to an embodiment of the present application;

Fig. 5 is a schematic diagram of maintaining the data system of an exploration center with the fault monitoring method/device provided by the embodiments of the present application;

Fig. 6 is a schematic diagram of the general status data collection model in the fault monitoring method/device provided by the embodiments of the present application;

Fig. 7 is a schematic diagram of the combined use of multiple algorithms for analysis under the MapReduce framework in the fault monitoring method/device provided by the embodiments of the present application;

Fig. 8 is a schematic diagram of the implementation flow, under the MapReduce framework, of the K-means clustering algorithm in the fault monitoring method/device provided by the embodiments of the present application.

Specific embodiments

To enable those skilled in the art to better understand the technical solutions of the present application, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort shall fall within the scope of protection of the present application.

Existing fault monitoring methods merely acquire status data and compare it with a preset threshold; they do not make full use of the status data or analyze it in depth. As a result, in practice they can only discover faults that have already occurred, cannot give early warning of potential faults in the system, and monitor faults with poor effect and low efficiency. Addressing the root cause of these technical problems, the present application considers combining a distributed storage method with the MapReduce framework and applying several algorithms in combination to make full use of the status data of each target object, determining through intelligent analysis the failure probability of each target object and the cause of failure, and then performing preventive maintenance on the target objects to be monitored. This solves the technical problems of existing fault monitoring methods that early warning of potential faults is not possible and that fault monitoring accuracy is low, and achieves the technical effect of giving early warning of both faults that have occurred and faults that have not yet occurred in the system, improving the accuracy of fault monitoring.

Based on the above idea, the present application provides a fault monitoring method. Referring to Fig. 1, the fault monitoring method provided by the present application may include the following steps.

Step 101: Acquire status data of one or more target objects in the system.

In one embodiment, the target objects may specifically include the CPUs, GPUs, storage devices, network connection devices, and supporting infrastructure (such as cooling fans) in the system. Of course, it should be noted that the target objects listed above are given only to better illustrate the embodiments of the present application; in a specific implementation, other related equipment or devices may be selected as target objects according to construction requirements. The present application is not limited in this respect.

In one embodiment, a system or platform may include multiple target objects of different types, and the interfaces through which status data is obtained differ between types of target object. For example, the system of an oil exploration data processing center includes multiple CPUs, multiple storage devices, and so on, and the interface for obtaining CPU status data is not the same as the interface for obtaining storage device status data. To improve the efficiency and accuracy of acquiring status data, the status data of each target object in a cluster may be obtained cluster by cluster through the interface shared by that cluster. Specifically, the following steps may be performed:

S1: Divide the plurality of target objects into a plurality of clusters according to interface type, wherein target objects in the same cluster use the same type of interface.

S2: Acquire the status data of the target objects in the same cluster using the same data acquisition mode.

In one embodiment, the status data formats of target objects in different clusters differ, and the format of directly obtained status data does not necessarily meet the requirements of subsequent use. For example, the status data format of target objects in the CPU cluster is not the same as that of target objects in the storage device cluster. Therefore, so that the collected status data of target objects in different clusters has a unified data format that meets the requirements of subsequent processing, after the status data of the target objects in the same cluster is acquired using the same data acquisition mode, the method may further include: converting the status data of target objects in different clusters into status data of the same format.

For example, the different CPUs in a data processing system are grouped into one CPU cluster, and the status data of each CPU in the CPU cluster is obtained through the CPU status data acquisition interface. The status data of each CPU in the CPU cluster is then converted to a unified format, so that its format is the same as that of the status data of other target objects and meets the requirements of subsequent use. The status data of the target objects of the GPU cluster, the storage device cluster, the network connection device cluster, and so on can be obtained in the same manner, and details are not repeated here.
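The grouping and format conversion described above can be sketched in Python as follows. This is only an illustrative sketch: the interface names (`cpu_api`, `storage_api`), the raw field names, and the unified record layout (`load`, `temp_c`) are assumptions made for the example and do not come from the application.

```python
from collections import defaultdict

# Hypothetical raw readings; interface names and fields are illustrative assumptions.
TARGETS = [
    {"id": "cpu-0", "interface": "cpu_api", "raw": {"util": "80%", "temp_c": 71}},
    {"id": "cpu-1", "interface": "cpu_api", "raw": {"util": "35%", "temp_c": 55}},
    {"id": "disk-0", "interface": "storage_api", "raw": {"used_gb": 900, "total_gb": 1000}},
]


def group_by_interface(targets):
    """S1: put target objects that share an interface type into the same cluster."""
    clusters = defaultdict(list)
    for target in targets:
        clusters[target["interface"]].append(target)
    return dict(clusters)


def normalize_cpu(raw):
    """Convert a raw CPU reading into the unified record layout."""
    return {"load": float(raw["util"].rstrip("%")) / 100.0, "temp_c": float(raw["temp_c"])}


def normalize_storage(raw):
    """Convert a raw storage reading into the unified record layout."""
    return {"load": raw["used_gb"] / raw["total_gb"], "temp_c": None}


# One acquisition/conversion routine per cluster (S2 plus format unification).
NORMALIZERS = {"cpu_api": normalize_cpu, "storage_api": normalize_storage}


def collect(targets):
    """Acquire status data cluster by cluster and return records of one format."""
    records = []
    for interface, members in group_by_interface(targets).items():
        normalize = NORMALIZERS[interface]
        for target in members:
            record = {"id": target["id"], "cluster": interface}
            record.update(normalize(target["raw"]))
            records.append(record)
    return records


records = collect(TARGETS)
```

Every record returned by `collect` has the same keys regardless of which cluster it came from, which is the property the later analysis steps rely on.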

In one embodiment, to improve the efficiency of reading the status data of each target object in subsequent processing and to improve the stability of the status data, the status data may be stored and presented in the form of HBase (Hadoop Database, a distributed database). Specifically, the status data may be stored in the knowledge database in the form of a distributed database. It should be noted that storage in HBase form differs from the storage form of ordinary databases: the status data is stored and presented by column, which can improve read efficiency and data stability. Of course, other suitable databases may also be used for storage as the case requires. The present application is not limited in this respect.
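To illustrate why storing status data by column can make reads of a single attribute cheaper, the following toy in-memory store keeps each column separately. It is a conceptual sketch only and is not the HBase client API; the class name `ColumnStore`, the row keys, and the column names are hypothetical.

```python
from collections import defaultdict


class ColumnStore:
    """Toy in-memory stand-in for a column-oriented table (not the HBase API)."""

    def __init__(self):
        # column name -> {row key -> value}
        self._columns = defaultdict(dict)

    def put(self, row_key, column, value):
        """Write one cell."""
        self._columns[column][row_key] = value

    def scan_column(self, column):
        """Read one attribute for all rows without touching other columns."""
        return dict(self._columns.get(column, {}))

    def get_row(self, row_key):
        """Reassemble a full row from the per-column maps."""
        return {
            column: cells[row_key]
            for column, cells in self._columns.items()
            if row_key in cells
        }


store = ColumnStore()
store.put("cpu-0@t1", "state:temp_c", 71)
store.put("cpu-0@t1", "state:load", 0.8)
store.put("cpu-1@t1", "state:temp_c", 55)

temperatures = store.scan_column("state:temp_c")
```

Scanning `state:temp_c` touches only the temperature cells, whereas a row-oriented layout would have to read every attribute of every row to extract the same values.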

Step 102: Determine, according to the status data of the one or more target objects, the probability that each of the one or more target objects will fail.

In one embodiment, to make predictions about target objects in the system that have not yet failed, the probability that a target object will fail may be determined by analyzing its status data, and whether the target object will fail in the future may be predicted from that probability. A specific implementation may include:

S1: Determine, according to the status data of the one or more target objects, one or more preset models that match the status data of the one or more target objects.

In this embodiment, determining the preset model that matches the status data of a target object may consist of selecting, according to the status data of the target object, the preset model with the smallest difference from the status data among the multiple preset models in the knowledge database as the corresponding matching model.

It should be noted that, in this embodiment, to judge accurately from the status data which preset model a target object matches, the naive Bayes algorithm may be used in a specific implementation to identify the preset model corresponding to the status data. For example, according to the status data of a target object, Reduce tasks may calculate the probability that the target object belongs to each preset model class, and the preset model class with the largest probability is taken as the preset model matching that target object.

S2: Determine, according to the one or more preset models, the probability that each of the one or more target objects will fail.

In one embodiment, the multiple preset models are obtained with a preset algorithm under the MapReduce framework. That is, the multiple preset models are obtained by applying several algorithms in combination under the MapReduce framework, where the preset algorithm, i.e. the several algorithms, includes a clustering algorithm and a Bayesian algorithm. It should be noted that the clustering algorithm and the Bayesian algorithm generally need to run under the MapReduce framework to operate efficiently and accurately, and the MapReduce framework as a whole generally needs to run on a distributed storage platform (the Hadoop platform) so that the several algorithms, i.e. the clustering algorithm and the Bayesian algorithm, can be run in combination to solve the corresponding problem.
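As a rough single-process illustration of the MapReduce programming model on which the preset models rely, the sketch below runs a map phase, groups the emitted pairs by key, and then runs a reduce phase. It does not use Hadoop; the function names and the sample readings are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(records, map_fn):
    """Apply map_fn to every record and collect the emitted (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs


def shuffle(pairs):
    """Group values by key so that each key is reduced exactly once."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reduce_fn):
    """Apply reduce_fn to each key and its list of values."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def run_job(records, map_fn, reduce_fn):
    """Run map, shuffle and reduce in sequence on one machine."""
    return reduce_phase(shuffle(map_phase(records, map_fn)), reduce_fn)


# Hypothetical status readings: (target type, temperature in degrees Celsius).
readings = [("cpu", 70.0), ("cpu", 80.0), ("gpu", 60.0)]


def emit_temperature(record):
    kind, temperature = record
    return [(kind, temperature)]


def mean_temperature(kind, temperatures):
    return sum(temperatures) / len(temperatures)


averages = run_job(readings, emit_temperature, mean_temperature)
```

On a real cluster the map and reduce calls would be distributed across nodes by the framework; the sketch only shows how work is expressed as key-value pairs.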

In one embodiment, the multiple preset models being obtained by applying several algorithms in combination under the MapReduce framework includes: on a distributed storage platform, the multiple preset models are obtained by applying several algorithms in combination under the MapReduce framework. Here, the MapReduce framework may be a programming model framework for parallel computation over large data sets (larger than 1 TB). It should be noted that the concepts in MapReduce, "Map" (mapping, which maps a set of key-value pairs to a new set of key-value pairs) and "Reduce" (reduction, which ensures that all mapped key-value pairs sharing the same key are grouped together), are both derived from features of functional programming languages and vector programming languages. In a specific implementation, the MapReduce framework lets programmers who are not familiar with distributed parallel programming run their programs on a distributed system, achieving parallel computation and improving the efficiency and accuracy of computation. In this embodiment, to prepare in advance the multiple preset models stored in the knowledge database, the sample data may be thoroughly mined on the distributed storage platform (the Hadoop platform) by a combination of algorithms based on the MapReduce framework (including clustering, which yields multiple sample types, and training, which yields multiple preset models), so as to obtain accurate preset models. A specific implementation may include:

S1: Cluster multiple samples with the K-means (K-averages) clustering algorithm to obtain multiple sample types. This may specifically include:

1) Choose k sample data items (k being the planned number of classes) from the device status data set of the exploration data center as centers.

2) Measure the distance from every data item to each center, find the minimum distance, and assign the item to the corresponding class, thereby obtaining the initial sample types.

3) Recalculate the center of each class. Repeat steps 2) and 3) until the set threshold is met. In the main function, a suitable threshold must be designed, and an iterative program repeatedly calls the Map function and the Reduce function until the set threshold is met, yielding the multiple sample types.
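Steps 1) to 3) can be sketched as an iterated map/reduce pair in plain Python. The sketch makes several assumptions that the application does not specify: the first k samples are used as initial centers, distance is Euclidean, and the stopping threshold is the largest movement of any center between iterations.

```python
import math
from collections import defaultdict


def nearest(point, centers):
    """Index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def kmeans_map(points, centers):
    """Map: emit (center index, point) for every sample (step 2)."""
    return [(nearest(p, centers), p) for p in points]


def kmeans_reduce(pairs, centers):
    """Reduce: average the points assigned to each center (step 3)."""
    groups = defaultdict(list)
    for index, point in pairs:
        groups[index].append(point)
    new_centers = []
    for i, old in enumerate(centers):
        members = groups.get(i)
        if not members:
            new_centers.append(old)  # keep the center of an empty cluster
            continue
        new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centers


def kmeans(points, k, threshold=1e-6, max_iter=100):
    """Main function: call map and reduce repeatedly until centers stop moving."""
    centers = [tuple(p) for p in points[:k]]  # step 1: first k samples (assumption)
    for _ in range(max_iter):
        new_centers = kmeans_reduce(kmeans_map(points, centers), centers)
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < threshold:
            break
    labels = [nearest(p, centers) for p in points]
    return centers, labels


# Hypothetical two-attribute status samples forming two obvious groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (8.0, 9.0), (2.0, 1.0), (9.0, 8.0)]
centers, labels = kmeans(points, k=2)
```

Each label is the index of the sample type to which the corresponding sample was assigned; in a Hadoop job the map and reduce functions would run as distributed tasks, with the main loop resubmitting the job on each iteration.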

S2: Train on the multiple sample types with the naive Bayes algorithm to obtain the preset models. Referring to Fig. 2, this may specifically include:

S2-1: Let x = {a₁, a₂, ..., a_m} be an item to be classified, where each a is a characteristic attribute of x.

S2-2: Let the category set be C = {y₁, y₂, ..., y_n}.

S2-3: Calculate P(y₁|x), P(y₂|x), ..., P(y_n|x).

S2-4: If P(y_k|x) = max{P(y₁|x), P(y₂|x), ..., P(y_n|x)}, then x ∈ y_k.

S2-5: Through repeated tests, and according to the actual recognition results, correct each characteristic attribute in the sample types in a targeted manner several times to obtain the preset models.

Here, x is the status data to be analyzed; a₁, a₂, ..., a_m are the characteristic attribute data in the data to be analyzed; C is the set of preset models; y₁, y₂, ..., y_n are the preset models; and P(y₁|x), P(y₂|x), ..., P(y_n|x) are the probabilities that x belongs to the preset models y₁, y₂, ..., y_n respectively.

In a specific implementation, reference may also be made to Fig. 3. The Mac1 data strip may be the set of status data of each target object in the system acquired at a certain point in time; that is, it corresponds to the item to be classified represented by x above. Each cell of the Mac1 data strip corresponds to one kind of status data of a target object; that is, the data in each cell corresponds to one of the characteristic attribute data a₁, a₂, ..., a_m above. It should be noted that several different items of status data of the same target may be acquired at the same point in time, as the case requires. For example, in Fig. 3, the data 20% in the 5th cell, the data 10 in the 6th cell, and the data 5 in the 7th cell may all be status data of a certain CPU in the system at one point in time. Specifically, 20% may be the status data of the remaining space of that CPU, 10 may be the status data of its Swap (swap partition) usage, and 5 may be the status data of its Buffer usage. Correspondingly, C in the above formula corresponds to the set of preset models, and y₁, y₂, ..., y_n correspond to the specific preset models in that set. For example, y₁, y₂, ..., y_n may be a CPU fault model, a fan fault model, a GPU fault model, and so on, where each preset model may include the various status data values of the corresponding target object. Calculating P(y₁|x), P(y₂|x), ..., P(y_n|x) amounts to calculating, from the degree of similarity between each status data value in Mac1 and the corresponding status data value in each preset model y₁, y₂, ..., y_n, the probability that Mac1 belongs to each preset model. From these probability values it can then be judged which preset model's state the system state corresponding to Mac1 belongs to, for example the state when a CPU fails, the state when a GPU fails, or some other state. Calculating P(y_k|x) = max{P(y₁|x), P(y₂|x), ..., P(y_n|x)} amounts to determining, from the probabilities that Mac1 belongs to each preset model, that the preset model with the largest value is the closest preset model for Mac1. The state represented by Mac1 can then be regarded as the state corresponding to that preset model. For example, if calculation shows that the probability that Mac1 belongs to preset model y₂ is the largest, and y₂ corresponds to the situation in which a CPU overheats, it can be judged that during the period in which the Mac1 data was collected, some CPU in the system was running overheated.

It should be noted that, in the above embodiment, each conditional probability in step S2-3 may be calculated as follows:

S2-3-1: Find a set of items to be classified whose classes are known; this set is called the training sample set.

S2-3-2: Obtain by counting the estimate of the conditional probability of each characteristic attribute under each category, that is:

P(a₁|y₁), P(a₂|y₁), ..., P(a_m|y₁); P(a₁|y₂), P(a₂|y₂), ..., P(a_m|y₂); ...; P(a₁|y_n), P(a₂|y_n), ..., P(a_m|y_n)

Here, x is the status data to be analyzed; a₁, a₂, ..., a_m are the characteristic attribute data in the data to be analyzed; and P(a₁|y₁), P(a₂|y₁), ..., P(a_m|y₁), ..., P(a₁|y_n), P(a₂|y_n), ..., P(a_m|y_n) are the probabilities that each characteristic attribute belongs to the range of each of the preset models y₁, y₂, ..., y_n.

S2-3-3: If the characteristic attributes are conditionally independent, Bayes' theorem gives the following derivation:

P(y_i|x) = P(x|y_i) P(y_i) / P(x)

Because the denominator is constant for all categories, it suffices to maximize the numerator. And because the characteristic attributes are conditionally independent:

P(x|y_i) P(y_i) = P(a₁|y_i) P(a₂|y_i) ... P(a_m|y_i) P(y_i)

Here, P(a₁|y_i) P(a₂|y_i) ... P(a_m|y_i) is the product of the probabilities that each characteristic attribute belongs to preset model y_i, P(y_i) is the probability that preset model y_i occurs, P(x) is the total probability, and P(y_i|x) is the probability that the status data x belongs to preset model y_i.
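The counting in S2-3-2 and the decision rule in S2-4 can be sketched in Python as below. The sketch assumes discrete attribute values and adds add-one (Laplace) smoothing and log-probabilities for numerical safety; neither is stated in the application. The sample data and the class names (`cpu_overheat`, `fan_fault`, `healthy`) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(samples):
    """Count class and attribute frequencies from (attributes, label) pairs."""
    class_counts = Counter(label for _, label in samples)
    feature_counts = defaultdict(Counter)  # (label, position) -> value counts
    values = defaultdict(set)              # position -> distinct values seen
    for features, label in samples:
        for j, value in enumerate(features):
            feature_counts[(label, j)][value] += 1
            values[j].add(value)
    return class_counts, feature_counts, values


def posterior_scores(x, model):
    """Return log(P(y_i) * prod_j P(a_j | y_i)) for every class y_i."""
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        log_score = math.log(count / total)  # log P(y_i)
        for j, value in enumerate(x):
            # P(a_j | y_i) estimated by counting, with add-one smoothing (assumption)
            numerator = feature_counts[(label, j)][value] + 1
            denominator = count + len(values[j])
            log_score += math.log(numerator / denominator)
        scores[label] = log_score
    return scores


def classify(x, model):
    """S2-4: pick the class with the largest posterior score."""
    scores = posterior_scores(x, model)
    return max(scores, key=scores.get)


# Hypothetical training set: (temperature level, fan noise level) -> sample type.
samples = [
    (("high", "high"), "cpu_overheat"),
    (("high", "normal"), "cpu_overheat"),
    (("normal", "high"), "fan_fault"),
    (("normal", "normal"), "healthy"),
    (("normal", "normal"), "healthy"),
    (("low", "normal"), "healthy"),
]
model = train(samples)
predicted = classify(("high", "high"), model)
```

Because P(x) is the same for every class, the scores omit it; only their ordering matters for choosing the matching preset model.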

Step 103: Determine each target object whose probability of failure is greater than a preset threshold as an object to be monitored.

In one embodiment, the preset threshold may be set as the case requires. When the probability that a target object will fail is greater than the preset threshold, the target object has not yet failed, but it can be judged to have a high risk of failure; that is, it is likely to fail within some future period and needs close attention so that the failure can be prevented in time. Target objects whose probability of failure is greater than the preset threshold may therefore be taken as objects to be monitored and be subjected to close monitoring and other related processing.
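The selection in step 103 amounts to a strict comparison against the preset threshold. A minimal sketch follows; the identifiers, probabilities, and the threshold value 0.6 are hypothetical, and ordering the result by descending probability is an added convenience rather than something the application requires.

```python
def select_objects_to_monitor(failure_probabilities, threshold):
    """Return ids whose failure probability exceeds threshold, highest risk first."""
    flagged = [
        (target_id, probability)
        for target_id, probability in failure_probabilities.items()
        if probability > threshold
    ]
    flagged.sort(key=lambda item: item[1], reverse=True)
    return [target_id for target_id, _ in flagged]


# Hypothetical failure probabilities produced by step 102.
probabilities = {"cpu-0": 0.82, "gpu-1": 0.10, "fan-3": 0.65, "disk-2": 0.60}
to_monitor = select_objects_to_monitor(probabilities, threshold=0.6)
```

Note that `disk-2`, whose probability equals the threshold, is not selected, because the method requires the probability to be greater than the threshold.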

Step 104：Determine the reason for object to be monitored breaks down, and broken down according to the object to be monitored The reason for, the object to be monitored is monitored.

In one embodiment, for the generation of trouble saving, process in time or early warning incipient fault, Ke Yijin One step determines the reason for object to be monitored breaks down.Can specifically include according to the status data of the object to be monitored and with The preset model of the status data matching of the object to be monitored, determines the reason for object to be monitored breaks down.Need Illustrate, here with the preset model of object matching to be monitored obtained by great amount of samples data processing, and store In knowledge data base.Wherein, the preset model is contained and the related bulk information of the monitored object.According to these letters Breath, it may be determined that the reason for monitored object breaks down.

In one embodiment, in order to prevent potential faults from occurring, the object to be monitored can be monitored according to the cause of its fault. The monitoring can include performing, according to the cause of the fault of the object to be monitored and the probability that the object to be monitored will fail, at least one of the following kinds of business processing: repairing, deleting or replacing an object to be monitored in the system that has already failed; repairing, deleting or replacing an object to be monitored in the system that has not yet failed; and issuing an alarm for an object to be monitored in the system. In a specific implementation, one of the above kinds of monitoring can be performed on the object to be monitored, or several of them can be performed. Of course, other suitable methods can also be used to monitor the object to be monitored, as the case may be. The application does not limit this.
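A minimal Python sketch of how the business processing might be selected from the fault cause and the fault probability is given below. The cause names, the cause-to-action table, the `act_threshold` parameter and the rule that an alarm is always issued are all assumptions made for illustration; the application does not prescribe a particular selection rule.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical mapping from a diagnosed fault cause to a handling action.
CAUSE_TO_ACTION = {
    "disk_wear": "replace",
    "config_error": "repair",
    "orphaned_service": "delete",
}


@dataclass
class MonitoredObject:
    name: str
    fault_probability: float
    cause: str
    has_failed: bool


def choose_actions(obj: MonitoredObject, act_threshold: float = 0.9) -> List[str]:
    """Pick at least one kind of business processing for an object to be monitored."""
    actions = ["alarm"]  # in this sketch an alarm is always issued
    if obj.has_failed or obj.fault_probability >= act_threshold:
        # Unknown causes fall back to "repair" (an assumption of this sketch).
        actions.append(CAUSE_TO_ACTION.get(obj.cause, "repair"))
    return actions


disk = MonitoredObject("storage-03", 0.95, "disk_wear", has_failed=False)
print(choose_actions(disk))
```

Here an object that has not yet failed is repaired, deleted or replaced only when its fault probability is very high, while an object that has already failed is always handled.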

In one embodiment, the object to be monitored can specifically be monitored according to the ITIL (Information Technology Infrastructure Library) process: in the manner of an IT service, the corresponding specific monitoring is performed on the object to be monitored according to the cause of its fault.

In one embodiment, in order to further improve the accuracy of fault monitoring, the original preset model can be corrected in a targeted manner according to the fed-back monitoring result. That is, after the business processing is performed according to the cause of the fault of the object to be monitored and the probability that the object to be monitored will fail, the method can further include:

S1: Store the result of the business processing in the knowledge database as monitoring result data.

S2: Correct the preset model according to the monitoring result data.

The correction can be a targeted modification of a specific parameter value of the preset model according to the fed-back monitoring result data, or a modification of the weights of the original characteristic parameters of the preset model. The application does not limit this.

In one embodiment, in order to obtain more comprehensive and more detailed state data, the channels for collecting the state data of each target object can be extended. Accordingly, acquiring the state data of one or more target objects in the system can specifically include:

S1: Receive a system problem uploaded by a user through a preset channel.

S2: Take the system problem as the state data.

In the embodiments of the application, compared with existing fault monitoring methods, the method uses distributed storage technology and, under the MapReduce framework, applies several algorithms in combination to fully analyze the collected state data of each target object, obtaining the fault probability and fault cause of each target object, so that preventive maintenance can be performed on target objects that have not yet failed. This solves the technical problems of existing fault monitoring methods, namely that they cannot give early warning of or monitor faults that have not yet occurred and that their fault monitoring accuracy is low, and achieves the technical effect of simultaneously monitoring faults that have occurred and faults that have not yet occurred in the system.

Based on the same inventive concept, an embodiment of the present invention further provides a fault monitoring device, as described in the following embodiments. Because the principle by which the device solves the problem is similar to that of the fault monitoring method, the implementation of the device can refer to the implementation of the method, and repeated details are omitted. As used below, the term "unit" or "module" can be a combination of software and/or hardware that implements a predetermined function. Although the device described in the following embodiments is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated. Referring to Fig. 4, which is a structural diagram of a fault monitoring device according to an embodiment of the present invention, the device can include: a state data acquisition module 401, a fault probability determining module 402, a to-be-monitored object determining module 403 and a to-be-monitored object processing module 404. The structure is described in detail below.

The state data acquisition module 401 can be used to acquire the state data of one or more target objects in the system.

The fault probability determining module 402 can be used to determine, according to the state data of the one or more target objects, the probability that each of the one or more target objects will fail.

The to-be-monitored object determining module 403 can be used to determine, among the target objects, each target object whose probability of failing exceeds the preset threshold as an object to be monitored.

The to-be-monitored object processing module 404 can be used to determine the cause of the fault of the object to be monitored and to monitor the object to be monitored according to the cause of the fault.

In one embodiment, in order to improve the efficiency and accuracy of state data acquisition, the state data acquisition module 401 can include:

A cluster division unit, configured to divide the multiple target objects into multiple clusters according to interface type, wherein target objects in the same cluster use the same type of interface.

A data acquisition unit, configured to acquire the state data of the target objects in the same cluster in the same data acquisition mode.

In one embodiment, in order to unify the differently formatted state data from different clusters, the state data acquisition module can also include a format conversion unit, configured to convert the state data of target objects located in different clusters into state data of the same format.
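The cooperation of the cluster division unit, the data acquisition unit and the format conversion unit can be sketched in Python as follows. This is only an illustrative sketch: the collector functions return made-up readings instead of contacting real equipment, and the field names (`host`, `Element`, `metric` and so on) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical inventory of target objects and the interface type each exposes.
devices = [
    {"name": "cpu-node-01", "interface": "ssh"},
    {"name": "gpu-node-07", "interface": "ssh"},
    {"name": "storage-03", "interface": "smi-s"},
]


def group_by_interface(devs: List[dict]) -> Dict[str, List[str]]:
    """Cluster division: objects that use the same interface type share a cluster."""
    clusters = defaultdict(list)
    for dev in devs:
        clusters[dev["interface"]].append(dev["name"])
    return dict(clusters)


# Stand-in collectors, one per interface type; each returns its own raw format.
def collect_ssh(name: str) -> dict:
    return {"host": name, "cpu_load": "0.75"}


def collect_smis(name: str) -> dict:
    return {"Element": name, "UsedPct": 62}


COLLECTORS = {"ssh": collect_ssh, "smi-s": collect_smis}


def to_unified(interface: str, raw: dict) -> dict:
    """Format conversion: map each raw format onto one common record layout."""
    if interface == "ssh":
        return {"object": raw["host"], "metric": "cpu_load", "value": float(raw["cpu_load"])}
    if interface == "smi-s":
        return {"object": raw["Element"], "metric": "used_pct", "value": float(raw["UsedPct"])}
    raise ValueError(f"unsupported interface type: {interface}")


def acquire(devs: List[dict]) -> List[dict]:
    """Collect every cluster with its own collector, then unify the records."""
    records = []
    for interface, names in group_by_interface(devs).items():
        collect = COLLECTORS[interface]  # one acquisition mode per cluster
        records.extend(to_unified(interface, collect(name)) for name in names)
    return records


records = acquire(devices)
print(records)
```

Because every record leaves `acquire` in the same layout, downstream storage and analysis do not need to know which interface a reading came from.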

In one embodiment, in order to determine the probability that each of the one or more target objects will fail, the fault probability determining module 402 can include:

A first determining unit, configured to determine, according to the state data of the one or more target objects, one or more preset models that match the state data of the one or more target objects. It should be noted that the first determining unit can use the Naive Bayes algorithm to determine the preset model that best matches the state data.

A second determining unit, configured to determine, according to the one or more preset models, the probability that each of the one or more target objects will fail.

In one embodiment, in order to obtain the multiple preset models, the fault probability determining module 402 can also include a preset model establishing unit, configured to: obtain multiple pieces of sample data; obtain multiple sample types from the sample data by the K-means clustering algorithm; and train on the multiple sample types by the Naive Bayes algorithm to obtain the multiple preset models.

In one embodiment, in order to determine the cause of the fault of the object to be monitored, the to-be-monitored object processing module 404 can include a fault cause determining unit, configured to determine the cause of the fault of the object to be monitored according to the state data of the object to be monitored and the preset model that matches that state data.

In one embodiment, in order to monitor the object to be monitored so as to prevent faults or handle faults in time, the to-be-monitored object processing module 404 can include a processing unit, configured to perform, according to the cause of the fault of the object to be monitored and the probability that the object to be monitored will fail, at least one of the following kinds of business processing: repairing, deleting or replacing an object to be monitored in the system that has already failed; repairing, deleting or replacing an object to be monitored in the system that has not yet failed; and issuing an alarm for an object to be monitored in the system.

In one embodiment, in order to improve the accuracy of the preset model and thereby the precision of fault monitoring, the fault probability determining module 402 can also include a preset model correcting unit, configured to store the result of the business processing in the knowledge database as monitoring result data, and to correct the preset model in a targeted manner according to the monitoring result data.

In one embodiment, in order to collect more comprehensive and accurate state data, the state data acquisition module 401 can include a feedback unit, configured to receive a system problem uploaded by a user through a preset channel and to take the system problem as the state data.

The embodiments in this specification are described progressively; for identical or similar parts, the embodiments can be referred to one another, and each embodiment focuses on its differences from the others. In particular, the system embodiment is described relatively briefly because it is substantially similar to the method embodiment; for relevant details, refer to the description of the method embodiment.

It should be noted that the systems, devices, modules or units set forth in the above embodiments can specifically be implemented by a computer chip or an entity, or by a product having a certain function. For convenience of description, the above device is described with its functions divided into various units. Of course, when the application is implemented, the functions of the units can be realized in one or more pieces of software and/or hardware.

In addition, in this specification, adjectives such as "first" and "second" are used only to distinguish one element or action from another, without requiring or implying any actual relationship or order between them. Where the context permits, a reference to an element, component or step should not be interpreted as being limited to only one of that element, component or step, but can be to one or more of them.

As can be seen from the above description, the fault monitoring method and device provided by the embodiments of the application use a distributed storage platform and, under the MapReduce framework, apply several algorithms in combination to intelligently analyze the collected state data of each target object, obtaining the fault probability and fault cause of each target object, so that target objects in the system that have not yet failed can be monitored and their faults prevented. This solves the technical problems of existing fault monitoring methods, namely that they cannot give early warning of or monitor faults that have not yet occurred and that their fault monitoring accuracy is low, and achieves the technical effect of simultaneously monitoring faults that have occurred and faults that have not yet occurred in the system while improving fault monitoring accuracy. Furthermore, by dividing the target objects into corresponding clusters, acquiring the state data of the target objects in the same cluster, and processing that state data uniformly, the efficiency of state data acquisition is improved and errors in the state data are reduced. In addition, by using a distributed (Hadoop) storage platform and applying several algorithms in combination under the MapReduce framework to mine the state data in depth, the fault probability and fault cause of each target object are obtained, which further improves fault monitoring accuracy. Moreover, by performing targeted prevention or maintenance for faults that have occurred and faults that have not yet occurred in the system according to the fault probability and fault cause, the technical effect of effectively keeping the system stable is achieved. Finally, the preset model is corrected in a targeted manner according to the monitoring results, which improves the precision of the preset model and further improves fault monitoring accuracy.

In a specific implementation scenario, the fault monitoring method/device provided by the application is used to perform fault monitoring on the data system of an exploration data center.

Referring to Fig. 5, which is a schematic diagram of maintaining the data system of an exploration data center with the fault monitoring method/device proposed by the application, the system can specifically include:

1) Data monitoring and acquisition module

Through integration, unified monitoring of the discrete modules of the data center's various systems (CPU cluster, GPU cluster, storage, network, infrastructure) is realized.

2) ITIL process module

Faults are found by the monitoring system and automatically submitted to the ITIL (Information Technology Infrastructure Library) process, realizing efficient IT service.

Users submit the problems encountered in research and production through the ITIL service desk in a unified way. The handling of each problem is logged in detail, and users and administrators can track the handling process and its result.

3) Fault processing module based on the Hadoop platform

Through alarm data acquisition, fault filtering and fault correlation analysis, all kinds of faults are quickly located and resolved.

Through comprehensive multi-algorithm analysis under the MapReduce framework, potential faults in the system are found and reported, realizing proactive defense in advance.

4) Knowledge base and performance analysis

A performance analysis system integrating data integration, information query, online analysis, multidimensional analysis and dynamic reports is established to help decision makers analyze information from multiple angles. This includes statistics on various resources, on duty status and on routine work. Indicators are then established for the various statistical items, and decisions are made according to an indicator or a combination of indicators.

The database is linked with the ITIL process and with the multi-algorithm analysis under the MapReduce framework, so that new knowledge can be continuously added to the knowledge base and the fault handling capability is strengthened.

It should be noted that, in a specific implementation, the above data monitoring and acquisition module can refer to Fig. 6, that is, the general state data acquisition proposed by the embodiments of the application, which completes the acquisition for all kinds of equipment in the exploration data center. It includes:

1) Different kinds of equipment provide different protocol interfaces. For example, CPU/GPU clusters provide equipment information over SSH, while storage devices generally provide the SMI-S protocol. The state information of each kind of equipment is obtained according to the specific situation.

2) For the collected data, a general state data conversion module is used to realize unified storage (HBase) and unified display of all data.

When the above fault processing module based on the Hadoop platform performs fault analysis, reference can be made to Fig. 7, that is, the multi-algorithm comprehensive analysis model under the MapReduce framework proposed by the embodiments of the application, which is also the core of the whole processing module. It includes:

1) The state acquisition module completes the collection of the running state data of each piece of equipment in the exploration data center and, through a unified model, realizes state data acquisition for the CPU cluster, the GPU cluster, the network equipment and the storage equipment.

2) The state data storage module uses HBase to efficiently store the large volume of dynamic time-series state data and historical data.

3) The analysis and processing module for running state data is the core of this scheme and includes two algorithms implemented under the MapReduce framework. The K-means clustering algorithm clusters the running state data, and the cluster centers of each resulting running state are used as samples to form a sample knowledge base. The Bayes algorithm is trained on each knowledge base and discriminates the data to be tested, finally achieving fault early warning.

When the fault processing module is used to perform fault analysis, reference can be made to Fig. 8, which shows the implementation flow of the K-means clustering algorithm under the MapReduce framework proposed by the embodiments of the application.

The K-means clustering algorithm is an iterative process. Specifically, the iteration can proceed as follows:

S1: Select k data items (k being the planned number of classes) from the exploration data center equipment state data set as the centers.

S2: Measure the distance from every data item to each center, find the minimum distance, and assign the data item to the corresponding class.

S3: Recalculate the center of each class.

Steps S2 and S3 are repeated until the set threshold is met. In the main function, the threshold needs to be set correctly, and an iterative program repeatedly calls the Map function and the Reduce function until the set threshold is met.
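The iteration above can be sketched in single-process Python, with one function standing in for the Map step (S2) and one for the Reduce step (S3). This is a minimal sketch, not the Hadoop implementation: the use of Euclidean distance, the stopping rule (largest center movement below `tol`) and the sample points are assumptions, since the application only states that the iteration ends when a set threshold is met.

```python
import math
from collections import defaultdict
from typing import Iterable, Iterator, List, Sequence, Tuple

Point = Tuple[float, ...]


def nearest(point: Point, centers: Sequence[Point]) -> int:
    """Index of the center with the minimum Euclidean distance to the point."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def map_phase(points: Iterable[Point], centers: Sequence[Point]) -> Iterator[Tuple[int, Point]]:
    """S2: emit (index of nearest center, point) for every data item."""
    for p in points:
        yield nearest(p, centers), p


def reduce_phase(pairs: Iterable[Tuple[int, Point]], centers: Sequence[Point]) -> List[Point]:
    """S3: recompute each center as the mean of the points assigned to it."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    new_centers = list(centers)  # a center with no points keeps its old position
    for idx, members in groups.items():
        new_centers[idx] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return new_centers


def kmeans(points: Sequence[Point], centers: Sequence[Point],
           tol: float = 1e-6, max_iter: int = 100) -> List[Point]:
    """Driver loop: repeat map and reduce until the centers move less than tol."""
    centers = list(centers)
    for _ in range(max_iter):
        new_centers = reduce_phase(map_phase(points, centers), centers)
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers


# Hypothetical two-dimensional state readings; S1 picks two of them as centers.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centers = kmeans(points, [points[0], points[2]])
print(centers)
```

In a real MapReduce job the driver would submit one job per iteration and pass the current centers to every mapper; the loop in `kmeans` plays that role here.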

Fig. 2 shows the implementation process of the Naive Bayes algorithm under the MapReduce framework proposed by the embodiments of the application.

The Naive Bayesian classifier is a statistics-based classification technique that includes two parts, training and discrimination. A specific implementation can include:

S1: Let x = {a₁, a₂, ..., a_m} be an item to be classified, where each a is a characteristic attribute of x.

S2: There is a category set C = {y₁, y₂, ..., y_n}.

S3: Calculate P(y₁|x), P(y₂|x), ..., P(y_n|x).

S4: If P(y_k|x) = max{P(y₁|x), P(y₂|x), ..., P(y_n|x)}, then x ∈ y_k.

The key is therefore how to calculate each conditional probability in step S3. In a specific implementation:

S3-1: Find a set of items whose classification is already known; this set is called the training sample set.

S3-2: Obtain by statistics an estimate of the conditional probability of each characteristic attribute under each category, that is:

P(a₁|y₁), P(a₂|y₁), ..., P(a_m|y₁); P(a₁|y₂), P(a₂|y₂), ..., P(a_m|y₂); ...; P(a₁|y_n), P(a₂|y_n), ..., P(a_m|y_n)

S3-3: If the characteristic attributes are conditionally independent, the following follows from Bayes' theorem:

P(y_i|x) = P(x|y_i) P(y_i) / P(x)

Because the denominator is constant for all categories, it is sufficient to maximize the numerator. And because the characteristic attributes are conditionally independent:

P(x|y_i) P(y_i) = P(a₁|y_i) P(a₂|y_i) ... P(a_m|y_i) P(y_i)

It should be noted that running the algorithm under the MapReduce framework can specifically include the following three steps:

S1: Data preparation stage: split the data.

S2: Data classification training stage: the Map tasks compute the value of P(y_i) for each category.

S3: Data classification stage: the Reduce tasks compute P(x|y_i)P(y_i) for each category and find the maximum P(x|y_i)P(y_i); the corresponding category is the one to which the sample under test belongs.
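The training and classification stages can be illustrated with the following single-process Python sketch, in which counting plays the role of the Map tasks and the argmax over P(x|y_i)P(y_i) plays the role of the Reduce tasks. The categorical attributes, the sample records and the add-one (Laplace) smoothing are assumptions of this sketch and are not stated in the application.

```python
from collections import Counter, defaultdict
from typing import Hashable, List, Sequence, Tuple

Sample = Tuple[Sequence[Hashable], str]  # (characteristic attributes, category)


def train(samples: List[Sample]):
    """Training: count categories and, per category, the values of each attribute."""
    class_counts = Counter()
    attr_counts = defaultdict(Counter)  # (category, attribute index) -> value counts
    attr_values = defaultdict(set)      # attribute index -> distinct values seen
    for attrs, label in samples:
        class_counts[label] += 1
        for j, value in enumerate(attrs):
            attr_counts[(label, j)][value] += 1
            attr_values[j].add(value)
    return class_counts, attr_counts, attr_values


def score(x: Sequence[Hashable], label: str, model) -> float:
    """Compute P(x|y_i)P(y_i) for one category under conditional independence."""
    class_counts, attr_counts, attr_values = model
    result = class_counts[label] / sum(class_counts.values())  # P(y_i)
    for j, value in enumerate(x):
        # P(a_j|y_i) with add-one smoothing (an assumption of this sketch).
        result *= (attr_counts[(label, j)][value] + 1) / (
            class_counts[label] + len(attr_values[j])
        )
    return result


def classify(x: Sequence[Hashable], model) -> str:
    """Classification: return the category with the largest P(x|y_i)P(y_i)."""
    return max(model[0], key=lambda label: score(x, label, model))


# Hypothetical training set: (load level, temperature) -> running state.
samples = [
    (("high", "hot"), "fault"),
    (("high", "hot"), "fault"),
    (("low", "cool"), "normal"),
    (("low", "hot"), "normal"),
    (("low", "cool"), "normal"),
]
model = train(samples)
print(classify(("high", "hot"), model))
```

The denominator P(x) is never computed, because it is the same for every category and does not change which category attains the maximum.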

By applying the fault monitoring method/device provided by the embodiments of the application in a specific implementation scenario, it is verified that the method/device can indeed solve the technical problems of existing fault monitoring methods, namely that they cannot find potential faults in the system and that their fault monitoring accuracy is low, and achieves the technical effect of simultaneously monitoring and handling faults that have occurred and faults that have not yet occurred.

Although different fault monitoring methods or devices are mentioned in the content of the application, the application is not limited to what is described by industry standards or by the embodiments. Implementations slightly modified from certain industry standards, or from the embodiments described in a customized manner, can also achieve effects that are identical, equivalent or similar to those of the above embodiments, or that are predictable after the variation. Embodiments that apply such modified or varied ways of data acquisition, processing, output and judgment can still fall within the scope of optional implementations of the application.

Although the application provides the method operation steps described in the embodiments or flow charts, more or fewer operation steps can be included based on conventional or non-inventive means. The order of steps listed in the embodiments is only one of many possible execution orders and does not represent the only execution order. When an actual device or client product executes, the steps can be executed sequentially or in parallel according to the embodiments or the methods shown in the drawings (for example, in an environment with parallel processors or multi-threaded processing, or even a distributed data processing environment). The terms "include", "comprise" or any other variant are intended to cover non-exclusive inclusion, so that a process, method, product or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to the process, method, product or device. Without further limitation, the presence of other identical or equivalent elements in the process, method, product or device that includes the elements is not excluded.

The devices or modules set forth in the above embodiments can specifically be implemented by a computer chip or an entity, or by a product having a certain function. For convenience of description, the above device is described with its functions divided into various modules. Of course, when the application is implemented, the functions of the modules can be realized in one or more pieces of software and/or hardware, and a module realizing a single function can also be realized by a combination of several sub-modules. The device embodiments described above are only schematic. For example, the division into modules is only a division by logical function, and other divisions are possible in an actual implementation: several modules or components can be combined or integrated into another system, or some features can be ignored or not executed.

Those skilled in the art also know that, in addition to implementing a controller purely as computer-readable program code, it is entirely possible to logically program the method steps so that the controller realizes the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Such a controller can therefore be regarded as a hardware component, and the means included in it for realizing various functions can also be regarded as structures within the hardware component. Or even, the means for realizing various functions can be regarded both as software modules implementing a method and as structures within a hardware component.

The application can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, classes and the like that perform particular tasks or implement particular abstract data types. The application can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules can be located in both local and remote computer storage media, including storage devices.

From the description of the above embodiments, those skilled in the art can clearly understand that the application can be implemented by software plus the necessary general hardware platform. Based on this understanding, the technical solution of the application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as a ROM/RAM, a magnetic disk or an optical disc, and includes several instructions for causing a computer device (which can be a personal computer, a mobile terminal, a server, a network device or the like) to execute the methods described in the embodiments, or in some parts of the embodiments, of the application.

The embodiments in this specification are described progressively; for identical or similar parts, the embodiments can be referred to one another, and each embodiment focuses on its differences from the others. The application can be used in many general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, laptop devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable electronic devices, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices.

Although the application has been described through embodiments, those of ordinary skill in the art will appreciate that the application has many variations and changes that do not depart from its spirit, and it is intended that the appended claims cover these variations and changes without departing from the spirit of the application.

Claims

1. A fault monitoring method, characterized by comprising:

Acquiring state data of one or more target objects in a system;

Determining, according to the state data of the one or more target objects, the probability that each of the one or more target objects will fail;

Taking each target object whose probability of failing exceeds a preset threshold as an object to be monitored;

Determining the cause of the fault of the object to be monitored, and monitoring the object to be monitored according to the cause of the fault.

2. The method according to claim 1, characterized in that acquiring the state data of one or more target objects in the system comprises:

Dividing the multiple target objects into multiple clusters according to interface type, wherein target objects in the same cluster use the same type of interface;

Acquiring the state data of the target objects in the same cluster in the same data acquisition mode.

3. The method according to claim 2, characterized in that, after the state data of the target objects in the same cluster is acquired in the same data acquisition mode, the method further comprises:

Converting the state data of target objects located in different clusters into state data of the same format.

4. The method according to claim 1, characterized in that determining, according to the state data of the one or more target objects, the probability that each of the one or more target objects will fail comprises:

Determining, according to the state data of the one or more target objects, one or more preset models that match the state data of the one or more target objects;

Determining, according to the one or more preset models, the probability that each of the one or more target objects will fail.

5. The method according to claim 4, characterized in that determining the cause of the fault of the object to be monitored comprises:

Determining the cause of the fault of the object to be monitored according to the state data of the object to be monitored and the preset model that matches the state data of the object to be monitored.

6. The method according to claim 4, characterized in that monitoring the object to be monitored according to the cause of the fault of the object to be monitored comprises:

Performing, according to the cause of the fault of the object to be monitored and the probability that the object to be monitored will fail, at least one of the following kinds of business processing: repairing, deleting or replacing an object to be monitored in the system that has already failed; repairing, deleting or replacing an object to be monitored in the system that has not yet failed; and issuing an alarm for an object to be monitored in the system.

7. The method according to claim 6, characterized in that, after the business processing is performed according to the cause of the fault of the object to be monitored and the probability that the object to be monitored will fail, the method further comprises:

Storing the result of the business processing in a knowledge database as monitoring result data;

Correcting the preset model according to the monitoring result data.

8. The method according to claim 1, characterized in that acquiring the state data of one or more target objects in the system comprises:

Receiving a system problem uploaded by a user through a preset channel;

Taking the system problem as the state data.

9. The method according to claim 4, characterized in that the multiple preset models are obtained by applying a preset algorithm under the MapReduce framework, wherein the preset algorithm comprises a clustering algorithm and/or a Bayesian algorithm.

10. The method according to claim 9, characterized in that the multiple preset models being obtained by applying a preset algorithm under the MapReduce framework comprises: on a distributed storage platform, obtaining the multiple preset models by applying the preset algorithm under the MapReduce framework.

11. The method according to claim 1, characterized in that, after the state data of one or more target objects in the system is acquired, the state data is stored in the knowledge database in the form of a distributed database.

12. A fault monitoring device, characterized by comprising:

A state data acquisition module, configured to acquire state data of one or more target objects in a system;

A fault probability determining module, configured to determine, according to the state data of the one or more target objects, the probability that each of the one or more target objects will fail;

A to-be-monitored object determining module, configured to take each target object whose probability of failing exceeds a preset threshold as an object to be monitored;

A to-be-monitored object processing module, configured to determine the cause of the fault of the object to be monitored and to monitor the object to be monitored according to the cause of the fault.

13. The device according to claim 12, characterized in that the state data acquisition module comprises:

A cluster division unit, configured to divide the multiple target objects into multiple clusters according to interface type, wherein target objects in the same cluster use the same type of interface;

A data acquisition unit, configured to acquire the state data of the target objects in the same cluster in the same data acquisition mode.

14. The device according to claim 12, characterized in that the fault probability determining module comprises:

A preset model determining unit, configured to determine, according to the state data of the one or more target objects, one or more preset models that match the state data of the one or more target objects;

A fault probability determining unit, configured to determine, according to the one or more preset models, the probability that each of the one or more target objects will fail.

15. The device according to claim 12, characterized in that the to-be-monitored object processing module comprises:

A fault cause determining unit, configured to determine the cause of the fault of the object to be monitored according to the state data of the object to be monitored and the preset model that matches the state data of the object to be monitored;

A business processing unit, configured to perform, according to the cause of the fault of the object to be monitored and the probability that the object to be monitored will fail, at least one of the following kinds of business processing: repairing, deleting or replacing an object to be monitored in the system that has already failed; repairing, deleting or replacing an object to be monitored in the system that has not yet failed; and issuing an alarm for an object to be monitored in the system.