CN104954181A

CN104954181A - Method for warning faults of distributed cluster devices

Info

Publication number: CN104954181A
Application number: CN201510307233.0A
Authority: CN
Inventors: 葛祺; 于勇新
Original assignee: BEIJING GEO POLYMERIZATION NETWORK TECHNOLOGY Co Ltd
Current assignee: BEIJING GEO POLYMERIZATION NETWORK TECHNOLOGY Co Ltd
Priority date: 2015-06-08
Filing date: 2015-06-08
Publication date: 2015-09-30

Abstract

The invention provides a fault early-warning method for distributed cluster devices. The method includes: acquiring instant information data of the cluster and its nodes, and storing the instant information data in a system database, where it supplements the historical data; acquiring the required data from the system database according to the needs of an evaluation model to form a knowledge base, and periodically inputting the knowledge base into an implementation evaluation system; and outputting fault warning information from the implementation evaluation system according to the instant information data, the evaluation model, and the knowledge base. By combining the instant information of the cluster and nodes with multi-dimensional data such as historical data and operation and maintenance conclusions, the method warns of faults in the live network and provides a basis for operating and maintaining the devices, so that devices needing priority maintenance can be identified and device faults prevented.

Description

A fault early-warning method for distributed cluster equipment

Technical field

The invention belongs to the field of distributed data processing, and in particular relates to a fault early-warning method for distributed cluster equipment.

Background art

In recent years, as the theory of integrating low-cost clusters has matured, practical experience in implementing the technology has steadily improved. However, because the theoretical basis is precisely to scale horizontally using cheap, general-purpose servers, an individual cheap general-purpose server fails more often than a commercial-grade server. To meet the demand for stable data and services, node redundancy is required. Because such clusters are easy to build and relatively inexpensive, the range of applications of cloud platforms keeps expanding, and a cluster readily contains tens or hundreds of servers. Large sites even reach a scale of more than a thousand.

According to probability theory, even for a low-probability event, the number of occurrences grows significantly at the corresponding order of magnitude. One can therefore essentially conclude that, within a given period, single-server failures are bound to occur in a large-scale cluster. As the number of failed machines keeps growing, the load on the remaining machines keeps increasing, which in turn raises the failure frequency of the remaining machines.
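The scaling argument above can be made concrete with a short calculation. This is a minimal sketch that assumes every server fails independently with the same probability `p` within the observation window; the patent does not state a failure model, and the function name and the value `p = 0.001` are illustrative only. Under that assumption, the probability that at least one of `n` servers fails is `1 - (1 - p)**n`.

```python
def prob_any_failure(p: float, n: int) -> float:
    """Probability that at least one of n servers fails in the window.

    Assumes (hypothetically) that failures are independent and that each
    server fails with the same probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    # Same per-server failure probability, growing cluster size.
    for n in (10, 100, 1000):
        print(n, round(prob_any_failure(0.001, n), 4))
```

The independence assumption is optimistic for the situation the text describes: once failed machines shift load onto the remaining ones, failures become correlated, so the real probability is likely to be higher than this estimate.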

To address these problems, dedicated operation and maintenance (O&M) staff can be assigned to carry out regular inspections, or automatic monitoring scripts can be added on top of this to send real-time notifications. Both approaches, however, act after the fact and cannot predict which machines may need priority O&M.

Secondly, the usual O&M process only handles a fault and then closes it out. It does not establish the temporal and spatial relationships between cluster state and node state.

In addition, when planning a cluster, its hardware configuration, number of nodes, traffic topology, compute load balancing, and storage load balancing are all directly related to the nature and scale of the services the cluster carries. Yet cluster planning generally depends on the experience of whoever proposes the scheme, so neither qualitative analysis nor quantitative conclusions can be made.

Summary of the invention

The technical problem to be solved by the invention is to provide a fault early-warning method for distributed cluster equipment that warns of faults in the live network and provides a basis for the operation and maintenance of equipment, so that equipment needing priority maintenance can be identified and equipment failures prevented.

To solve the above technical problem, the invention provides a fault early-warning method for distributed cluster equipment, comprising:

acquiring instant information data of the cluster and its nodes, and storing the instant information data in a system database as a supplement to the historical data;

acquiring the required data from the system database according to the needs of an evaluation model to form a knowledge base, and periodically inputting the knowledge base into an implementation evaluation system;

outputting, by the implementation evaluation system, fault warning information according to the instant information data, the evaluation model, and the knowledge base.

Preferably, acquiring the instant information data of the cluster and nodes comprises:

deploying network probes at network nodes to collect real-time network-related data; deploying system probes on each node's system to collect system information data; and deploying business probes on each service node to capture business data through the business-layer software interface.

Preferably, the system information data comprises one or a combination of more than one of the following: CPU, memory, temperature, and disk data.
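One possible shape for the data a probe reports is sketched below. The patent does not define a record format, so the `ProbeSample` class, its fields, and the `collect_disk_sample` helper are hypothetical. Only disk usage is read here, because it is available from the Python standard library on all platforms; CPU, memory, and temperature readings would need platform-specific sources.

```python
import shutil
import time
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ProbeSample:
    """One reading reported by a probe (hypothetical record format)."""
    node: str                      # identifier of the node that was probed
    kind: str                      # "network", "system", or "business"
    metrics: Dict[str, float]      # metric name -> value
    timestamp: float = field(default_factory=time.time)


def collect_disk_sample(node: str, path: str = ".") -> ProbeSample:
    """System probe sketch: report the used fraction of the disk holding `path`."""
    usage = shutil.disk_usage(path)
    used_ratio = usage.used / usage.total if usage.total else 0.0
    return ProbeSample(node=node, kind="system",
                       metrics={"disk_used_ratio": used_ratio})


if __name__ == "__main__":
    print(collect_disk_sample("node-1"))
```

Keeping the three probe types in one record format, distinguished by `kind`, is a design choice of this sketch; it lets a single collection module store all of them the same way.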

Preferably, the method further comprises:

feeding the fault warning information back to the system database as a supplement to the fault sample data.

By combining the instant information of the cluster and nodes with multi-dimensional data such as historical data and O&M conclusions, the invention warns of faults in the live network and provides a basis for the operation and maintenance of equipment, so that equipment needing priority maintenance can be identified and equipment failures prevented. The invention also uses historical data to establish associations between the cluster and its hardware configuration, number of nodes, traffic topology, compute load balancing, storage load balancing, and related service expansion, providing a design basis for cluster planning. When planning a cluster, one can look up the historical data, examine each node's failure record or load capacity, and plan accordingly.
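The planning use described above, looking up each node's failure record and load in the historical data, could be sketched as a simple aggregation. The record shape `(node, load, failed)` and the function name `summarize_history` are assumptions made for illustration; the patent does not specify how the historical data is structured or summarized.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple

# Hypothetical historical record: (node id, load in [0, 1], whether a fault occurred).
Record = Tuple[str, float, bool]


def summarize_history(records: Iterable[Record]) -> Dict[str, Dict[str, float]]:
    """Aggregate per-node mean load and fault count from historical records."""
    loads = defaultdict(list)
    failures = defaultdict(int)
    for node, load, failed in records:
        loads[node].append(load)
        failures[node] += int(failed)
    return {
        node: {"mean_load": mean(values), "failures": failures[node]}
        for node, values in loads.items()
    }


if __name__ == "__main__":
    history = [("a", 0.5, False), ("a", 0.7, True), ("b", 0.2, False)]
    for node, summary in summarize_history(history).items():
        print(node, summary)
```

A planner could sort this summary by fault count or mean load to see which nodes, and by extension which hardware configurations, were under the most strain.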

Brief description of the drawings

Fig. 1 is a flow chart of a method for constituting a distributed cluster equipment fault early-warning system in an embodiment of the invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawing. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.

The main idea of the invention is as follows. First, data probe programs are deployed to collect the relevant data: 1) network probes are deployed at the relevant network nodes to collect real-time network-related data; 2) system probes are deployed on each node's system to collect information such as CPU, memory, temperature, and disk data; 3) business probes are deployed on each service node to capture business data through the business-layer software interface. A real-time collection module then stores this data in the system database.
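The role of the real-time collection module, writing probe data into the system database so that it accumulates as history, might look like the following sketch. SQLite, the `samples` table layout, and the function names are all assumptions chosen to keep the example self-contained; the patent does not name a database or schema.

```python
import json
import sqlite3
import time
from typing import Dict, List, Optional


def open_system_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the (hypothetical) system database and create the samples table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples ("
        "ts REAL NOT NULL, node TEXT NOT NULL, "
        "kind TEXT NOT NULL, metrics TEXT NOT NULL)"
    )
    return conn


def store_sample(conn: sqlite3.Connection, node: str, kind: str,
                 metrics: Dict[str, float], ts: Optional[float] = None) -> None:
    """Store one probe reading; stored rows double as the historical data."""
    conn.execute(
        "INSERT INTO samples (ts, node, kind, metrics) VALUES (?, ?, ?, ?)",
        (time.time() if ts is None else ts, node, kind, json.dumps(metrics)),
    )
    conn.commit()


def history_for(conn: sqlite3.Connection, node: str) -> List[Dict[str, float]]:
    """Return all stored readings for a node, oldest first."""
    rows = conn.execute(
        "SELECT metrics FROM samples WHERE node = ? ORDER BY ts", (node,)
    )
    return [json.loads(metrics) for (metrics,) in rows]


if __name__ == "__main__":
    db = open_system_db()
    store_sample(db, "node-1", "system", {"cpu": 0.4})
    print(history_for(db, "node-1"))
```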

Fig. 1 shows the flow chart of a fault early-warning method for distributed cluster equipment in an embodiment of the invention. The method shown comprises:

101. Acquire instant information data of the cluster and its nodes, and store the instant information data in a system database as a supplement to the historical data;

102. According to the needs of an evaluation model, acquire the required data from the system database to form a knowledge base, and periodically input the knowledge base into an implementation evaluation system;

103. The implementation evaluation system outputs fault warning information according to the instant information data, the evaluation model, and the knowledge base.
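Steps 102 and 103 can be illustrated with a deliberately simple evaluation model. The patent does not disclose what the evaluation model is, so the rule used here, warning when an instant reading exceeds the node's historical mean by more than `k` standard deviations, is a stand-in chosen for illustration, and the function names and the default `k = 3.0` are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List


def build_knowledge_base(history: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Step 102 sketch: reduce each node's history to the baseline the model needs."""
    return {
        node: {"mean": mean(values), "std": pstdev(values)}
        for node, values in history.items()
        if values
    }


def evaluate(instant: Dict[str, float],
             knowledge_base: Dict[str, Dict[str, float]],
             k: float = 3.0) -> List[str]:
    """Step 103 sketch: return nodes whose instant reading exceeds mean + k * std."""
    warnings = []
    for node, value in instant.items():
        baseline = knowledge_base.get(node)
        if baseline is None:
            continue  # no history for this node yet, so nothing to compare against
        if value > baseline["mean"] + k * baseline["std"]:
            warnings.append(node)
    return warnings


if __name__ == "__main__":
    history = {"n1": [0.30, 0.32, 0.31, 0.29], "n2": [0.50, 0.52, 0.48, 0.50]}
    kb = build_knowledge_base(history)
    print(evaluate({"n1": 0.31, "n2": 0.95}, kb))
```

Separating the knowledge base from the evaluation step mirrors the text: the knowledge base is rebuilt periodically from the database, while evaluation runs against each new instant reading.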

In a preferred embodiment of the invention, acquiring the instant information data of the cluster and nodes comprises: deploying network probes at network nodes to collect real-time network-related data; deploying system probes on each node's system to collect system information data; and deploying business probes on each service node to capture business data through the business-layer software interface.

In a preferred embodiment of the invention, the system information data comprises one or a combination of more than one of the following: CPU, memory, temperature, and disk data.

In a preferred embodiment of the invention, the method further comprises: feeding the fault warning information back to the system database as a supplement to the fault sample data.

According to the evaluation model, the data-mining module mines the relevant knowledge base from the historical data and periodically inputs it into the implementation evaluation system. The implementation evaluation system then combines the collected real-time information and the evaluation model with the mined knowledge base to output the relevant fault warnings. Finally, the results processed by the early-warning system are fed back as a supplement to the fault sample data. The whole system thus iterates on itself and gradually forms a stable evaluation network.
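The self-iteration described above, where warning results are fed back as fault samples that refine later evaluations, could take many forms. The sketch below shows one hypothetical form only: a warning threshold is re-tuned to sit midway between the highest reading that turned out healthy and the lowest reading that was followed by a fault. The sample format and the `retune_threshold` function are assumptions, not the patent's mechanism.

```python
from typing import List, Tuple

# Hypothetical fault sample: (metric value at warning time, whether a fault followed).
FaultSample = Tuple[float, bool]


def retune_threshold(samples: List[FaultSample], current: float) -> float:
    """Move the threshold midway between healthy and faulty readings.

    The current threshold is kept when either class has no samples yet,
    or when the two classes overlap and no clean separation exists.
    """
    faulty = [value for value, failed in samples if failed]
    healthy = [value for value, failed in samples if not failed]
    if not faulty or not healthy:
        return current
    highest_healthy, lowest_faulty = max(healthy), min(faulty)
    if highest_healthy >= lowest_faulty:
        return current
    return (highest_healthy + lowest_faulty) / 2.0


if __name__ == "__main__":
    samples = [(0.5, False), (0.9, True)]
    print(retune_threshold(samples, current=0.8))
```

Each evaluation round adds samples, so repeated calls let the threshold settle as evidence accumulates, which is one way to read the "stable evaluation network" the text refers to.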

The embodiments above further describe the objects, technical solutions, and beneficial effects of the invention in detail. It should be understood that the foregoing is only a specific embodiment of the invention and is not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims

1. A fault early-warning method for distributed cluster equipment, characterized in that it comprises:

2. The method of claim 1, characterized in that acquiring the instant information data of the cluster and nodes comprises:

3. The method of claim 1, characterized in that the system information data comprises one or a combination of more than one of the following: CPU, memory, temperature, and disk data.

4. The method of claim 1, characterized in that the method further comprises: