CN104954181A - Method for warning faults of distributed cluster devices - Google Patents

Publication number
CN104954181A
Authority
CN
China
Prior art keywords
data
node
instant messages
probe
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510307233.0A
Other languages
Chinese (zh)
Inventor
葛祺
于勇新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING GEO POLYMERIZATION NETWORK TECHNOLOGY Co Ltd
Original Assignee
BEIJING GEO POLYMERIZATION NETWORK TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
2015-06-08
Publication date
2015-09-30
Application filed by BEIJING GEO POLYMERIZATION NETWORK TECHNOLOGY Co Ltd
Priority to CN201510307233.0A
Publication of CN104954181A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a fault early-warning method for distributed cluster devices. The method comprises: acquiring instant information data of a cluster and its nodes, and storing the instant information data in a system database, where it supplements the historical data; acquiring the required data from the system database according to the needs of an evaluation model to form a knowledge base, and periodically inputting the knowledge base into an implementation evaluation system; and outputting fault early-warning information from the implementation evaluation system according to the instant information data, the evaluation model and the knowledge base. By combining the instant information of the cluster and nodes with multi-dimensional data such as historical data and operation and maintenance conclusions, the method gives early warning of faults on the live network and provides a basis for operating and maintaining the devices, so that devices needing focused maintenance can be identified and device faults prevented.

Description

A fault early-warning method for distributed cluster devices
Technical field
The invention belongs to the field of distributed data processing, and in particular relates to a fault early-warning method for distributed cluster devices.
Background
In recent years, as the theory of building clusters from low-cost machines has matured, practical experience in implementing the technology has steadily improved. However, because that theory rests precisely on scaling out horizontally with cheap, general-purpose servers, an individual cheap general-purpose server fails more often than a commercial-grade server. To meet the demand for stable data and services, node redundancy is required. Because such clusters are easy to build and relatively inexpensive, the range of application of cloud platforms keeps expanding, and a cluster easily contains tens or hundreds of servers. Large sites even reach a scale of more than a thousand.
According to probability theory, even for a low-probability event, the number of occurrences grows markedly once the population reaches the corresponding order of magnitude. It can therefore essentially be concluded that, within a given period, single-server failures are bound to occur in a large-scale cluster. As the number of failed machines keeps growing, the load on the remaining machines keeps increasing, which in turn raises the failure frequency of the remaining machines.
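The probabilistic argument can be made concrete with a short calculation. The sketch below assumes, purely for illustration, that servers fail independently and with the same probability p in a given period; the patent does not state a failure model, and the value p = 0.001 is hypothetical.

```python
def prob_at_least_one_failure(p: float, n: int) -> float:
    """Probability that at least one of n servers fails in the period,
    assuming independent failures with per-server probability p."""
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    p = 0.001  # hypothetical per-server failure probability per period
    for n in (10, 100, 1000, 5000):
        print(f"{n:>5} servers: {prob_at_least_one_failure(p, n):.3f}")
```

Under these assumptions a failure that is rare for any single machine becomes likely at scale: with a thousand servers the probability of at least one failure in the period is about 63%.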
To address these problems, dedicated operation and maintenance (O&M) staff can be assigned to carry out regular inspections, or automatic monitoring scripts can be added on top of that to send real-time notifications. However, such schemes all act after the fact; they cannot predict which machines may need focused O&M attention.
Second, the usual O&M process consists only of handling faults and clearing them. It does not establish temporal or spatial relationships between cluster state and node state.
In addition, when a cluster is planned, its hardware configuration, number of nodes, traffic topology, compute load balancing and storage load balancing are all directly related to the nature and scale of the business the cluster carries. Yet cluster planning generally relies on the experience of whoever proposes the scheme, so neither qualitative analysis nor quantitative conclusions are possible.
Summary of the invention
The technical problem to be solved by the invention is to provide a fault early-warning method for distributed cluster devices that gives early warning of faults on the live network and provides a basis for operating and maintaining the devices, so that devices needing focused maintenance can be identified and device faults prevented.
To solve the above technical problem, the invention provides a fault early-warning method for distributed cluster devices, comprising:
acquiring instant information data of the cluster and its nodes, and storing the instant information data in a system database, where it supplements the historical data;
acquiring the required data from the system database according to the needs of an evaluation model to form a knowledge base, and periodically inputting the knowledge base into an implementation evaluation system;
outputting, by the implementation evaluation system, fault early-warning information according to the instant information data, the evaluation model and the knowledge base.
Preferably, acquiring the instant information data of the cluster and nodes comprises:
deploying network probes at network nodes to collect instant network-related data; deploying system probes on the system of each node to collect system information data; and deploying service probes on each service node to collect service data through the software interface of the service layer.
Preferably, the system information data comprises one or more of the following: CPU, memory, temperature and disk data.
Preferably, the method further comprises:
feeding the fault early-warning information back to the system database, where it supplements the fault sample data.
The invention uses the instant information of the cluster and nodes, combined with multi-dimensional data such as historical data and O&M conclusions, to give early warning of faults on the live network. This provides a basis for operating and maintaining the devices, so that devices needing focused maintenance can be identified and device faults prevented. The invention also uses historical data to establish associations between the cluster's hardware configuration, number of nodes, traffic topology, compute load balancing, storage load balancing and so on, and the expansion of the related business, which provides a design basis for cluster planning. When planning a cluster, one can look up the historical data, examine each node's failure record or load capacity, and plan accordingly.
Brief description of the drawings
Fig. 1 is a flow chart of a fault early-warning method for distributed cluster devices in an embodiment of the invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.
The main idea of the invention is as follows. First, data probe programs are deployed to collect the relevant data: 1) network probes are deployed at the relevant network nodes to collect instant network-related data; 2) system probes are deployed on the system of each node to collect information such as CPU, memory, temperature and disk data; 3) service probes are deployed on each service node to collect service data through the software interface of the service layer. A real-time collection module then stores this data in the system database.
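The collection stage can be sketched as a set of probes whose readings a collector writes into the system database. This is a minimal illustration, not the patent's implementation: the probe interface, the table `instant_data`, the metric names and the fixed readings are all hypothetical, and SQLite stands in for whatever database a real deployment would use.

```python
import sqlite3
import time
from typing import Callable, Dict

# A probe is modeled as a callable returning {metric name: value}.
# This interface is hypothetical; the patent does not define one.
Probe = Callable[[], Dict[str, float]]


def open_system_database(path: str = ":memory:") -> sqlite3.Connection:
    """Open the system database and create an (assumed) readings table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS instant_data "
        "(ts REAL, node TEXT, probe TEXT, metric TEXT, value REAL)"
    )
    return conn


def collect_once(conn: sqlite3.Connection, node: str,
                 probes: Dict[str, Probe]) -> int:
    """Run each probe for one node, store the readings, return the row count."""
    now = time.time()
    rows = [
        (now, node, probe_name, metric, value)
        for probe_name, probe in probes.items()
        for metric, value in probe().items()
    ]
    conn.executemany("INSERT INTO instant_data VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# Stand-in probes with fixed placeholder readings. Real probes would query
# the network, the operating system and the service-layer interface.
def network_probe() -> Dict[str, float]:
    return {"latency_ms": 1.8}


def system_probe() -> Dict[str, float]:
    return {"cpu_pct": 37.5, "mem_pct": 61.0, "temp_c": 48.0, "disk_pct": 72.0}


def service_probe() -> Dict[str, float]:
    return {"requests_per_s": 120.0}


PROBES = {"network": network_probe, "system": system_probe,
          "service": service_probe}

if __name__ == "__main__":
    db = open_system_database()
    stored = collect_once(db, "node-01", PROBES)
    print(f"stored {stored} readings")
```

Because every reading is appended with a timestamp rather than overwritten, the same table serves both as the source of instant data and as the growing historical record.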
Fig. 1 shows a flow chart of a fault early-warning method for distributed cluster devices in an embodiment of the invention. The method comprises:
101: acquiring instant information data of the cluster and its nodes, and storing the instant information data in a system database, where it supplements the historical data;
102: acquiring the required data from the system database according to the needs of an evaluation model to form a knowledge base, and periodically inputting the knowledge base into an implementation evaluation system;
103: outputting, by the implementation evaluation system, fault early-warning information according to the instant information data, the evaluation model and the knowledge base.
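Steps 102 and 103 can be illustrated with a deliberately simple evaluation model. The patent does not specify the model, so the sketch below substitutes an assumed one: the knowledge base holds the historical mean and standard deviation of each metric on each node, and a warning is issued when an instant reading lies more than `z_threshold` standard deviations from that mean. The table layout, function names and sample values are hypothetical.

```python
import sqlite3
from statistics import mean, pstdev
from typing import Dict, List, Tuple

Key = Tuple[str, str]                           # (node, metric)
KnowledgeBase = Dict[Key, Tuple[float, float]]  # key -> (mean, std dev)


def build_knowledge_base(conn: sqlite3.Connection) -> KnowledgeBase:
    """Step 102 (sketch): summarize historical readings per node and metric."""
    history: Dict[Key, List[float]] = {}
    query = "SELECT node, metric, value FROM instant_data"
    for node, metric, value in conn.execute(query):
        history.setdefault((node, metric), []).append(value)
    return {key: (mean(vals), pstdev(vals)) for key, vals in history.items()}


def evaluate(instant: Dict[Key, float], kb: KnowledgeBase,
             z_threshold: float = 3.0) -> List[str]:
    """Step 103 (sketch): warn when a reading is far from its own history."""
    warnings = []
    for key, value in instant.items():
        if key not in kb:
            continue  # no history yet for this node and metric
        mu, sigma = kb[key]
        if sigma > 0 and abs(value - mu) / sigma > z_threshold:
            node, metric = key
            warnings.append(f"{node}: {metric}={value} (historical mean {mu:.1f})")
    return warnings


def demo_database() -> sqlite3.Connection:
    """Step 101 stand-in: a database pre-filled with made-up history."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE instant_data (node TEXT, metric TEXT, value REAL)")
    rows = [("node-01", "temp_c", v) for v in (45, 46, 47, 46, 45)]
    rows += [("node-02", "temp_c", v) for v in (45, 46, 47)]
    conn.executemany("INSERT INTO instant_data VALUES (?, ?, ?)", rows)
    return conn


if __name__ == "__main__":
    kb = build_knowledge_base(demo_database())
    latest = {("node-01", "temp_c"): 60.0, ("node-02", "temp_c"): 46.0}
    for line in evaluate(latest, kb):
        print(line)
```

Comparing each node against its own history, rather than against a fixed global limit, is one way to reflect the patent's point that warnings should draw on accumulated data and not only on the current reading.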
In a preferred embodiment of the invention, acquiring the instant information data of the cluster and nodes comprises:
deploying network probes at network nodes to collect instant network-related data; deploying system probes on the system of each node to collect system information data; and deploying service probes on each service node to collect service data through the software interface of the service layer.
In a preferred embodiment of the invention, the system information data comprises one or more of the following: CPU, memory, temperature and disk data.
In a preferred embodiment of the invention, the method further comprises:
feeding the fault early-warning information back to the system database, where it supplements the fault sample data.
In the invention, a data-mining module mines the relevant knowledge base from the historical data according to the evaluation model and periodically inputs it into the implementation evaluation system. The implementation evaluation system then outputs the relevant fault early warnings according to the collected instant information and the evaluation model, in combination with the mined knowledge base. Finally, the results processed by the early-warning system are fed back to supplement the fault sample data. The whole system thus iterates on itself and gradually forms a stable evaluation network.
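The feedback step can be sketched as writing each warning back into the database as a fault sample, so that later knowledge-base builds can draw on it. The table `fault_samples`, the function names and the sample warnings below are hypothetical; how the samples are used afterwards (here, a per-node count that could help rank devices for focused maintenance) is an assumption, not something the patent prescribes.

```python
import sqlite3
import time
from typing import Dict, Iterable, Tuple

FaultWarning = Tuple[str, str, float]  # (node, metric, value)


def open_fault_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the system database and create an (assumed) fault sample table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fault_samples "
        "(ts REAL, node TEXT, metric TEXT, value REAL)"
    )
    return conn


def feed_back(conn: sqlite3.Connection,
              warnings: Iterable[FaultWarning]) -> int:
    """Store each warning as a fault sample; return how many were stored."""
    now = time.time()
    rows = [(now, node, metric, value) for node, metric, value in warnings]
    conn.executemany("INSERT INTO fault_samples VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def warning_counts(conn: sqlite3.Connection) -> Dict[str, int]:
    """Count stored fault samples per node."""
    query = "SELECT node, COUNT(*) FROM fault_samples GROUP BY node"
    return dict(conn.execute(query))


if __name__ == "__main__":
    db = open_fault_store()
    feed_back(db, [("node-01", "temp_c", 60.0),
                   ("node-01", "disk_pct", 97.0),
                   ("node-03", "cpu_pct", 99.0)])
    print(warning_counts(db))
```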
The embodiments above further describe the objectives, technical solutions and beneficial effects of the invention. It should be understood that they are only specific embodiments of the invention and are not intended to limit its scope of protection. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (4)

1. A fault early-warning method for distributed cluster devices, characterized by comprising:
acquiring instant information data of the cluster and its nodes, and storing the instant information data in a system database, where it supplements the historical data;
acquiring the required data from the system database according to the needs of an evaluation model to form a knowledge base, and periodically inputting the knowledge base into an implementation evaluation system;
outputting, by the implementation evaluation system, fault early-warning information according to the instant information data, the evaluation model and the knowledge base.
2. The method of claim 1, characterized in that acquiring the instant information data of the cluster and nodes comprises:
deploying network probes at network nodes to collect instant network-related data; deploying system probes on the system of each node to collect system information data; and deploying service probes on each service node to collect service data through the software interface of the service layer.
3. The method of claim 1, characterized in that the system information data comprises one or more of the following: CPU, memory, temperature and disk data.
4. The method of claim 1, characterized in that the method further comprises:
feeding the fault early-warning information back to the system database, where it supplements the fault sample data.
CN201510307233.0A 2015-06-08 2015-06-08 Method for warning faults of distributed cluster devices Pending CN104954181A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510307233.0A CN104954181A (en) 2015-06-08 2015-06-08 Method for warning faults of distributed cluster devices

Publications (1)

Publication Number Publication Date
CN104954181A true CN104954181A (en) 2015-09-30

Family

ID=54168556

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510307233.0A Pending CN104954181A (en) 2015-06-08 2015-06-08 Method for warning faults of distributed cluster devices

Country Status (1)

Country Link
CN (1) CN104954181A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105515667A (en) * 2015-12-11 2016-04-20 浪潮(北京)电子信息产业有限公司 High-availability computer system
CN107391335A (en) * 2016-03-31 2017-11-24 阿里巴巴集团控股有限公司 A kind of method and apparatus for checking cluster health status
CN108092794A (en) * 2017-11-08 2018-05-29 北京百悟科技有限公司 Network failure processing method and device
CN108875207A (en) * 2018-06-15 2018-11-23 岭东核电有限公司 A kind of nuclear reactor optimum design method and system
CN108965049A (en) * 2018-06-28 2018-12-07 深信服科技股份有限公司 Method, equipment, system and the storage medium of cluster exception solution are provided
CN110955550A (en) * 2019-11-24 2020-04-03 济南浪潮数据技术有限公司 Cloud platform fault positioning method, device, equipment and storage medium
CN112650660A (en) * 2020-12-28 2021-04-13 北京中大科慧科技发展有限公司 Early warning method and device for power system of data center

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102122374A (en) * 2011-03-03 2011-07-13 江苏方天电力技术有限公司 Intelligent analysis system for flow abnormity of power automation system
CN102663530A (en) * 2012-05-25 2012-09-12 中国南方电网有限责任公司超高压输电公司 Safety early warning and evaluating system for high-voltage direct current transmission system
CN104184819A (en) * 2014-08-29 2014-12-03 城云科技(杭州)有限公司 Multi-hierarchy load balancing cloud resource monitoring method

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105515667A (en) * 2015-12-11 2016-04-20 浪潮(北京)电子信息产业有限公司 High-availability computer system
CN107391335A (en) * 2016-03-31 2017-11-24 阿里巴巴集团控股有限公司 A kind of method and apparatus for checking cluster health status
CN107391335B (en) * 2016-03-31 2021-09-03 阿里巴巴集团控股有限公司 Method and equipment for checking health state of cluster
CN108092794A (en) * 2017-11-08 2018-05-29 北京百悟科技有限公司 Network failure processing method and device
CN108875207A (en) * 2018-06-15 2018-11-23 岭东核电有限公司 A kind of nuclear reactor optimum design method and system
CN108875207B (en) * 2018-06-15 2022-11-11 岭东核电有限公司 Nuclear reactor optimization design method and system
CN108965049A (en) * 2018-06-28 2018-12-07 深信服科技股份有限公司 Method, equipment, system and the storage medium of cluster exception solution are provided
CN108965049B (en) * 2018-06-28 2021-04-09 深信服科技股份有限公司 Method, device, system and storage medium for providing cluster exception solution
CN110955550A (en) * 2019-11-24 2020-04-03 济南浪潮数据技术有限公司 Cloud platform fault positioning method, device, equipment and storage medium
CN110955550B (en) * 2019-11-24 2022-07-08 济南浪潮数据技术有限公司 Cloud platform fault positioning method, device, equipment and storage medium
CN112650660A (en) * 2020-12-28 2021-04-13 北京中大科慧科技发展有限公司 Early warning method and device for power system of data center
CN112650660B (en) * 2020-12-28 2024-05-03 北京中大科慧科技发展有限公司 Early warning method and device for data center power system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150930