CN104954181A - Method for warning faults of distributed cluster devices - Google Patents
- Publication number
- CN104954181A (application CN201510307233.0A)
- Authority
- CN
- China
- Prior art keywords
- data
- node
- instant messages
- probe
- cluster
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0631—Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention provides a fault early-warning method for distributed cluster devices. The method comprises: acquiring instant information data of the cluster and its nodes, and storing the data in a system database, where it supplements the historical data; acquiring the required data from the system database according to the needs of an evaluation model to form a knowledge base, and periodically inputting the knowledge base into an implementation evaluation system; and outputting fault warning information from the implementation evaluation system according to the instant information data, the evaluation model, and the knowledge base. By combining instant information of the cluster and nodes with multi-dimensional data such as historical data and operation and maintenance conclusions, the method gives early warning of faults in the live network and provides a basis for operating and maintaining the devices, so that devices needing priority maintenance can be identified and device faults prevented.
Description
Technical field
The invention belongs to the field of distributed data processing and in particular relates to a fault early-warning method for distributed cluster devices.
Background technology
In recent years, as the theory of building clusters from inexpensive machines has matured, practical experience with the technology has steadily improved. However, because the approach rests precisely on scaling horizontally with inexpensive, general-purpose servers, an individual inexpensive server fails more often than a commercial-grade server. To meet the stability requirements of data and services, node redundancy is needed. Because such clusters are easy to build and relatively low in cost, the range of application of cloud platforms keeps expanding, and clusters readily contain tens or hundreds of servers. Large sites even reach a scale of more than a thousand.
According to probability theory, even for a small-probability event, the number of occurrences grows significantly as the number of trials grows by orders of magnitude. It can therefore be concluded that, within a given period of time, single-server failures are bound to occur in a large-scale cluster. As the number of failed machines keeps increasing, the load on the remaining machines keeps rising, which in turn raises the failure frequency of the remaining machines.
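The scaling argument can be illustrated numerically. The following is a minimal sketch, assuming that node failures are independent and that every node has the same failure probability within the period considered; neither assumption is stated in the source, and the function name and the sample probability of 0.001 are hypothetical.

```python
def prob_at_least_one_failure(p_node: float, n_nodes: int) -> float:
    """Probability that at least one of n independent nodes fails.

    Uses 1 - (1 - p)^n, where p is the (assumed identical) probability
    that a single node fails within the period considered.
    """
    if not 0.0 <= p_node <= 1.0:
        raise ValueError("p_node must lie in [0, 1]")
    if n_nodes < 0:
        raise ValueError("n_nodes must be non-negative")
    return 1.0 - (1.0 - p_node) ** n_nodes


# Illustrative per-node failure probability (an assumption, not source data).
for n in (10, 100, 1000):
    print(n, round(prob_at_least_one_failure(0.001, n), 4))
```

Under these assumptions, a per-node failure probability that is negligible for a ten-node cluster makes at least one failure more likely than not once the cluster reaches about a thousand nodes.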
To address these problems, dedicated operation and maintenance (O&M) personnel can be assigned to carry out regular inspections, or automatic monitoring scripts can be added on top of this to give real-time notification. However, both approaches act only after the fact and cannot predict which machines may need priority O&M.
Secondly, the general O&M process consists only of handling faults and reporting how they were handled. No temporal or spatial association is established between cluster state and node state.
In addition, when a cluster is planned, its hardware configuration, number of nodes, traffic topology, computing load balancing, and storage load balancing are all directly related to the nature and scale of the business the cluster carries. In general, however, cluster planning relies largely on the experience of whoever proposes the scheme, and neither qualitative analysis nor quantitative conclusions can be made.
Summary of the invention
The technical problem to be solved by the present invention is to provide a fault early-warning method for distributed cluster devices that gives early warning of faults in the live network and provides a basis for the operation and maintenance of devices, so that devices needing priority maintenance can be identified and device faults prevented.
To solve the above technical problem, the invention provides a fault early-warning method for distributed cluster devices, comprising:
Obtaining instant information data of the cluster and its nodes, and storing the instant information data in a system database as a supplement to the historical data;
Obtaining the required data from the system database according to the needs of an evaluation model to form a knowledge base, and periodically inputting the knowledge base into an implementation evaluation system;
Outputting, by the implementation evaluation system, fault early-warning information according to the instant information data, the evaluation model, and the knowledge base.
Preferably, obtaining the instant information data of the cluster and its nodes comprises:
Deploying network probes at network nodes to collect instant network-related data; deploying system probes on the system of each node to collect system information data; and deploying business probes on each service node to capture business data through the software interfaces of the business layer.
Preferably, the system information data comprises one or more of the following: CPU, memory, temperature, and disk data.
Preferably, the method further comprises:
Feeding the fault early-warning information back to the system database as a supplement to the fault sample data.
By combining instant information of the cluster and nodes with multi-dimensional data such as historical data and O&M conclusions, the invention gives early warning of faults in the live network and provides a basis for the operation and maintenance of devices, so that devices needing priority maintenance can be identified and device faults prevented. The invention also uses historical data to associate the cluster's hardware configuration, number of nodes, traffic topology, computing load balancing, storage load balancing, and so on with the growth of the related business, providing a design basis for cluster planning. When planning a cluster, the historical data can be searched to review the failure history or load capacity of each node, and the plan can be made according to that data.
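One way historical data might inform cluster planning is sketched below. The record layout (hardware configuration, node-days observed, failures observed), the function name, and the sample figures are all hypothetical; the source does not specify how historical data is organized or queried.

```python
from collections import defaultdict


def failure_rate_by_config(history):
    """Aggregate historical failures per hardware configuration.

    `history` is an iterable of (hardware_config, node_days, failures)
    tuples, a hypothetical record layout. Returns failures per node-day
    for every configuration that has a positive observation time.
    """
    totals = defaultdict(lambda: [0, 0])  # config -> [node_days, failures]
    for config, node_days, failures in history:
        totals[config][0] += node_days
        totals[config][1] += failures
    return {c: f / d for c, (d, f) in totals.items() if d > 0}


# Made-up historical records, for illustration only.
history = [("config-A", 100, 2), ("config-A", 100, 0), ("config-B", 50, 5)]
rates = failure_rate_by_config(history)
# Rank configurations from most to least reliable for planning purposes.
ranking = sorted(rates, key=rates.get)
```

A planner could compare such rates across candidate configurations rather than relying on experience alone.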
Brief description of the drawings
Fig. 1 is a flow chart of a fault early-warning method for distributed cluster devices in an embodiment of the present invention.
Embodiment
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
The main idea of the present invention is to first deploy data probe programs to collect the relevant data: 1) network probes are deployed at the relevant network nodes to collect instant network-related data; 2) system probes are deployed on the system of each node to collect information such as CPU, memory, temperature, and disk data; 3) business probes are deployed on each service node to capture business data through the software interfaces of the business layer. A real-time collection module stores the above data in the system database.
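The collection step could be sketched as follows. This is an illustration only: the probes are stubs returning fixed readings, an in-memory SQLite database stands in for the system database, and the table layout, metric names, and function names are assumptions that do not appear in the source.

```python
import sqlite3
import time
from typing import Callable, Dict

# A probe is modelled as a callable returning {metric_name: value}.
Probe = Callable[[], Dict[str, float]]


def make_database() -> sqlite3.Connection:
    """Create an in-memory stand-in for the system database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE samples "
        "(ts REAL, node TEXT, probe TEXT, metric TEXT, value REAL)"
    )
    return conn


def collect_once(conn, node: str, probes: Dict[str, Probe], now=None) -> int:
    """Run every probe once and store the readings; return the row count."""
    ts = time.time() if now is None else now
    rows = [
        (ts, node, kind, metric, float(value))
        for kind, probe in probes.items()
        for metric, value in probe().items()
    ]
    conn.executemany("INSERT INTO samples VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# Stub probes with fixed, made-up readings for the three probe types.
probes = {
    "network": lambda: {"rtt_ms": 1.8, "packet_loss": 0.0},
    "system": lambda: {"cpu": 0.42, "memory": 0.63,
                       "temperature_c": 55.0, "disk": 0.71},
    "business": lambda: {"requests_per_s": 120.0},
}
conn = make_database()
stored = collect_once(conn, "node-1", probes, now=0.0)
```

Each stored row then becomes part of the historical data once newer samples arrive.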
Fig. 1 shows a flow chart of a fault early-warning method for distributed cluster devices in an embodiment of the present invention. The method comprises:
101. Obtaining instant information data of the cluster and its nodes, and storing the instant information data in a system database as a supplement to the historical data;
102. Obtaining the required data from the system database according to the needs of an evaluation model to form a knowledge base, and periodically inputting the knowledge base into an implementation evaluation system;
103. Outputting, by the implementation evaluation system, fault early-warning information according to the instant information data, the evaluation model, and the knowledge base.
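Steps 102 and 103 might be sketched as follows. The source does not specify the evaluation model, so a simple z-score test is swapped in purely for illustration: the knowledge base is taken to be the per-node, per-metric mean and standard deviation of the historical data, and a warning is raised when an instant reading deviates by at least a chosen number of standard deviations. All names, the threshold, and the sample readings are hypothetical.

```python
from statistics import mean, pstdev


def build_knowledge_base(history):
    """Step 102 (illustrative): summarize history per (node, metric).

    `history` maps (node, metric) to a list of past values. Series with
    fewer than two values are skipped because they carry no spread.
    """
    return {k: (mean(v), pstdev(v)) for k, v in history.items() if len(v) >= 2}


def evaluate(instant, knowledge_base, z_threshold=3.0):
    """Step 103 (illustrative): compare instant readings with the knowledge base."""
    warnings = []
    for key, value in instant.items():
        if key not in knowledge_base:
            continue  # no history for this node/metric yet
        mu, sigma = knowledge_base[key]
        if sigma == 0:
            continue  # constant history: a z-score is undefined
        z = (value - mu) / sigma
        if abs(z) >= z_threshold:
            warnings.append(
                {"node": key[0], "metric": key[1], "value": value, "z": z}
            )
    return warnings


# Made-up temperature history and instant readings for two nodes.
history = {
    ("node-1", "temperature_c"): [50, 52, 51, 49, 53],
    ("node-2", "temperature_c"): [50, 51, 50, 49, 50],
}
instant = {("node-1", "temperature_c"): 70, ("node-2", "temperature_c"): 50.5}
kb = build_knowledge_base(history)
warnings = evaluate(instant, kb)
```

In this sketch only node-1 is flagged, since its reading lies far outside its own historical range, while node-2 stays within normal variation.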
In a preferred embodiment of the invention, obtaining the instant information data of the cluster and its nodes comprises:
Deploying network probes at network nodes to collect instant network-related data; deploying system probes on the system of each node to collect system information data; and deploying business probes on each service node to capture business data through the software interfaces of the business layer.
In a preferred embodiment of the invention, the system information data comprises one or more of the following: CPU, memory, temperature, and disk data.
In a preferred embodiment of the invention, the method further comprises:
Feeding the fault early-warning information back to the system database as a supplement to the fault sample data.
According to the evaluation model, the data-mining module mines the relevant knowledge base from the historical data and periodically inputs it into the implementation evaluation system. The implementation evaluation system then outputs the relevant fault early warnings according to the real-time information collected and the evaluation model, in combination with the mined knowledge base. Finally, the results processed by the early-warning system are fed back as a supplement to the fault sample data. The whole system thus iterates on itself and gradually forms a stable evaluation network.
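The feedback loop could be sketched as below. This is a deliberately simplified illustration: a fixed threshold stands in for the evaluation model, an in-memory list stands in for the fault sample data in the system database, and the class and method names are hypothetical rather than taken from the source.

```python
from collections import Counter


class EarlyWarningLoop:
    """Toy loop in which every warning is fed back as a fault sample."""

    def __init__(self, threshold: float):
        self.threshold = threshold
        self.fault_samples = []  # stand-in for fault samples in the database

    def step(self, readings):
        """Evaluate one round of {node: value} readings and feed back warnings."""
        warnings = [
            {"node": node, "value": value}
            for node, value in sorted(readings.items())
            if value >= self.threshold
        ]
        self.fault_samples.extend(warnings)  # feedback into the sample set
        return warnings

    def priority_nodes(self):
        """Nodes ordered by how often they have appeared in fault samples."""
        counts = Counter(w["node"] for w in self.fault_samples)
        return [node for node, _ in counts.most_common()]


loop = EarlyWarningLoop(threshold=80.0)
loop.step({"node-a": 85.0, "node-b": 60.0})
loop.step({"node-a": 90.0, "node-b": 82.0})
ranked = loop.priority_nodes()
```

As fault samples accumulate across rounds, the ranking of nodes needing priority maintenance becomes better supported by data.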
The embodiments above further describe the objectives, technical solutions, and beneficial effects of the present invention. It should be understood that the foregoing are only specific embodiments of the present invention and are not intended to limit its scope of protection; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (4)
1. A fault early-warning method for distributed cluster devices, characterized by comprising:
Obtaining instant information data of the cluster and its nodes, and storing the instant information data in a system database as a supplement to the historical data;
Obtaining the required data from the system database according to the needs of an evaluation model to form a knowledge base, and periodically inputting the knowledge base into an implementation evaluation system;
Outputting, by the implementation evaluation system, fault early-warning information according to the instant information data, the evaluation model, and the knowledge base.
2. The method of claim 1, characterized in that obtaining the instant information data of the cluster and its nodes comprises:
Deploying network probes at network nodes to collect instant network-related data; deploying system probes on the system of each node to collect system information data; and deploying business probes on each service node to capture business data through the software interfaces of the business layer.
3. The method of claim 1, characterized in that the system information data comprises one or more of the following: CPU, memory, temperature, and disk data.
4. The method of claim 1, characterized in that the method further comprises:
Feeding the fault early-warning information back to the system database as a supplement to the fault sample data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510307233.0A CN104954181A (en) | 2015-06-08 | 2015-06-08 | Method for warning faults of distributed cluster devices |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510307233.0A CN104954181A (en) | 2015-06-08 | 2015-06-08 | Method for warning faults of distributed cluster devices |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104954181A true CN104954181A (en) | 2015-09-30 |
Family
ID=54168556
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510307233.0A Pending CN104954181A (en) | 2015-06-08 | 2015-06-08 | Method for warning faults of distributed cluster devices |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104954181A (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105515667A (en) * | 2015-12-11 | 2016-04-20 | 浪潮(北京)电子信息产业有限公司 | High-availability computer system |
CN107391335A (en) * | 2016-03-31 | 2017-11-24 | 阿里巴巴集团控股有限公司 | A kind of method and apparatus for checking cluster health status |
CN108092794A (en) * | 2017-11-08 | 2018-05-29 | 北京百悟科技有限公司 | Network failure processing method and device |
CN108875207A (en) * | 2018-06-15 | 2018-11-23 | 岭东核电有限公司 | A kind of nuclear reactor optimum design method and system |
CN108965049A (en) * | 2018-06-28 | 2018-12-07 | 深信服科技股份有限公司 | Method, equipment, system and the storage medium of cluster exception solution are provided |
CN110955550A (en) * | 2019-11-24 | 2020-04-03 | 济南浪潮数据技术有限公司 | Cloud platform fault positioning method, device, equipment and storage medium |
CN112650660A (en) * | 2020-12-28 | 2021-04-13 | 北京中大科慧科技发展有限公司 | Early warning method and device for power system of data center |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102122374A (en) * | 2011-03-03 | 2011-07-13 | 江苏方天电力技术有限公司 | Intelligent analysis system for flow abnormity of power automation system |
CN102663530A (en) * | 2012-05-25 | 2012-09-12 | 中国南方电网有限责任公司超高压输电公司 | Safety early warning and evaluating system for high-voltage direct current transmission system |
CN104184819A (en) * | 2014-08-29 | 2014-12-03 | 城云科技(杭州)有限公司 | Multi-hierarchy load balancing cloud resource monitoring method |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105515667A (en) * | 2015-12-11 | 2016-04-20 | 浪潮(北京)电子信息产业有限公司 | High-availability computer system |
CN107391335A (en) * | 2016-03-31 | 2017-11-24 | 阿里巴巴集团控股有限公司 | A kind of method and apparatus for checking cluster health status |
CN107391335B (en) * | 2016-03-31 | 2021-09-03 | 阿里巴巴集团控股有限公司 | Method and equipment for checking health state of cluster |
CN108092794A (en) * | 2017-11-08 | 2018-05-29 | 北京百悟科技有限公司 | Network failure processing method and device |
CN108875207A (en) * | 2018-06-15 | 2018-11-23 | 岭东核电有限公司 | A kind of nuclear reactor optimum design method and system |
CN108875207B (en) * | 2018-06-15 | 2022-11-11 | 岭东核电有限公司 | Nuclear reactor optimization design method and system |
CN108965049A (en) * | 2018-06-28 | 2018-12-07 | 深信服科技股份有限公司 | Method, equipment, system and the storage medium of cluster exception solution are provided |
CN108965049B (en) * | 2018-06-28 | 2021-04-09 | 深信服科技股份有限公司 | Method, device, system and storage medium for providing cluster exception solution |
CN110955550A (en) * | 2019-11-24 | 2020-04-03 | 济南浪潮数据技术有限公司 | Cloud platform fault positioning method, device, equipment and storage medium |
CN110955550B (en) * | 2019-11-24 | 2022-07-08 | 济南浪潮数据技术有限公司 | Cloud platform fault positioning method, device, equipment and storage medium |
CN112650660A (en) * | 2020-12-28 | 2021-04-13 | 北京中大科慧科技发展有限公司 | Early warning method and device for power system of data center |
CN112650660B (en) * | 2020-12-28 | 2024-05-03 | 北京中大科慧科技发展有限公司 | Early warning method and device for data center power system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104954181A (en) | Method for warning faults of distributed cluster devices | |
CN107733986B (en) | Protection operation big data supporting platform supporting integrated deployment and monitoring | |
CN104281130B (en) | Hydroelectric equipment monitoring and fault diagnosis system based on big data technology | |
CN105095056B (en) | A kind of method of data warehouse data monitoring | |
Yan et al. | A fog computing solution for advanced metering infrastructure | |
CN103337012B (en) | Towards the multi-threaded intelligent comprehensive alert analysis method of grid equipment monitoring | |
DE102016119100A1 (en) | Data analysis services for distributed performance monitoring of industrial installations | |
DE102016119066A1 (en) | Distributed performance monitoring and analysis platform for industrial plants | |
CN110806743A (en) | Equipment fault detection and early warning system and method based on artificial intelligence | |
DE102016119084A9 (en) | Distributed performance monitoring and analysis of industrial plants | |
CN105183609A (en) | Real-time monitoring system and method applied to software system | |
CN102857371B (en) | A kind of dynamic allocation management method towards group system | |
CN106210124B (en) | A kind of unified cloud data center monitoring system | |
CN110990391A (en) | Integration method and system of multi-source heterogeneous data, computer equipment and storage medium | |
CN102428447A (en) | Method, device and system for displaying analysis result of essential cause analysis of failure | |
CN105653322B (en) | The processing method of O&M server and server event | |
CN104637265A (en) | Dispatch-automated multilevel integration intelligent watching alarming system | |
JP2013088828A (en) | Facility periodic inspection support system using risk assessment | |
JP6530252B2 (en) | Resource management system and resource management method | |
CN106817253A (en) | The monitor in real time of journal file and the method and system of alarm | |
CN103325019A (en) | Event-driven power grid fault information judgment method | |
CN110956282A (en) | Power distribution automation defect management system and method | |
CN107658980A (en) | A kind of analysis method and system for being used to check power system monitor warning information | |
CN111159152B (en) | Secondary operation and data fusion method based on big data processing technology | |
CN103048054A (en) | Data center temperature processing method based on high-density temperature acquisition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20150930 |