CN107391335A - A kind of method and apparatus for checking cluster health status - Google Patents
A kind of method and apparatus for checking cluster health status
- Publication number
- CN107391335A (application CN201710205541.1A)
- Authority
- CN
- China
- Prior art keywords
- checkpoint
- cluster
- occurrence
- updated
- monitoring data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3003—Monitoring arrangements specially adapted to the computing system or computing system component being monitored
- G06F11/3006—Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0677—Localisation of faults
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/14—Network analysis or design
- H04L41/147—Network analysis or design for predicting network behaviour
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Computing Systems (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Mathematical Physics (AREA)
- Quality & Reliability (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Debugging And Monitoring (AREA)
Abstract
The purpose of the present application is to provide a method and apparatus for checking cluster health status. The method obtains relevant information about a cluster to be checked; obtains at least one problem to be checked and its corresponding inspection rule; based on the relevant information of the cluster, obtains from the cluster the monitoring data of the checkpoints related to the inspection rule and aggregates that monitoring data to obtain a processing result; and, based on the processing result, retrieves the corresponding problem and generates and feeds back health early-warning information based on the relevant information of the problem. This monitors the health status of the multiple checkpoints associated with an occurring problem, improves the accuracy of health-status prediction for each checkpoint corresponding to a problem in the cluster, improves the real-time performance of multi-checkpoint monitoring of an online distributed file system, and achieves the purpose of raising alarms for multiple checkpoints in advance.
Description
Technical field
The present application relates to the field of computers, and in particular to a technique for checking cluster health status.
Background art
In distributed-cluster alarm systems, as the mass of data from user devices explodes, the scale of distributed file systems keeps growing. With the aging of the clusters hosting a distributed file system and the continuous growth of the business they serve, problems of all kinds emerge endlessly, and a single-point problem on an individual server in a cluster node can easily accumulate into a major failure. When a problem does break out, raising an alarm through the alarm-system platform to wake maintenance personnel, who then investigate and work out a fix, may miss the best window for solving the problem and thereby trigger a failure.
In the prior art, a distributed-cluster alarm system raises single-point alarms separately for the hardware (for example memory or hard disks), the local modules of software entities, and the operating system of each individual service device under each cluster node. An alarm is raised when a single point goes wrong, and large numbers of simple abnormal-alarm messages collected from the service devices are forwarded to maintenance personnel in one batch. Because the prior-art system alarms only after a single point has gone wrong, a loosely set alarm threshold may allow a failure to be triggered, while a strictly set threshold produces large numbers of false alarms. Moreover, because the prior-art system alarms mainly on single points of service-device hardware and the operating system, it makes no judgment about the availability, performance, or service quality of the distributed file system itself; the alarms on the distributed file system as a whole are therefore one-sided and their accuracy is low. Finally, because the prior-art system simply collects abnormal-alarm information in bulk and forwards it to maintenance personnel in one batch, leaving them to investigate and solve the problems themselves, alarm accuracy is low and real-time performance is poor.

Therefore, in the prior art, using a distributed-cluster alarm system to raise single-point alarms on problems of the hardware and operating system of individual service devices under each cluster node of a distributed file system results in low alarm accuracy and poor real-time performance.
Summary of the invention
The purpose of the present application is to provide a method and apparatus for checking cluster health status, so as to solve the prior-art problem that using a distributed-cluster alarm system to raise single-point alarms on problems of the hardware and operating system of individual service devices under each cluster node of a distributed file system results in low alarm accuracy and poor real-time performance.
According to one aspect of the present application, a method for checking cluster health status is provided, comprising:

obtaining relevant information about a cluster to be checked;

obtaining at least one problem to be checked and its corresponding inspection rule;

based on the relevant information of the cluster, obtaining from the cluster the monitoring data of the checkpoints related to the inspection rule, and aggregating the monitoring data to obtain a processing result;

retrieving the corresponding problem based on the processing result, and generating and feeding back health early-warning information based on the relevant information of the problem.
Further, aggregating the monitoring data to obtain a processing result comprises: processing the monitoring data of each checkpoint separately, based on the inspection rule corresponding to the problem to be checked, to obtain at least one checkpoint whose monitoring data is abnormal, and feeding back the processing result.
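The aggregation step above can be sketched as a per-checkpoint threshold comparison. This is a minimal illustration only; the function and checkpoint names are assumptions, and the patent does not fix how an inspection rule is represented.

```python
# Minimal sketch of the "aggregate and compare" step: each checkpoint's
# monitoring datum is checked against that checkpoint's anomaly threshold.
def aggregate(monitoring_data, rule):
    """Return the checkpoints whose monitoring data exceeds the threshold."""
    abnormal = []
    for checkpoint, threshold in rule.items():
        value = monitoring_data.get(checkpoint)
        if value is not None and value > threshold:
            abnormal.append(checkpoint)
    return abnormal

rule = {"created_file_total": 30, "memory_growth_slope": 0.20}
data = {"created_file_total": 34, "memory_growth_slope": 0.48}
print(aggregate(data, rule))  # both checkpoints exceed their thresholds
```

The processing result fed back is then simply the list of abnormal checkpoints for each problem's rule.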
According to one aspect of the present application, the method for checking cluster health status further comprises:

creating a problem rule base, the problem rule base comprising at least one problem and its corresponding inspection rule;

updating the problems in the problem rule base and their corresponding inspection rules.
Further, updating the problems in the problem rule base and their corresponding inspection rules comprises:

obtaining relevant information about a cluster to be checked, a problem to be updated, and its initial monitoring threshold;

based on the initial monitoring threshold, obtaining from the relevant information of the cluster the monitoring data of all checkpoints within a set time period before the occurrence time point of the problem to be updated and at that occurrence time point, and determining and recording the abnormal checkpoints based on the monitoring data;

each time the problem to be updated occurs within a set time period, updating the occurrence probability of each checkpoint when the problem to be updated occurs, based on the abnormal checkpoints recorded in the current set time period and the abnormal checkpoints recorded historically;

updating the inspection rule of the problem to be updated based on the checkpoints whose updated occurrence probability is higher than a set probability, and their relevant information.
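The probability update above can be sketched as follows. The counting scheme (times a checkpoint was abnormal, divided by times the problem occurred) and all names are illustrative assumptions; the patent does not prescribe a specific formula.

```python
# Sketch of the occurrence-probability update: record which checkpoints were
# abnormal at each occurrence of the problem, then keep in the rule only the
# checkpoints whose abnormality rate exceeds the set probability.
from collections import Counter

def update_probabilities(history, current_abnormal):
    """history: one set of abnormal checkpoints per past occurrence."""
    history = history + [set(current_abnormal)]
    counts = Counter(cp for occurrence in history for cp in occurrence)
    total = len(history)
    return {cp: counts[cp] / total for cp in counts}, history

def rebuild_rule(probabilities, thresholds, set_probability=0.5):
    """Rebuild the inspection rule from frequently-abnormal checkpoints."""
    return {cp: thresholds[cp]
            for cp, p in probabilities.items() if p > set_probability}

probs, hist = update_probabilities(
    [{"memory_growth_slope"}, {"memory_growth_slope", "created_file_total"}],
    {"memory_growth_slope"})
print(rebuild_rule(probs, {"memory_growth_slope": 0.2,
                           "created_file_total": 30}))
```

Here `memory_growth_slope` was abnormal at all three occurrences (probability 1.0) and survives into the updated rule, while `created_file_total` (probability 1/3) is dropped.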
According to another aspect of the present application, an apparatus for checking cluster health status is also provided, comprising:

an information acquiring device, for obtaining relevant information about a cluster to be checked;

a rule device, for obtaining at least one problem to be checked and its corresponding inspection rule;

a monitoring processing device, for obtaining from the cluster, based on the relevant information of the cluster, the monitoring data of the checkpoints related to the inspection rule, and aggregating the monitoring data to obtain a processing result;

an early-warning feedback device, for retrieving the corresponding problem based on the processing result, and generating and feeding back health early-warning information based on the relevant information of the problem.
Further, the monitoring processing device comprises: a data processing unit, for processing the monitoring data of each checkpoint separately based on the inspection rule corresponding to the problem to be checked, to obtain at least one checkpoint whose monitoring data is abnormal and to feed back the processing result.
According to one aspect of the present application, the apparatus for checking cluster health status further comprises:

a rule creating device, for creating a problem rule base, the problem rule base comprising at least one problem and its corresponding inspection rule;

a rule updating device, for updating the problems in the problem rule base and their corresponding inspection rules.
Further, the rule updating device comprises:

a first information acquiring unit, for obtaining relevant information about a cluster to be checked, a problem to be updated, and its initial monitoring threshold;

a first recording unit, for obtaining from the relevant information of the cluster, based on the initial monitoring threshold, the monitoring data of all checkpoints within a set time period before the occurrence time point of the problem to be updated and at that occurrence time point, and determining and recording the abnormal checkpoints based on the monitoring data;

a first probability updating unit, for updating, each time the problem to be updated occurs within a set time period, the occurrence probability of each checkpoint when the problem to be updated occurs, based on the abnormal checkpoints recorded in the current set time period and the abnormal checkpoints recorded historically;

a first rule updating unit, for updating the inspection rule of the problem to be updated based on the checkpoints whose updated occurrence probability is higher than a set probability, and their relevant information.
In addition, the present application also provides an apparatus for checking cluster health status, comprising:

a processor; and

a memory arranged to store computer-executable instructions that, when executed, cause the processor to:

obtain relevant information about a cluster to be checked;

obtain at least one problem to be checked and its corresponding inspection rule;

based on the relevant information of the cluster, obtain from the cluster the monitoring data of the checkpoints related to the inspection rule, and aggregate the monitoring data to obtain a processing result;

retrieve the corresponding problem based on the processing result, and generate and feed back health early-warning information based on the relevant information of the problem.
Compared with the prior art, the method and apparatus for checking cluster health status provided by the embodiments of the present application obtain relevant information about a cluster to be checked; obtain at least one problem to be checked and its corresponding inspection rule; based on the relevant information of the cluster, obtain from the cluster the monitoring data of the checkpoints related to the inspection rule and aggregate the monitoring data to obtain a processing result; and retrieve the corresponding problem based on the processing result and generate and feed back health early-warning information based on the relevant information of the problem. Before the health status of the online distributed file system is predicted, as many as possible of the problems that may appear as anomalies on the line are formalized into rules, so that the inspection rule corresponding to each problem to be checked is available in advance. When the health status of the online distributed file system is then predicted, the monitoring data corresponding to each checkpoint can be obtained directly and aggregated against the checkpoints' inspection rules to obtain the processing result, which improves the accuracy of health-status monitoring of the multiple checkpoints under each cluster node. The corresponding problem is retrieved based on the processing result, and health early-warning information is generated and fed back based on the relevant information of the problem, so that maintenance personnel can, based on the fed-back health early-warning information, warn in advance about each problematic checkpoint under each cluster node and handle the relevant health warning. This improves the real-time performance of multi-checkpoint monitoring of the online distributed file system and achieves the purpose of multi-point alarming in advance. Further, aggregating the monitoring data to obtain a processing result comprises: processing the monitoring data of each checkpoint separately, based on the inspection rule corresponding to the problem to be checked, to obtain at least one checkpoint whose monitoring data is abnormal and to feed back the processing result; this monitors the health status of the multiple checkpoints associated with an occurring problem and improves the accuracy of health-status prediction for each checkpoint corresponding to a problem in the cluster.
Further, the method and apparatus for checking cluster health status provided by the embodiments of the present application also create a problem rule base comprising at least one problem and its corresponding inspection rule, and update the problems in the problem rule base and their corresponding inspection rules. This ensures that inspection rules are created for each checkpoint in the online distributed file system that is most likely to develop a problem, and that the problems in the rule base and their corresponding inspection rules are updated based on the monitoring data of each checkpoint, so that the created rule base reflects the abnormal checkpoints in the distributed file system more comprehensively and more accurately, monitors the health status of the multiple checkpoints associated with an occurring problem, and improves the accuracy and real-time performance of health-status prediction for each checkpoint corresponding to a problem in the cluster.
Further, updating the problems in the problem rule base and their corresponding inspection rules comprises: obtaining relevant information about a cluster to be checked, a problem to be updated, and its initial monitoring threshold; based on the initial monitoring threshold, obtaining from the relevant information of the cluster the monitoring data of all checkpoints within a set time period before the occurrence time point of the problem to be updated and at that occurrence time point, and determining and recording the abnormal checkpoints based on the monitoring data; each time the problem to be updated occurs within a set time period, updating the occurrence probability of each checkpoint when the problem to be updated occurs, based on the abnormal checkpoints recorded in the current set time period and the abnormal checkpoints recorded historically; and updating the inspection rule of the problem to be updated based on the checkpoints whose updated occurrence probability is higher than a set probability, and their relevant information. In this way, a prediction is made from the monitoring data of all checkpoints in the set time period before the occurrence time point of the problem to be updated, based on the initial monitoring threshold; each time the problem occurs within a set time period, the occurrence probability of each checkpoint is updated from the abnormal checkpoints recorded currently and historically; and the inspection rule of the problem to be updated is rebuilt from the checkpoints whose updated occurrence probability exceeds the set probability. Updating the inspection rules of the problems to be updated thus updates the problem rule base itself, so that the rule base reflects the abnormal checkpoints in the distributed file system more comprehensively and more accurately, monitors the health status of the multiple checkpoints associated with an occurring problem, and improves the accuracy and real-time performance of health-status prediction for each checkpoint corresponding to a problem in the cluster.
Brief description of the drawings
Other features, objects, and advantages of the present application will become more apparent upon reading the following detailed description of non-limiting embodiments, made with reference to the following drawings:

Fig. 1 shows a schematic flow chart of a method for checking cluster health status according to one aspect of the present application;

Fig. 2 shows a schematic flow chart of creating a problem rule base in a method for checking cluster health status according to another aspect of the present application;

Fig. 3 shows a schematic flow chart of the rule-base creation step S16 in a method for checking cluster health status provided in one embodiment of the present application;

Fig. 4 shows a schematic flow chart of the rule-base creation step S16 in a method for checking cluster health status provided in another embodiment of the present application;

Fig. 5 shows a schematic structural diagram of an apparatus for checking cluster health status according to one aspect of the present application;

Fig. 6 shows a schematic structural diagram of the components for creating a problem rule base in an apparatus for checking cluster health status according to another aspect of the present application;

Fig. 7 shows a schematic structural diagram of the rule updating device 16 in an apparatus for checking cluster health status provided in one embodiment of the present application;

Fig. 8 shows a schematic structural diagram of the rule updating device 16 in an apparatus for checking cluster health status provided in another embodiment of the present application.

The same or similar reference signs in the drawings denote the same or similar parts.
Detailed description of embodiments
The present application is described in further detail below with reference to the accompanying drawings.
Fig. 1 shows a schematic flow chart of a method for checking cluster health status according to one aspect of the present application. The method comprises step S11, step S12, step S13, and step S14.

In step S11, relevant information about a cluster to be checked is obtained. In step S12, at least one problem to be checked and its corresponding inspection rule are obtained. In step S13, based on the relevant information of the cluster, the monitoring data of the checkpoints related to the inspection rule is obtained from the cluster, and the monitoring data is aggregated to obtain a processing result. In step S14, the corresponding problem is retrieved based on the processing result, and health early-warning information is generated and fed back based on the relevant information of the problem.
In the embodiments of the present application, the cluster to be checked in step S11 is one or more cluster nodes of a distributed file system, where a distributed file system is one in which the physical storage resources managed by the file system are not necessarily attached to the local node but are connected to the nodes through a computer network. The specific embodiments of the present application are explained in detail below taking a distributed file system as an example. Of course, the distributed file system serves only as an example; the embodiments of the present application are not limited to it, and the following embodiments can equally be realized in other distributed cluster systems.
Further, a checkpoint includes at least any one of the following: a hardware device in the cluster, or a local module of a software entity in the cluster.

It should be noted that the checkpoints in step S13 may include, but are not limited to, the hardware devices of the individual servers under each cluster node of the distributed file system and the local modules of the software entities in the distributed file system. The hardware devices of a server include the central processing unit, memory, hard disks, chipset, input/output buses, input/output devices, power supply, enclosure, and so on; the local modules of the software entities include, but are not limited to, system-setting program modules, fault-diagnosis program modules, exception-handling program modules, and so on. Of course, other checkpoints, existing now or appearing in the future, that are applicable to the present application should also be included within the protection scope of the present application and are hereby incorporated herein by reference.
Further, step S11 comprises obtaining relevant information about a cluster to be checked. Specifically, step S11 comprises: obtaining the relevant information of the cluster to be checked based on a request submitted by a user, wherein the relevant information includes the cluster position information and the review time period.

In the embodiments of the present application, when the health status of an online distributed file system needs to be monitored, a request submitted by a user is obtained, and the cluster position information of the cluster to be checked and the review time period over which to monitor it are obtained based on that request; the cluster position information and the review time period both belong to the relevant information of the cluster to be checked.

For example, when the health status of an online distributed file system needs to be monitored, a request submitted by a user to monitor each checkpoint in the cluster is obtained; based on the request, the cluster position information of the cluster to be checked and one or more review time periods over which to obtain the monitoring data of the multiple checkpoints are determined. The cluster position information may be the actual geographic ranges of cluster nodes distributed across different regions, or the actual geographic range of cluster nodes within a single region.
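The relevant information carried by such a request can be sketched as a small record. The field names and the example period are illustrative assumptions, not the patent's own identifiers.

```python
# Illustrative sketch of the "relevant information" in a user's check
# request: a cluster position plus one or more review time periods.
from dataclasses import dataclass

@dataclass
class CheckRequest:
    cluster_position: str      # e.g. a region such as "Shanghai"
    review_periods: list       # one or more review time periods

req = CheckRequest(cluster_position="Shanghai",
                   review_periods=["2017-03-20/2017-03-27"])
print(req.cluster_position)
```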
Further, step S12 comprises obtaining at least one problem to be checked and its corresponding inspection rule. Specifically, step S12 comprises: obtaining the at least one problem to be checked and its corresponding inspection rule from a problem rule base.

It should be noted that the problem rule base in step S12 mainly contains the problems that have already been established and the multiple inspection rules corresponding to them. The problems include memory leaks, read-write long tails, data loss, system performance problems, system availability problems, service quality problems, and so on; an inspection rule comprises checkpoints and the anomaly thresholds of their corresponding monitoring data. Of course, other problem rule bases, existing now or appearing in the future, that are applicable to the present application should also be included within the protection scope of the present application and are hereby incorporated herein by reference.

For example, if the problem present in the problem rule base is a memory leak, its corresponding inspection rule includes: the checkpoint "rate of change of traffic pressure over the last week" and its anomaly threshold, the checkpoint "total number of files created" and its anomaly threshold, and the checkpoint "memory-usage growth slope" and its anomaly threshold. If the problem present in the problem rule base is a read-write long tail, its corresponding inspection rule includes: the checkpoint "read-write call frequency over the last week" and its anomaly threshold, the checkpoint "retransmission rate of the network in the cluster" and its anomaly threshold, and the checkpoint "disk health score in the cluster" and its anomaly threshold.
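A rule base of this shape can be sketched as a mapping from problem name to inspection rule. The dictionary layout, identifiers, and threshold values are illustrative assumptions (the thresholds echo the figures used in the worked example later in this description).

```python
# Minimal sketch of a problem rule base: each problem maps to an inspection
# rule, which maps checkpoint names to anomaly thresholds.
PROBLEM_RULE_BASE = {
    "memory_leak": {
        "traffic_pressure_change_rate_1w": 0.14,
        "created_file_total": 30,
        "memory_growth_slope": 0.20,
    },
    "read_write_long_tail": {
        "rw_call_frequency_1w": 0.30,
        "network_retransmission_rate": 0.10,
        "disk_health_score": 60,
    },
}

def inspection_rule(problem):
    """Step S12: fetch a problem's inspection rule from the rule base."""
    return PROBLEM_RULE_BASE[problem]

print(sorted(inspection_rule("memory_leak")))
```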
Further, obtaining from the cluster the monitoring data of the checkpoints related to the inspection rule in step S13 comprises: finding the cluster based on the cluster position information, and obtaining in the cluster the checkpoints related to the inspection rule; and obtaining, from the monitoring module of the cluster, the monitoring data of the related checkpoints within the review time period.

It should be noted that the monitoring module of the cluster is mainly responsible for collecting, from the monitoring system in the cluster, the monitoring data of each checkpoint related to each hardware device and software entity. Of course, other monitoring modules of the cluster, existing now or appearing in the future, that are applicable to the present application should also be included within the protection scope of the present application and are hereby incorporated herein by reference.
In the above embodiment of the present application, if the cluster position information in step S13 is the geographic position of Shanghai, the Shanghai cluster is found based on that information, and each checkpoint related to the inspection rules is obtained from the Shanghai cluster. The monitoring data of each related checkpoint within the review time period is then obtained from the monitoring module of the Shanghai cluster. In the data obtained, the checkpoint "total number of files created" reads 34, the checkpoint "memory-usage growth slope" reads 48%, the checkpoint "rate of change of traffic pressure over the last week" reads 1%, the checkpoint "read-write call frequency over the last week" reads 75.6%, the checkpoint "retransmission rate of the network in the cluster" reads 5.3%, and the checkpoint "disk health score in the cluster" reads 15.
Further, aggregating the monitoring data to obtain a processing result in step S13 comprises: processing the monitoring data of each checkpoint separately, based on the inspection rule corresponding to the problem to be checked, to obtain at least one checkpoint whose monitoring data is abnormal, and feeding back the processing result.

In the above embodiment of the present application, in step S13 the monitoring data of the multiple checkpoints can be compared, based on the inspection rule corresponding to the problem to be checked, to judge whether the problem to be checked exists. To predict whether the online distributed file system has a memory-leak problem, the monitoring data of the three checkpoints "rate of change of traffic pressure over the last week", "total number of files created", and "memory-usage growth slope" can each be matched against the corresponding inspection rule to obtain the processing result for the prediction. To predict whether the online distributed file system has a read-write long-tail problem, the monitoring data of the three checkpoints "read-write call frequency over the last week", "retransmission rate of the network in the cluster", and "disk health score in the cluster" can each be matched against the corresponding inspection rule to obtain the processing result for the prediction.
For example, suppose that in the check rule for the problem to be checked, a memory leak, the outlier threshold of the checkpoint "total number of created files" is 30. Because the monitoring data of that checkpoint is 34, which exceeds the outlier threshold of 30, that checkpoint is abnormal. The outlier threshold of the checkpoint "memory-usage growth slope" is 20%; because the monitoring data of that checkpoint is 48%, which exceeds the outlier threshold of 20%, that checkpoint is abnormal. The outlier threshold of the checkpoint "change rate of the past week's traffic pressure" is 14%; because the monitoring data of that checkpoint is 1%, which is below the outlier threshold of 14%, that checkpoint is normal. Suppose further that in the check rule for the problem to be checked named read-write long-tail, the outlier threshold of the checkpoint "read-write call frequency of the past week" is 30%; because the monitoring data of that checkpoint is 75.6%, which exceeds the outlier threshold of 30%, that checkpoint is abnormal. The outlier threshold of the checkpoint "network retransmission rate within the cluster" is 10%; because the monitoring data of that checkpoint is 5.3%, which is below the outlier threshold of 10%, that checkpoint is normal. The outlier threshold of the checkpoint "disk health score within the cluster" is 60; because the monitoring data of that checkpoint is 15, which is below the outlier threshold of 60, that checkpoint is normal. The processing result obtained is therefore: for the check rule corresponding to the memory leak, the checkpoints "total number of created files" and "memory-usage growth slope" are abnormal; for the check rule corresponding to the read-write long-tail, the checkpoint "read-write call frequency of the past week" is abnormal.
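For illustration only, the threshold matching described above can be sketched in a few lines. All identifiers, the rule layout, and the function below are hypothetical assumptions; the patent does not specify an implementation.

```python
# Hypothetical sketch: each check rule maps checkpoint names to outlier
# thresholds, and a problem is flagged when any of its checkpoints'
# monitoring data exceeds the corresponding threshold.

CHECK_RULES = {
    "memory_leak": {
        "created_file_total": 30,
        "memory_growth_slope": 0.20,
        "traffic_pressure_change_rate": 0.14,
    },
    "read_write_long_tail": {
        "rw_call_frequency": 0.30,
        "network_retransmission_rate": 0.10,
        "disk_health_score": 60,
    },
}

def match_rules(monitoring_data, rules=CHECK_RULES):
    """Return {problem: [abnormal checkpoints]} for data exceeding thresholds."""
    results = {}
    for problem, checkpoints in rules.items():
        abnormal = [cp for cp, threshold in checkpoints.items()
                    if monitoring_data.get(cp, 0) > threshold]
        if abnormal:
            results[problem] = abnormal
    return results

data = {
    "created_file_total": 34,               # > 30   -> abnormal
    "memory_growth_slope": 0.48,            # > 0.20 -> abnormal
    "traffic_pressure_change_rate": 0.01,   # < 0.14 -> normal
    "rw_call_frequency": 0.756,             # > 0.30 -> abnormal
    "network_retransmission_rate": 0.053,   # < 0.10 -> normal
    "disk_health_score": 15,                # < 60   -> normal
}
print(match_rules(data))
# {'memory_leak': ['created_file_total', 'memory_growth_slope'],
#  'read_write_long_tail': ['rw_call_frequency']}
```

The output reproduces the worked example: two abnormal checkpoints for the memory leak and one for the read-write long-tail.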
In the above embodiment of the application, in step S13 the monitoring data of each checkpoint is processed according to the check rule corresponding to the problem to be checked, and the corresponding processing result is obtained. Then, in step S14, the corresponding problem is retrieved based on the processing result. Because the processing result states that, for the check rule corresponding to the memory leak, the checkpoints "total number of created files" and "memory-usage growth slope" are abnormal, and that, for the check rule corresponding to the read-write long-tail, the checkpoint "read-write call frequency of the past week" is abnormal, the problems retrieved are the memory leak and the read-write long-tail. In step S14, health warning information is then generated and fed back based on the relevant information of these problems.
Further, the relevant information of the problem includes at least any one of the following: the time of occurrence of the problem, the monitoring data of each related checkpoint, and the checkpoints whose monitoring data was abnormal when the problem occurred. In the above embodiment, the health warning information generated from the relevant information of the problem therefore includes the problem, its time of occurrence, and each checkpoint whose monitoring data was abnormal when the problem occurred, together with its monitoring data.
For example, based on the processing result, namely that for the check rule corresponding to the memory leak the checkpoints "total number of created files" and "memory-usage growth slope" are abnormal, and that for the check rule corresponding to the read-write long-tail the checkpoint "read-write call frequency of the past week" is abnormal, the problems retrieved are the memory leak and the read-write long-tail. The health warning information is then generated from the relevant information of these problems according to the warning report template of the distributed file system, and fed back. The generated health warning information is {{memory leak: at t1, total number of created files 34 is abnormal; at t2, memory-usage growth slope 48% is abnormal}; {read-write long-tail: at t3, read-write call frequency of the past week 75.6% is abnormal}}. It is fed back to the system maintenance personnel so that, based on the fed-back health warning information, they can locate the problems, receive advance warning for each checkpoint of the cluster, and handle the relevant health warning information. This improves the real-time performance of multi-checkpoint monitoring of the distributed file system on line, achieves advance warning for multiple checkpoints in the cluster together with handling of the health warning information, and also improves the accuracy of the health-status pre-judgment for each checkpoint corresponding to the problem in the cluster.
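Assembling the templated warning text above can be sketched as follows. The function name, the input shape, and the exact template format are illustrative assumptions rather than the patent's actual report template.

```python
# Hypothetical sketch: build the health-warning string from the problems
# retrieved and the abnormal checkpoints recorded for each of them.

def build_warning(problem_hits):
    """problem_hits: {problem: [(time, checkpoint, value), ...]}"""
    parts = []
    for problem, hits in problem_hits.items():
        details = ", ".join(
            f"at {t}, {cp} = {val} is abnormal" for t, cp, val in hits
        )
        parts.append(f"{{{problem}: {details}}}")
    return "{" + "; ".join(parts) + "}"

warning = build_warning({
    "memory_leak": [("t1", "created_file_total", 34),
                    ("t2", "memory_growth_slope", "48%")],
    "read_write_long_tail": [("t3", "rw_call_frequency", "75.6%")],
})
print(warning)
```

The printed string mirrors the nested-brace report in the example above, one inner group per problem.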
In step S14, if in all the processing results the monitoring data of every checkpoint stays within its outlier threshold, health-status information is generated instead, so that the maintenance personnel of the distributed file system know that the entire distributed file system is healthy and no health-warning handling is needed.
In the embodiments of the application, while the problem rule base is used to aggregate the checkpoint monitoring data of each cluster in the distributed system, the problem rule base itself also needs to be created and continuously updated, as shown in Fig. 2.
Fig. 2 shows a schematic flow chart of creating the problem rule base in a method for checking cluster health status according to another aspect of the application. The method comprises step S15 and step S16. Step S15 comprises creating the problem rule base, which includes at least one problem and its corresponding check rule; step S16 comprises updating the problems in the problem rule base and their corresponding check rules.
In the embodiments of the application, before the monitoring data of each checkpoint in the distributed file system is aggregated, the problem rule base needs to be created. The problem rule base includes at least one problem and the check rule corresponding to each problem, and each check rule includes at least one checkpoint and the outlier threshold of each checkpoint. In other words, before a problem occurs, multiple checkpoints will already have become abnormal, and the corresponding problem can be pre-judged based on those abnormal checkpoints.
For example, the problem rule base includes problem 1, problem 2 and problem 3. The check rule corresponding to problem 1 is {problem 1: checkpoint A's outlier threshold is A1, checkpoint B's outlier threshold is B1, and checkpoint C's outlier threshold is C1}; the check rule corresponding to problem 2 is {problem 2: checkpoint D's outlier threshold is D1, checkpoint E's outlier threshold is E1, and checkpoint F's outlier threshold is F1}; the check rule corresponding to problem 3 is {problem 3: checkpoint G's outlier threshold is G1 and checkpoint H's outlier threshold is H1}.
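One possible in-memory representation of the rule base just described is a mapping from each problem to its check rule, i.e. a mapping from checkpoint to outlier threshold. The names and placeholder thresholds below are purely illustrative.

```python
# Hypothetical sketch of the problem rule base: problem -> {checkpoint: threshold}.
# A1, B1, ... stand in for the concrete outlier thresholds of the example.

problem_rule_base = {
    "problem_1": {"A": "A1", "B": "B1", "C": "C1"},
    "problem_2": {"D": "D1", "E": "E1", "F": "F1"},
    "problem_3": {"G": "G1", "H": "H1"},
}

def rule_for(problem):
    """Return the check rule of a problem, or an empty rule if unknown."""
    return problem_rule_base.get(problem, {})

print(rule_for("problem_1"))  # {'A': 'A1', 'B': 'B1', 'C': 'C1'}
```

Keeping the rule base as plain data makes the later update steps (adding, dropping, or re-thresholding checkpoints) simple dictionary operations.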
As users' data explodes, the scale of the distributed file system keeps growing. When pre-judging the health status of such an ever-growing distributed file system on line, multiple checkpoints in fact become abnormal in advance of a problem's occurrence. It is therefore necessary to iterate over the abnormal monitoring data of each checkpoint within the set time period before the problem occurs, in order to find the check rule that best reflects the problem when it becomes abnormal, as shown in Fig. 3.
Fig. 3 shows a schematic flow chart of step S16, updating the problem rule base, in a method for checking cluster health status provided in an embodiment of the application. The method includes step S161, step S162, step S163 and step S164.
In step S161, the relevant information of the cluster to be checked, the problem to be updated and its initial monitoring threshold are obtained. In step S162, based on the initial monitoring threshold, the time of occurrence of the problem to be updated and the monitoring data of all checkpoints within the set time period before that time of occurrence are obtained from the relevant information of the cluster, and the abnormal checkpoints are determined and recorded based on that monitoring data. In step S163, each time the problem to be updated occurs within a set time period, the occurrence probability of each checkpoint when the problem to be updated occurs is updated based on the abnormal checkpoints recorded for the current set time period and the abnormal checkpoints of the historical record. In step S164, the check rule of the problem to be updated is updated based on the checkpoints whose updated occurrence probability is higher than the set probability, together with their relevant information.
In the embodiments of the application, when a problem in the problem rule base needs to be updated, first, step S161 obtains the cluster position information and review time period of the cluster to be checked, the problem to be updated (i.e. the problem to be trained), and its corresponding initial monitoring threshold. Then, based on the initial monitoring threshold, step S162 obtains, from the cluster indicated by the cluster position information, the time of occurrence of the problem to be updated within the review time period and the monitoring data of all checkpoints within the set time period before that time of occurrence, and records the checkpoints whose monitoring data is abnormal. Next, each time the problem to be updated occurs within a set time period, step S163 updates the occurrence probability of each checkpoint when the problem to be updated occurs, based on the abnormal checkpoints recorded for the current set time period and the abnormal checkpoints of the historical record. Finally, step S164 updates the check rule of the problem to be updated based on the checkpoints whose updated occurrence probability is higher than the set probability, together with their relevant information. The problem rule base is thus updated by updating the check rule of the problem to be updated, so that the problem rule base can reflect the abnormal checkpoints in the distributed file system more comprehensively and accurately, the health status of the multiple checkpoints corresponding to a problem can be monitored when the problem occurs, and the accuracy and real-time performance of the health-status pre-judgment for each checkpoint corresponding to the problem in the cluster are improved.
Further, the initial monitoring threshold includes the outlier thresholds of the monitoring data of all checkpoints and the weight threshold of the abnormal checkpoints. Step S162 comprises: based on the initial monitoring threshold, obtaining from the relevant information of the cluster the time of occurrence of the problem to be updated and the monitoring data of all checkpoints within the set time period before that time of occurrence, and determining and recording the abnormal checkpoints based on that monitoring data. Specifically, step S162 comprises: based on the outlier thresholds of the monitoring data of all checkpoints, obtaining from the relevant information of the cluster the time of occurrence of the problem to be updated and the monitoring data of all checkpoints within the set time period before that time of occurrence, and recording an abnormal checkpoint when its weight exceeds the weight threshold, where the weight of a checkpoint is determined based on the occurrence probability of that abnormal checkpoint.
It should be noted that when the problem to be updated occurs, the occurrence probabilities and weights of the checkpoints are computed as follows. Suppose the problem to be updated, problem 1, occurs 1000 times within the set time period, checkpoint A becomes abnormal 654 times, checkpoint B 252 times, and checkpoint C 94 times. Then the occurrence probability of checkpoint A is 65.4%, that of checkpoint B is 25.2%, and that of checkpoint C is 9.4%. The weight of checkpoint A is 65.4%/(65.4%+25.2%+9.4%) = 65.4%, the weight of checkpoint B is 25.2%/(65.4%+25.2%+9.4%) = 25.2%, and the weight of checkpoint C is 9.4%/(65.4%+25.2%+9.4%) = 9.4%. Of course, other existing or future methods of computing the occurrence probabilities and weights of the checkpoints, if applicable to the application, should also fall within the protection scope of the application and are incorporated herein by reference.
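The probability and weight computation in the example above can be sketched directly. The function name and input shape are illustrative assumptions.

```python
# Hypothetical sketch: occurrence probability = abnormal count / total
# occurrences of the problem; weight = probability normalized by the sum
# of all checkpoints' probabilities.

def occurrence_stats(problem_count, abnormal_counts):
    probs = {cp: n / problem_count for cp, n in abnormal_counts.items()}
    total = sum(probs.values())
    weights = {cp: p / total for cp, p in probs.items()}
    return probs, weights

probs, weights = occurrence_stats(1000, {"A": 654, "B": 252, "C": 94})
print(probs)                                        # {'A': 0.654, 'B': 0.252, 'C': 0.094}
print({cp: round(w, 4) for cp, w in weights.items()})  # same values: probs sum to 1
```

In this example the weights coincide with the probabilities because the three counts add up to the problem's total occurrences, matching the 65.4%/25.2%/9.4% figures in the text.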
Preferably, determining and recording the abnormal checkpoints based on the monitoring data in step S162 comprises: judging whether the monitoring data of a checkpoint exceeds its outlier threshold, and if so, determining and recording that checkpoint as abnormal.
For example, suppose the check rule corresponding to problem 1, the problem to be updated in the problem rule base, is {problem 1: checkpoint A's outlier threshold is A1, checkpoint B's outlier threshold is B1, and checkpoint C's outlier threshold is C1}, and the weight threshold of the checkpoints is 10%. Based on the cluster position information and review time period to be checked, the time of occurrence t of problem 1 and the monitoring data of all checkpoints within the two set time windows before t, denoted (t+Δt) and (t+2Δt), are obtained, and the abnormal checkpoints are recorded according to whether the monitoring data of each checkpoint exceeds its outlier threshold. Suppose that in the set time window (t+Δt) before t, the checkpoints whose monitoring data exceeds the corresponding outlier thresholds are checkpoint A, checkpoint B and checkpoint C, with occurrence probabilities of 65.4%, 25.2% and 9.4% respectively in that window. Computing weights from these occurrence probabilities (the weight of each checkpoint being the ratio of its occurrence probability to the sum of the occurrence probabilities of all the abnormal checkpoints) gives checkpoint A a weight of 65.4%, checkpoint B a weight of 25.2%, and checkpoint C a weight of 9.4%. Since the weight threshold of the checkpoints is 10%, the abnormal checkpoints recorded in the set time window (t+Δt), i.e. those whose weight exceeds the weight threshold, are checkpoint A with weight 65.4% and checkpoint B with weight 25.2%. Suppose that in the set time window (t+2Δt) before t, the checkpoints whose monitoring data exceeds the corresponding outlier thresholds are checkpoint A, checkpoint B and checkpoint D, with occurrence probabilities of 50.5%, 1.4% and 48.1% respectively. Computing weights from these occurrence probabilities gives checkpoint A a weight of 50.5%, checkpoint B a weight of 1.4%, and checkpoint D a weight of 48.1%. Since the weight threshold of the checkpoints is 10%, the abnormal checkpoints recorded in the set time window (t+2Δt), i.e. those whose weight exceeds the weight threshold, are checkpoint A with weight 50.5% and checkpoint D with weight 48.1%.
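The per-window recording step just illustrated can be sketched as follows; the function and its input shape are illustrative assumptions, not the patent's implementation.

```python
# Hypothetical sketch: normalize the occurrence probabilities observed in
# one time window into weights, and record only the checkpoints whose
# weight exceeds the weight threshold (10% in the example).

def record_abnormal(window_probs, weight_threshold=0.10):
    total = sum(window_probs.values())
    weights = {cp: p / total for cp, p in window_probs.items()}
    return {cp: w for cp, w in weights.items() if w > weight_threshold}

# Window (t+dt): A 65.4%, B 25.2%, C 9.4% -> C falls below the threshold.
r1 = record_abnormal({"A": 0.654, "B": 0.252, "C": 0.094})
print({cp: round(w, 3) for cp, w in r1.items()})  # {'A': 0.654, 'B': 0.252}

# Window (t+2dt): A 50.5%, B 1.4%, D 48.1% -> B falls below the threshold.
r2 = record_abnormal({"A": 0.505, "B": 0.014, "D": 0.481})
print({cp: round(w, 3) for cp, w in r2.items()})  # {'A': 0.505, 'D': 0.481}
```

Both outputs match the recorded sets in the worked example: {A, B} for the first window and {A, D} for the second.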
Further, step S163 comprises: each time the problem to be updated occurs within a set time window, updating the occurrence probability of each checkpoint when the problem to be updated occurs, based on the abnormal checkpoints recorded for the current set time window and the abnormal checkpoints of the historical record. Specifically, step S163 comprises: each time the problem to be updated occurs within a set time window, determining the current weights of the checkpoints in the current set time window from the occurrence probabilities of the abnormal checkpoints recorded for that window, and then updating the occurrence probability of each checkpoint when the problem to be updated occurs based on the current weights of the checkpoints and the history weights of the abnormal checkpoints of the historical record.
Following the above embodiment of the application: suppose that when problem 1, the problem to be updated, becomes abnormal at time t, the abnormal checkpoints recorded in the current set time window (t+Δt), i.e. those whose weight exceeds the weight threshold, are checkpoint A with current weight 65.4% and checkpoint B with current weight 25.2%, where the current weight of a checkpoint is determined from its occurrence probability; and that in the historical set time window (t+2Δt), checkpoint A has history weight 50.5%, checkpoint B has history weight 1.4%, and checkpoint D has history weight 48.1%. Then, based on the current weight and history weight of each checkpoint of problem 1, the occurrence probability of each checkpoint when the problem to be updated occurs is updated; that is, the comprehensive weight of each checkpoint when the problem to be updated occurs is updated, where the comprehensive weight of a checkpoint is the average of its current weight and history weight. The comprehensive weight of checkpoint A for the problem to be updated is therefore (65.4%+50.5%)/2 = 57.95%, the comprehensive weight of checkpoint B is (25.2%+1.4%)/2 = 13.3%, and the comprehensive weight of checkpoint D is (0+48.1%)/2 = 24.05%. Updating the occurrence probability of each checkpoint when problem 1 occurs from its current weight and the history weight of the historical record thus gives: the updated occurrence probability of checkpoint A is 57.95%, that of checkpoint B is 13.3%, and that of checkpoint D is 24.05%.
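The averaging rule in this example can be sketched as follows; a checkpoint absent from one record contributes 0 to that side of the average, as with checkpoint D above. The function is an illustrative assumption.

```python
# Hypothetical sketch: the updated occurrence probability of a checkpoint
# is the average of its current-window weight and its history weight,
# taking 0 where the checkpoint is missing from one of the records.

def update_probabilities(current, history):
    checkpoints = sorted(set(current) | set(history))
    return {cp: (current.get(cp, 0.0) + history.get(cp, 0.0)) / 2
            for cp in checkpoints}

updated = update_probabilities(
    current={"A": 0.654, "B": 0.252},
    history={"A": 0.505, "B": 0.014, "D": 0.481},
)
print({cp: round(p, 4) for cp, p in updated.items()})
# {'A': 0.5795, 'B': 0.133, 'D': 0.2405}
```

The output reproduces the 57.95% / 13.3% / 24.05% comprehensive weights from the worked example.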
Following the above embodiment of the application, in step S164 the set probability is numerically equal to the weight threshold of the checkpoints, i.e. the set probability is 10%. Because, for problem 1, the updated occurrence probability of checkpoint A (57.95%), the updated occurrence probability of checkpoint B (13.3%), and the updated occurrence probability of checkpoint D (24.05%) are all higher than the set probability of 10%, checkpoint C is discarded from the check rule of problem 1 in the problem rule base, checkpoint D and its corresponding outlier threshold are added to the check rule of the problem to be updated in the problem rule base, and the check rule of the problem to be updated is updated based on checkpoint A, checkpoint B and checkpoint D, the checkpoints whose updated occurrence probability is higher than the set probability, together with their relevant information.
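Step S164 as just illustrated amounts to a filter over the updated probabilities. The function and the placeholder thresholds are illustrative assumptions.

```python
# Hypothetical sketch: keep only the checkpoints whose updated occurrence
# probability exceeds the set probability (10%), and rebuild the check rule
# from their outlier thresholds. Checkpoint C (probability 0) is discarded;
# checkpoint D is newly added.

def update_rule(updated_probs, thresholds, set_probability=0.10):
    return {cp: thresholds[cp]
            for cp, p in updated_probs.items() if p > set_probability}

new_rule = update_rule(
    {"A": 0.5795, "B": 0.133, "C": 0.0, "D": 0.2405},
    thresholds={"A": "A1", "B": "B1", "C": "C1", "D": "D1"},
)
print(new_rule)  # {'A': 'A1', 'B': 'B1', 'D': 'D1'}
```

The resulting rule matches the example: A, B and D stay (or are added), C is dropped.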
Further, the relevant information of a checkpoint includes at least any one of the following: the outlier threshold of the checkpoint's monitoring data, and the weight of the checkpoint, where the weight of the checkpoint is determined based on its occurrence probability.
Following the above embodiment of the application, the check rule of problem 1, the problem to be updated, is updated with checkpoint A together with the outlier threshold A1 of its monitoring data and its weight 57.95%, checkpoint B together with the outlier threshold B1 of its monitoring data and its weight 13.3%, and checkpoint D together with the outlier threshold D1 of its monitoring data and its weight 24.05%.
As the distributed file system undergoes continuous health checks that yield health warning information, and that information is handled in advance, users accumulate more than one piece of inspection result information from checking the distributed file system. Since multiple checkpoints in fact become abnormal in advance of a problem's occurrence, it is necessary, based on the acquired inspection result information, to iterate over the abnormal monitoring data of each checkpoint within the set time period before the problem occurs, in order to find the check rule that best reflects the problem when it becomes abnormal, as shown in Fig. 4.
Fig. 4 shows a schematic flow chart of step S16, updating the problem rule base, in a method for checking cluster health status provided in another embodiment of the application. The method includes step S165, step S166, step S167 and step S168.
In step S165, the problem to be updated is obtained, and its time of occurrence is obtained from at least one piece of inspection result information. In step S166, the monitoring data of all checkpoints within the set time period before the time of occurrence is obtained, and the abnormal checkpoints are determined and recorded based on that monitoring data. In step S167, the occurrence probability of each checkpoint when the problem to be updated occurs is updated based on the abnormal checkpoints recorded for the current set time period and the abnormal checkpoints of the historical record. In step S168, the check rule of the problem to be updated is updated based on the checkpoints whose updated occurrence probability is higher than the set probability, together with their relevant information.
It should be noted that the inspection result information is the result information related to health warning information that is obtained while checking the distributed file system. The inspection result information includes at least any one of the following: the problem that became abnormal, the time of occurrence of the problem, and the abnormal checkpoints corresponding to the problem's occurrence together with their outlier thresholds. Of course, other existing or future forms of inspection result information, if applicable to the application, should also fall within the protection scope of the application and are incorporated herein by reference.
In the embodiments of the application, when a problem in the problem rule base needs to be updated, first, step S165 obtains the problem to be updated and obtains its time of occurrence from at least one piece of inspection result information. Then, step S166 obtains the monitoring data of all checkpoints within the set time period before the time of occurrence, and determines and records the abnormal checkpoints based on the monitoring data of the checkpoints and their outlier thresholds. Next, step S167 updates the occurrence probability of each checkpoint when the problem to be updated occurs, based on the abnormal checkpoints recorded for the current set time period and the abnormal checkpoints of the historical record. Finally, step S168 updates the check rule of the problem to be updated based on the checkpoints whose updated occurrence probability is higher than the set probability, together with their relevant information. The problem rule base is thus updated, via the acquired inspection result information, by updating the check rule of the problem to be updated, so that the problem rule base can reflect the abnormal checkpoints in the distributed file system more comprehensively and accurately, the health status of the multiple checkpoints corresponding to a problem can be monitored when the problem occurs, and the accuracy and real-time performance of the health-status pre-judgment for each checkpoint corresponding to the problem in the cluster are improved.
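The S165 to S168 flow can be sketched end to end in one small function. All names, the input shape, and the toy numbers below are illustrative assumptions under the simplification that each occurrence contributes one observation window.

```python
# Hypothetical end-to-end sketch of Fig. 4: for each recorded occurrence of
# the problem to be updated, collect the checkpoints that were abnormal in
# the window before it (S166), turn abnormal counts into occurrence
# probabilities (S167), and rebuild the check rule from the checkpoints
# whose probability exceeds the set probability (S168).

def rebuild_rule(occurrence_windows, thresholds, set_probability=0.10):
    """occurrence_windows: one {checkpoint: value} dict per occurrence."""
    counts = {}
    for window in occurrence_windows:
        for cp, value in window.items():
            if value > thresholds[cp]:            # S166: abnormal checkpoint
                counts[cp] = counts.get(cp, 0) + 1
    n = len(occurrence_windows)
    probs = {cp: c / n for cp, c in counts.items()}   # S167
    return {cp: thresholds[cp]                        # S168
            for cp, p in probs.items() if p > set_probability}

thresholds = {"A": 10, "B": 20, "C": 30}
windows = [{"A": 15, "B": 25, "C": 5},   # A and B abnormal before occurrence 1
           {"A": 12, "B": 5,  "C": 5}]   # only A abnormal before occurrence 2
print(rebuild_rule(windows, thresholds))                       # {'A': 10, 'B': 20}
print(rebuild_rule(windows, thresholds, set_probability=0.6))  # {'A': 10}
```

Raising the set probability prunes checkpoints that are only occasionally abnormal, which is the mechanism by which the rule base converges on the checkpoints that best reflect the problem.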
Fig. 5 shows a schematic structural diagram of an equipment for checking cluster health status according to one aspect of the application. The equipment includes an information acquisition device 11, a rule device 12, a monitoring processing device 13 and a warning feedback device 14.
The information acquisition device 11 obtains the relevant information of the cluster to be checked; the rule device 12 obtains at least one problem to be checked and its corresponding check rule; the monitoring processing device 13, based on the relevant information of the cluster, obtains from the cluster the monitoring data of the checkpoints related to the check rule, and aggregates the monitoring data to obtain a processing result; the warning feedback device 14 retrieves the corresponding problem based on the processing result, and generates and feeds back health warning information based on the relevant information of the problem.
Here, the equipment 1 includes, but is not limited to, a user equipment, or an equipment formed by integrating a user equipment with a network device over a network. The user equipment includes, but is not limited to, any mobile electronic product capable of human-computer interaction with the user via a touch pad, such as a smartphone or a PDA, and the mobile electronic product may run any operating system, such as the Android operating system or the iOS operating system. The network device includes an electronic device capable of automatically performing numerical computation and information processing according to preset or stored instructions, whose hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like. The network includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a VPN, and a wireless ad hoc network. Preferably, the equipment may also be a shell script running on the equipment formed by integrating the user equipment with the network device over a network. Of course, those skilled in the art will understand that the above equipment is only an example; other existing or future equipment, if applicable to the application, should also fall within the protection scope of the application and is incorporated herein by reference.
The above devices work continuously with one another. Here, those skilled in the art will understand that "continuously" means that each device operates in real time, or according to a set operating mode, or with real-time adjustment, respectively.
In the embodiments of the application, the cluster to be checked in the information acquisition device 11 is located on one or more cluster nodes of a distributed file system, where a distributed file system is one in which the physical storage resources managed by the file system are not necessarily attached to the local node but are connected to the nodes through a computer network. Below, the application is explained in detail through specific embodiments that take a distributed file system as an example. Of course, explaining the application through the example of a distributed file system is illustrative only; the embodiments of the application are not limited thereto, and the following embodiments can equally be realized in other distributed cluster systems.
Further, the checkpoints include at least any one of the following: hardware devices in the cluster, and local modules of software equipment in the cluster.
It should be noted that the checkpoints in the monitoring processing device 13 may include, but are not limited to, the hardware devices of the individual servers under each cluster node of the distributed file system and the local modules of the software equipment of the distributed file system. The hardware devices of a server include the central processing unit, memory, hard disk, chipset, input/output bus, input/output devices, power supply, chassis, and the like; the local modules of the software equipment include, but are not limited to, the system setup module, program modules, fault-diagnosis program modules, exception-handling modules, and the like. Of course, other existing or future checkpoints, if applicable to the application, should also fall within the protection scope of the application and are incorporated herein by reference.
Further, the information acquisition device 11 obtains the relevant information of the cluster to be checked based on a request submitted by the user, where the relevant information includes the cluster position information and the review time period. In the embodiments of the application, when the health status of the distributed file system on line needs to be monitored, the request submitted by the user is obtained, and based on that request the cluster position information of the cluster to be checked and the review time period for monitoring the cluster to be checked are acquired, where the cluster position information and the review time period belong to the relevant information of the cluster to be checked.
For example, when the health status of the distributed file system on line needs to be monitored, a request submitted by the user for monitoring each checkpoint in the cluster is obtained; based on the request submitted by the user, the cluster position information of the corresponding cluster to be checked and the monitoring data of multiple checkpoints for one or more review time periods are acquired, where the cluster position information may be the actual geographic range of cluster nodes distributed over different regions, or the actual geographic range of cluster nodes within the same region.
Further, the rule acquisition device 12 obtains at least one problem to be checked and its corresponding check rules from the problem rule base.
It should be noted that the problem rule base in the rule acquisition device 12 mainly contains the problems that have been established and their corresponding check rules. The problems include memory leaks, read-write long tails, data loss, system performance problems, system availability problems, service quality problems, and so on; a check rule includes checkpoints and the outlier thresholds of their corresponding monitoring data. Of course, other problem rule bases that exist now or may appear in the future, insofar as they are applicable to the present application, shall also be included within the protection scope of the present application and are incorporated herein by reference.
For example, if the problem in the problem rule base is a memory leak, its corresponding check rule includes: the checkpoint of the change rate of the traffic pressure over the past week and its outlier threshold, the checkpoint of the total number of created files and its outlier threshold, and the checkpoint of the memory usage growth slope and its outlier threshold. If the problem in the problem rule base is a read-write long tail, its corresponding check rule includes: the checkpoint of the read-write call frequency over the past week and its outlier threshold, the checkpoint of the network retransmission rate in the cluster and its outlier threshold, and the checkpoint of the disk health score in the cluster and its outlier threshold.
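For illustration only, such a problem rule base can be modeled as a mapping from each problem to its check rule, where a check rule maps checkpoint names to the outlier thresholds of their monitoring data. This is a minimal sketch, not part of the claimed embodiments; the names and threshold values are taken from the examples in this description.

```python
# Minimal sketch of a problem rule base: problem -> check rule, where a check
# rule maps each checkpoint to the outlier threshold of its monitoring data.
# Names and values are illustrative, taken from the examples in the text.
PROBLEM_RULE_BASE = {
    "memory_leak": {
        "weekly_traffic_pressure_change_rate": 0.14,
        "created_file_total": 30,
        "memory_usage_growth_slope": 0.20,
    },
    "read_write_long_tail": {
        "weekly_read_write_call_frequency": 0.30,
        "network_retransmission_rate": 0.10,
        "disk_health_score": 60,
    },
}

def check_rule_for(problem):
    """Return the check rule (checkpoint -> outlier threshold) for a problem."""
    return PROBLEM_RULE_BASE[problem]
```

A rule acquisition device would hand such check rules to the monitoring processing step, which then knows both which checkpoints to fetch and which thresholds to compare against.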
Further, the monitoring processing unit 13 includes a searching unit (not shown) and a data acquisition unit (not shown). The searching unit (not shown) is configured to find the cluster based on the cluster position information and to obtain, in the cluster, the checkpoints related to the check rules; the data acquisition unit (not shown) is configured to obtain, from the monitoring module of the cluster, the monitoring data of the related checkpoints within the review time period.
It should be noted that the monitoring module of the cluster is mainly responsible for collecting, from the monitoring system in the cluster, the monitoring data of each checkpoint related to each hardware device and software device. Of course, other cluster monitoring modules that exist now or may appear in the future, insofar as they are applicable to the present application, shall also be included within the protection scope of the present application and are incorporated herein by reference.
In the above embodiment of the present application, if the cluster position information in the searching unit (not shown) is the geographic location information of Shanghai, the Shanghai cluster is found based on that information, and the checkpoints related to the check rules are obtained from the Shanghai cluster; the data acquisition unit (not shown) then obtains from the monitoring module of the Shanghai cluster the monitoring data of each related checkpoint within the review time period. The monitoring data obtained are: 34 for the checkpoint of the total number of created files, 48% for the checkpoint of the memory usage growth slope, 1% for the checkpoint of the change rate of the traffic pressure over the past week, 75.6% for the checkpoint of the read-write call frequency over the past week, 5.3% for the checkpoint of the network retransmission rate in the cluster, and 100 for the checkpoint of the disk health score in the cluster.
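The lookup-and-fetch step above can be sketched as follows. The monitoring module is modeled here as a plain mapping from checkpoint names to timestamped readings; this structure and all names are hypothetical stand-ins, assumed only for illustration.

```python
# Illustrative sketch: given the monitoring module of a cluster found via its
# position information, fetch the readings of the checkpoints named in the
# check rules, restricted to the review time period. The monitoring module is
# modeled as {checkpoint: [(timestamp, value), ...]} -- a stand-in structure.
def fetch_checkpoint_data(monitoring_module, checkpoints, period):
    start, end = period
    return {
        cp: [v for (ts, v) in monitoring_module[cp] if start <= ts <= end]
        for cp in checkpoints
        if cp in monitoring_module
    }

module = {
    "created_file_total": [(1, 20), (2, 34), (9, 50)],
    "memory_usage_growth_slope": [(2, 0.48)],
}
data = fetch_checkpoint_data(
    module,
    ["created_file_total", "memory_usage_growth_slope"],
    (1, 5),  # review time period: only readings with 1 <= ts <= 5 are kept
)
```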
The monitoring processing unit 13 includes a data processing unit (not shown), which is configured to process the monitoring data of each checkpoint based on the check rules corresponding to the problem to be checked, so as to obtain at least one checkpoint with abnormal monitoring data and to feed back the processing result.
In the above embodiment of the present application, the data processing unit may compare the monitoring data of the multiple checkpoints against the check rules corresponding to the problem to be checked, in order to judge whether the problem to be checked exists. To prejudge whether the distributed file system online has a memory leak, the monitoring data of the three checkpoints, namely the change rate of the traffic pressure over the past week, the total number of created files, and the memory usage growth slope, are matched against their corresponding check rules to obtain the prejudgment result; to prejudge whether the distributed file system online has a read-write long tail, the monitoring data of the three checkpoints, namely the read-write call frequency over the past week, the network retransmission rate in the cluster, and the disk health score in the cluster, are matched against their corresponding check rules to obtain the prejudgment result.
For example, suppose that in the check rule for the memory leak problem to be checked, the outlier threshold for the checkpoint of the total number of created files is 30; since the monitoring data of that checkpoint, 34, exceeds the outlier threshold 30, the total number of created files is abnormal. The outlier threshold for the checkpoint of the memory usage growth slope is 20%; since its monitoring data, 48%, exceeds the outlier threshold 20%, the memory usage growth slope is abnormal. The outlier threshold for the checkpoint of the change rate of the traffic pressure over the past week is 14%; since its monitoring data, 1%, is below the outlier threshold 14%, that checkpoint is normal. Suppose further that in the check rule for the read-write long-tail problem to be checked, the outlier threshold for the checkpoint of the read-write call frequency over the past week is 30%; since its monitoring data, 75.6%, exceeds the outlier threshold 30%, that checkpoint is abnormal. The outlier threshold for the checkpoint of the network retransmission rate in the cluster is 10%; since its monitoring data, 5.3%, is below the outlier threshold 10%, that checkpoint is normal. The outlier threshold for the checkpoint of the disk health score in the cluster is 60; since its monitoring data, 15, is below the outlier threshold 60, that checkpoint is normal. The processing result obtained is therefore: for the problem to be checked, in the check rule corresponding to the memory leak, the checkpoint of the total number of created files is abnormal and the checkpoint of the memory usage growth slope is abnormal; in the check rule corresponding to the read-write long tail, the checkpoint of the read-write call frequency over the past week is abnormal.
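The threshold comparison just walked through can be sketched as a single function: a checkpoint is flagged abnormal when its reading exceeds the outlier threshold in the check rule, matching the worked example above. Checkpoint names, thresholds, and readings are illustrative.

```python
# Sketch of the aggregation step: compare each checkpoint's monitoring datum
# against its outlier threshold from the check rule and collect the abnormal
# checkpoints. A value above its threshold counts as abnormal here, matching
# the worked example; names and numbers are illustrative.
def find_abnormal(check_rule, readings):
    return sorted(cp for cp, threshold in check_rule.items()
                  if readings.get(cp, 0) > threshold)

memory_leak_rule = {
    "created_file_total": 30,
    "memory_usage_growth_slope": 0.20,
    "weekly_traffic_pressure_change_rate": 0.14,
}
readings = {
    "created_file_total": 34,            # 34 > 30  -> abnormal
    "memory_usage_growth_slope": 0.48,   # 48% > 20% -> abnormal
    "weekly_traffic_pressure_change_rate": 0.01,  # 1% < 14% -> normal
}
abnormal = find_abnormal(memory_leak_rule, readings)
# abnormal -> ['created_file_total', 'memory_usage_growth_slope']
```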
In the above embodiment of the present application, after the monitoring processing unit 13 processes the monitoring data of each checkpoint based on the check rules corresponding to the problem to be checked, the corresponding processing result is obtained. Then, in the early warning feedback device 14, the corresponding problem is retrieved based on the processing result. Since the processing result is that, for the problem to be checked, the checkpoint of the total number of created files in the check rule corresponding to the memory leak is abnormal, the checkpoint of the memory usage growth slope is abnormal, and the checkpoint of the read-write call frequency over the past week in the check rule corresponding to the read-write long tail is abnormal, the corresponding problems retrieved are the memory leak and the read-write long tail. In the early warning feedback device 14, a health warning message is then generated and fed back based on the relevant information of the problems.
Further, the relevant information of the problem includes at least any one of the following: the time of occurrence of the problem, the monitoring data of each related checkpoint, and the checkpoints whose monitoring data are abnormal when the problem occurs.
Following the above embodiment, the health warning message generated from the relevant information of the problem thus includes the problem, its time of occurrence, and each checkpoint whose monitoring data are abnormal when the problem occurs, together with its monitoring data.
For example, based on the processing result that, for the problem to be checked, the checkpoint of the total number of created files in the check rule corresponding to the memory leak is abnormal, the checkpoint of the memory usage growth slope is abnormal, and the checkpoint of the read-write call frequency over the past week in the check rule corresponding to the read-write long tail is abnormal, the corresponding problems retrieved are the memory leak and the read-write long tail. The health warning message is then generated and fed back based on the relevant information of the problems according to the warning report template in the distributed file system, where the generated health warning message is {{memory leak: at t1, the total number of created files, 34, is abnormal; at t2, the memory usage growth slope, 48%, is abnormal}; {read-write long tail: at t3, the read-write call frequency over the past week, 75.6%, is abnormal}}. The message is fed back to the system maintenance personnel so that, based on the feedback, they can give an advance warning for each checkpoint of the cluster where a problem is found and handle the relevant health warning information. This improves the real-time performance of multi-checkpoint monitoring of the distributed file system online, achieves the purpose of warning in advance about multiple checkpoints in the cluster and handling the health warning information, and also improves the accuracy of the health status prejudgment for each checkpoint corresponding to the problems in the cluster.
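The assembly of such a warning message can be sketched as below; the record layout is a simplified stand-in for the warning report template, and all field names are assumptions made for illustration.

```python
# Sketch of the early-warning feedback step: for a problem whose check rule
# produced abnormal checkpoints, emit a record carrying the problem, the times
# of occurrence, and the abnormal checkpoints with their monitoring data,
# mirroring the example report. Field names are illustrative stand-ins for
# the warning report template.
def build_warning(problem, anomalies):
    # anomalies: list of (time, checkpoint, monitoring datum) triples
    return {
        "problem": problem,
        "events": [{"time": t, "checkpoint": cp, "datum": d}
                   for (t, cp, d) in anomalies],
    }

warning = build_warning("memory_leak", [
    ("t1", "created_file_total", 34),
    ("t2", "memory_usage_growth_slope", 0.48),
])
```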
In the early warning feedback device 14, if the monitoring data of every checkpoint in all the processing results do not exceed their outlier thresholds, a health status message is generated instead, so that the maintenance personnel of the distributed file system know that the entire distributed file system is healthy and that no health warning processing is required.
In an embodiment of the present application, while the problem rule base is used to perform aggregation computation on the monitoring data of the checkpoints of each cluster in the distributed cluster system, the problem rule base also needs to be created and continuously updated, as shown in Fig. 6.
Fig. 6 shows a schematic structural diagram of the devices for creating the problem rule base in the equipment for checking cluster health status according to another aspect of the present application. The equipment 1 further includes a rule creation device 15 and a rule update device 16. The rule creation device 15 creates the problem rule base, which includes at least one problem and its corresponding check rules; the rule update device 16 updates the problems in the problem rule base and their corresponding check rules.
In an embodiment of the present application, before the aggregation computation is performed on the monitoring data of each checkpoint in the distributed file system, the problem rule base needs to be created, where the problem rule base includes at least one problem and the check rules corresponding to each problem, and a check rule includes at least one checkpoint and the outlier threshold of each checkpoint. That is, before a problem occurs, multiple checkpoints have already become abnormal, and the corresponding problem is prejudged based on those abnormal checkpoints.
For example, the problem rule base includes problem 1, problem 2 and problem 3, where the check rule corresponding to problem 1 is {problem 1: the outlier threshold of checkpoint A is A1, the outlier threshold of checkpoint B is B1, and the outlier threshold of checkpoint C is C1}; the check rule corresponding to problem 2 is {problem 2: the outlier threshold of checkpoint D is D1, the outlier threshold of checkpoint E is E1, and the outlier threshold of checkpoint F is F1}; the check rule corresponding to problem 3 is {problem 3: the outlier threshold of checkpoint G is G1 and the outlier threshold of checkpoint H is H1}.
As users' massive data explodes, the scale of the distributed file system keeps growing. During the prejudgment of the health status of the ever-larger distributed file system online, multiple checkpoints actually become abnormal in advance before a problem occurs. It is therefore necessary to iterate over the abnormal monitoring data of each checkpoint within the set time periods before the problem occurs, in order to find the check rule that best reflects the problem when it becomes abnormal, as shown in Fig. 3.
Fig. 7 shows a schematic structural diagram of the rule update device 16 in the equipment for checking cluster health status provided in an embodiment of the present application. The rule update device 16 includes: a first information acquisition unit 161, a first recording unit 162, a first probability update unit 163 and a first rule update unit 164.
The first information acquisition unit 161 obtains the relevant information of the cluster to be checked, the problem to be updated and its initial monitoring threshold. The first recording unit 162, based on the initial monitoring threshold, obtains from the relevant information of the cluster the monitoring data of all checkpoints at each time of occurrence of the problem to be updated and within the set time period before that time of occurrence, and determines and records the abnormal checkpoints based on the monitoring data. Each time the problem to be updated occurs within a set time period, the first probability update unit 163 updates the probability of occurrence of each checkpoint when the problem to be updated occurs, based on the abnormal checkpoints recorded in the current set time period and the abnormal checkpoints recorded in history. The first rule update unit 164 updates the check rule of the problem to be updated based on the checkpoints whose updated probability of occurrence is higher than the set probability, together with their relevant information.
In an embodiment of the present application, when a problem in the problem rule base needs to be updated, first, the first information acquisition unit 161 obtains the cluster position information and review time period of the cluster to be checked, together with the problem to be updated for training and its corresponding initial monitoring threshold. Then, based on the initial monitoring threshold, the first recording unit 162 obtains, from the cluster corresponding to the cluster position information, the monitoring data of all checkpoints at each time of occurrence of the problem to be updated within the review time period and within the set time period before that time of occurrence, and records the checkpoints whose monitoring data are abnormal. Next, each time the problem to be updated occurs within a set time period, the first probability update unit 163 updates the probability of occurrence of each checkpoint when the problem to be updated occurs, based on the abnormal checkpoints recorded in the current set time period and the abnormal checkpoints recorded in history. Finally, the first rule update unit 164 updates the check rule of the problem to be updated based on the checkpoints whose updated probability of occurrence is higher than the set probability, together with their relevant information. By updating the check rule of the problem to be updated, the problem rule base itself is updated, so that it reflects the abnormal checkpoints in the distributed file system more comprehensively and more accurately, realizes the monitoring of the health status of the multiple checkpoints involved when the problem occurs, and improves the accuracy and real-time performance of the health status prejudgment for each checkpoint corresponding to the problems in the cluster.
Further, the initial monitoring threshold includes the outlier thresholds of the monitoring data of all checkpoints and the weight threshold of the abnormal checkpoints. The first recording unit 162 is configured to: based on the outlier thresholds of the monitoring data of all checkpoints, obtain from the relevant information of the cluster the monitoring data of all checkpoints at the time of occurrence of the problem to be updated and within the set time period before that time of occurrence, and record the abnormal checkpoints whose weight exceeds the weight threshold, where the weight of a checkpoint is determined based on the probability of occurrence of that checkpoint.
It should be noted that, when the problem to be updated occurs, the probability of occurrence and the weight of a checkpoint are computed as follows. Suppose problem 1 to be updated occurs 1000 times within the set time period, checkpoint A is abnormal 654 times, checkpoint B 252 times, and checkpoint C 94 times. Then the probability of occurrence of checkpoint A is 65.4%, that of checkpoint B is 25.2%, and that of checkpoint C is 9.4%, where the weight of checkpoint A is 65.4%/(65.4%+25.2%+9.4%) = 65.4%, the weight of checkpoint B is 25.2%/(65.4%+25.2%+9.4%) = 25.2%, and the weight of checkpoint C is 9.4%/(65.4%+25.2%+9.4%) = 9.4%. Of course, other methods for computing the probability of occurrence and the weight of the checkpoints that exist now or may appear in the future, insofar as they are applicable to the present application, shall also be included within the protection scope of the present application and are incorporated herein by reference.
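The probability and weight computation just described can be reproduced in a few lines; the counts (654, 252, 94 out of 1000) are the ones from the example.

```python
# Occurrence counts from the example: over 1000 occurrences of the problem,
# checkpoint A was abnormal 654 times, B 252 times, C 94 times. Probability
# of occurrence is count / number of problem occurrences; the weight of a
# checkpoint is its probability divided by the sum of all probabilities.
def occurrence_probabilities(counts, total_occurrences):
    return {cp: n / total_occurrences for cp, n in counts.items()}

def weights(probs):
    s = sum(probs.values())
    return {cp: p / s for cp, p in probs.items()}

probs = occurrence_probabilities({"A": 654, "B": 252, "C": 94}, 1000)
w = weights(probs)
# Because A, B and C together account for every occurrence, the probabilities
# sum to 1 and the weights coincide with them: A 65.4%, B 25.2%, C 9.4%.
```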
Preferably, the first recording unit 162 includes a judging subunit (not shown) and a recording subunit (not shown). The judging subunit (not shown) is configured to judge whether the monitoring data of a checkpoint exceeds its outlier threshold; the recording subunit (not shown) is configured to determine and record the corresponding abnormal checkpoint if it does.
For example, the check rule corresponding to problem 1 to be updated in the problem rule base is obtained as {problem 1: the outlier threshold of checkpoint A is A1, the outlier threshold of checkpoint B is B1, and the outlier threshold of checkpoint C is C1}, and the weight threshold of the checkpoints is 10%. Based on the cluster position information and the review time period to be checked, the monitoring data of all checkpoints are obtained within the set time periods (t+Δt) and (t+2Δt) before the time of occurrence t of problem 1 to be updated, and the abnormal checkpoints are recorded according to whether their monitoring data exceed the outlier thresholds. Suppose that within the set time period (t+Δt) before the time of occurrence t, the checkpoints whose monitoring data exceed their outlier thresholds are checkpoint A, checkpoint B and checkpoint C, with probabilities of occurrence of 65.4%, 25.2% and 9.4% respectively. Weights are computed from these probabilities, the weight of each checkpoint being the ratio of its probability of occurrence to the sum of the probabilities of occurrence of all the checkpoints, giving weights within (t+Δt) of 65.4% for checkpoint A, 25.2% for checkpoint B and 9.4% for checkpoint C. Since the weight threshold of the checkpoints is 10%, the abnormal checkpoints recorded within (t+Δt), i.e. those whose weight exceeds the weight threshold, are checkpoint A with weight 65.4% and checkpoint B with weight 25.2%. Suppose that within the set time period (t+2Δt) before the time of occurrence t, the checkpoints whose monitoring data exceed their outlier thresholds are checkpoint A, checkpoint B and checkpoint D, with probabilities of occurrence of 50.5%, 1.4% and 48.1% respectively; the weights computed from these probabilities are 50.5% for checkpoint A, 1.4% for checkpoint B and 48.1% for checkpoint D. With the weight threshold of 10%, the abnormal checkpoints recorded within (t+2Δt) are checkpoint A with weight 50.5% and checkpoint D with weight 48.1%.
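The recording step in this example boils down to filtering each window's abnormal checkpoints by the weight threshold; the sketch below reproduces the two windows above with their illustrative weights.

```python
# Sketch of the recording step: within one set time period, keep only the
# abnormal checkpoints whose weight exceeds the weight threshold (10% in the
# example). The weights are the illustrative values from the text.
def record_abnormal(weighted_checkpoints, weight_threshold):
    return {cp: w for cp, w in weighted_checkpoints.items()
            if w > weight_threshold}

# Window (t + dt): A 65.4%, B 25.2%, C 9.4% -> C falls below the 10% threshold.
window1 = record_abnormal({"A": 0.654, "B": 0.252, "C": 0.094}, 0.10)
# Window (t + 2dt): A 50.5%, B 1.4%, D 48.1% -> B falls below the threshold.
window2 = record_abnormal({"A": 0.505, "B": 0.014, "D": 0.481}, 0.10)
```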
Further, the first probability update unit 163 includes a weight determination subunit (not shown) and a probability update subunit (not shown). The weight determination subunit (not shown) is configured, each time the problem to be updated occurs within a set time period, to determine the current weight of each checkpoint in the current set time period based on the probabilities of occurrence of the abnormal checkpoints recorded in that period; the probability update subunit (not shown) is configured to update the probability of occurrence of each checkpoint when the problem to be updated occurs, based on the current weights of the checkpoints and the history weights of the abnormal checkpoints recorded in history.
Following the above embodiment of the present application, suppose that when problem 1 to be updated becomes abnormal at time t, the abnormal checkpoints recorded within the preceding current set time period (t+Δt), i.e. those whose weight exceeds the weight threshold, are checkpoint A with current weight 65.4% and checkpoint B with current weight 25.2%, where the current weight of a checkpoint is determined based on its probability of occurrence; and that within the preceding historical set time period (t+2Δt), the recorded abnormal checkpoints are checkpoint A with history weight 50.5% and checkpoint D with history weight 48.1%. Then, based on the current weight and history weight of each checkpoint of problem 1 to be updated, the probability of occurrence of each checkpoint when the problem to be updated occurs is updated, i.e. the comprehensive weight of each checkpoint when the problem to be updated occurs is updated, where the comprehensive weight of a checkpoint is the average of its current weight and its history weight. The comprehensive weight of checkpoint A for the problem to be updated is therefore (65.4%+50.5%)/2 = 57.95%, that of checkpoint B is (25.2%+1.4%)/2 = 13.3%, and that of checkpoint D is (0+48.1%)/2 = 24.05%. Based on the current weights of the checkpoints and the history weights of the checkpoints recorded in history, the probability of occurrence of each checkpoint when problem 1 to be updated occurs is thus updated: the updated probability of occurrence of checkpoint A is 57.95%, that of checkpoint B is 13.3%, and that of checkpoint D is 24.05%.
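The averaging rule for the comprehensive weight can be sketched directly; a checkpoint absent from a window contributes 0 to the average, and the weights below are the illustrative figures from the example (including checkpoint B's 1.4% history weight, which the example carries into the average).

```python
# Sketch of the probability update: the comprehensive weight of a checkpoint
# is the average of its current weight and its history weight; a checkpoint
# missing from one window contributes 0. The figures reproduce the example.
def update_probabilities(current, history):
    checkpoints = set(current) | set(history)
    return {cp: (current.get(cp, 0.0) + history.get(cp, 0.0)) / 2
            for cp in checkpoints}

current = {"A": 0.654, "B": 0.252}               # weights at (t + dt)
history = {"A": 0.505, "B": 0.014, "D": 0.481}   # weights at (t + 2dt)
updated = update_probabilities(current, history)
# A -> (65.4% + 50.5%) / 2 = 57.95%; B -> 13.3%; D -> (0 + 48.1%) / 2 = 24.05%
```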
Following the above embodiment of the present application, the set probability in the first rule update unit 164 is numerically equal to the weight threshold of the checkpoints, i.e. the set probability is 10%. Since the updated probabilities of occurrence of checkpoint A (57.95%), checkpoint B (13.3%) and checkpoint D (24.05%) for problem 1 to be updated are all higher than the set probability of 10%, checkpoint C is discarded from the check rule of problem 1 to be updated in the problem rule base, checkpoint D and its corresponding outlier threshold are added to the check rule of the problem to be updated in the problem rule base, and the check rule of the problem to be updated is updated based on checkpoint A, checkpoint B and checkpoint D, whose updated probabilities of occurrence are higher than the set probability, together with their relevant information.
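The rule update itself can be sketched as a filter over the updated probabilities: checkpoints above the set probability stay in (or enter) the check rule with their outlier thresholds, the rest are discarded. The symbolic thresholds A1, B1, C1, D1 stand in for the real values, as in the example.

```python
# Sketch of the rule update step: keep a checkpoint in the problem's check
# rule only if its updated probability of occurrence exceeds the set
# probability; a newly arriving checkpoint brings its own outlier threshold.
# Symbolic thresholds (A1, B1, ...) stand in for real values.
def update_check_rule(old_rule, updated_probs, new_thresholds, set_probability):
    return {cp: new_thresholds.get(cp, old_rule.get(cp))
            for cp, p in updated_probs.items()
            if p > set_probability}

old_rule = {"A": "A1", "B": "B1", "C": "C1"}
updated_probs = {"A": 0.5795, "B": 0.133, "C": 0.05, "D": 0.2405}
new_thresholds = {"D": "D1"}                  # threshold for the new checkpoint
new_rule = update_check_rule(old_rule, updated_probs, new_thresholds, 0.10)
# C (below the set probability) is discarded; D is added with threshold D1.
```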
Further, the relevant information of a checkpoint includes at least any one of the following: the outlier threshold of the monitoring data of the checkpoint, and the weight of the checkpoint, where the weight of the checkpoint is determined based on its probability of occurrence.
Following the above embodiment of the present application, the check rule of problem 1 to be updated is updated with checkpoint A together with the outlier threshold A1 of its monitoring data and its weight 57.95%, checkpoint B together with the outlier threshold B1 of its monitoring data and its weight 13.3%, and checkpoint D together with the outlier threshold D1 of its monitoring data and its weight 24.05%.
As health checks are continuously performed on the distributed file system and health warning information is obtained and processed in advance on that basis, the user can obtain more than one piece of inspection result information from processing the distributed file system. Since multiple checkpoints actually become abnormal in advance before a problem occurs, it is necessary, based on the obtained inspection result information, to iterate over the abnormal monitoring data of each checkpoint within the set time periods before the problem occurs, in order to find the check rule that best reflects the problem when it becomes abnormal, as shown in Fig. 8.
Fig. 8 shows a schematic structural diagram of the rule update device 16 in the equipment for checking cluster health status provided in another embodiment of the present application. The rule update device 16 includes: a second information acquisition unit 165, a second recording unit 166, a second probability update unit 167 and a second rule update unit 168.
The second information acquisition unit 165 obtains the problem to be updated and obtains the time of occurrence of the problem to be updated from at least one piece of inspection result information; the second recording unit 166 obtains the monitoring data of all checkpoints within the set time period before the time of occurrence, and determines and records the abnormal checkpoints based on the monitoring data; the second probability update unit 167 updates the probability of occurrence of each checkpoint when the problem to be updated occurs, based on the abnormal checkpoints recorded in the current set time period and the abnormal checkpoints recorded in history; the second rule update unit 168 updates the check rule of the problem to be updated based on the checkpoints whose updated probability of occurrence is higher than the set probability, together with their relevant information.
It should be noted that the inspection result information includes at least any one of the following: the problem that became abnormal, the time of occurrence of the problem, and the corresponding abnormal checkpoints and their outlier thresholds when the problem occurred. Of course, other inspection result information that exists now or may appear in the future, insofar as it is applicable to the present application, shall also be included within the protection scope of the present application and is incorporated herein by reference.
It is first, described when being updated need in described problem rule base the problem of in embodiments herein
Second information acquisition unit 165 obtains problem to be updated, and is treated more described in acquisition from least one inspection result information
The time of occurrence point of new problem;Then, second recording unit 166 is obtained in the setting time section before the time of occurrence point
The monitoring data of all checkpoints, monitoring data and its outlier threshold based on the checkpoint determine and record abnormal
The checkpoint;Then, the second probability updating unit 167 is abnormal based on being recorded in presently described setting time section
The checkpoint and the abnormal checkpoint of historical record, update each checkpoint and occur in the problem to be updated
When probability of occurrence;Finally, the Second Rule updating block 168 is higher than setting probability based on the probability of occurrence after renewal
The checkpoint and its relevant information, the inspection rule of the problem to be updated is updated, so as to pass through at least one of acquisition
The inspection result information updates the inspection rule of the problem to be updated to update described problem rule base so that described to ask
The abnormal examination point that rule base more comprehensively more accurately can reflect in distributed file system is inscribed, and is realized to being asked described in appearance
The monitoring of the health status of corresponding multiple checkpoints during topic, and improve to each inspection corresponding to the described problem in cluster
The degree of accuracy of the health status anticipation of point and real-time.
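The fields of the inspection result information enumerated above can be pictured, purely for illustration, as a small record type; the class and field names here are assumptions, not terms from the patent:

```python
from dataclasses import dataclass, field

# Hypothetical container mirroring the inspection result information
# described above: the problem that occurred, its time of occurrence,
# and the abnormal checkpoints with their outlier thresholds.
@dataclass
class InspectionResult:
    problem: str                       # the problem that occurred
    time_of_occurrence: float          # epoch timestamp of the occurrence
    abnormal_checkpoints: dict = field(default_factory=dict)  # checkpoint -> outlier threshold
```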
In addition, the present application also provides equipment for checking cluster health status, comprising:
a processor;
and a memory arranged to store computer-executable instructions which, when executed, cause the processor to:
obtain related information of a cluster to be checked;
obtain at least one problem to be checked and its corresponding inspection rule;
based on the related information of the cluster, obtain from the cluster the monitoring data of checkpoints related to the inspection rule, and aggregate the monitoring data to obtain a processing result;
retrieve the corresponding problem based on the processing result, and generate and feed back health warning information based on the related information of the problem.
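The four processor steps above can be sketched as follows. Every name (`check_cluster_health`, `fetch_monitoring_data`, the rule fields) is an illustrative assumption, and the aggregation shown, flagging a checkpoint when any sample crosses its outlier threshold, is one simple reading of the "aggregation processing" described in the text:

```python
def check_cluster_health(cluster_id, rules, fetch_monitoring_data):
    """Run every inspection rule against the cluster's checkpoints.

    rules: {problem: {"checkpoints": [...], "thresholds": {checkpoint: limit}}}
    fetch_monitoring_data: callable returning a list of samples per checkpoint.
    """
    warnings = []
    for problem, rule in rules.items():
        # Step 3: pull monitoring data for the checkpoints the rule names.
        data = {cp: fetch_monitoring_data(cluster_id, cp)
                for cp in rule["checkpoints"]}
        # Aggregate: a checkpoint is abnormal if any sample crosses its threshold.
        abnormal = [cp for cp, samples in data.items()
                    if any(v > rule["thresholds"][cp] for v in samples)]
        # Step 4: retrieve the problem and feed back health warning information.
        if abnormal:
            warnings.append({"problem": problem,
                             "abnormal_checkpoints": abnormal})
    return warnings
```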
Compared with the prior art, the method and equipment for checking cluster health status provided by the embodiments of the present application obtain related information of a cluster to be checked; obtain at least one problem to be checked and its corresponding inspection rule; based on the related information of the cluster, obtain from the cluster the monitoring data of the checkpoints related to the inspection rule, and aggregate the monitoring data to obtain a processing result; and retrieve the corresponding problem based on the processing result, and generate and feed back health warning information based on the related information of the problem. Because the problems that may occur in the online distributed file system are, as far as possible, formalized into corresponding inspection rules before the health status of the system is predicted, the monitoring data of each checkpoint can be obtained directly at prediction time and aggregated according to the inspection rules of those checkpoints to obtain the processing result, which improves the accuracy of health-status monitoring over the multiple checkpoints under each cluster node. The corresponding problem is then retrieved based on the processing result, and health warning information is generated and fed back based on the related information of the problem, so that maintenance personnel can, guided by the warning, act in advance on each problematic checkpoint under each cluster node. This improves the timeliness of multi-checkpoint monitoring of the online distributed file system and achieves early multi-point warning. Further, aggregating the monitoring data to obtain the processing result includes: processing the monitoring data of each checkpoint separately based on the inspection rule corresponding to the problem to be checked, so as to obtain at least one checkpoint with abnormal monitoring data and feed back the processing result, thereby enabling the health status of the multiple checkpoints corresponding to a problem to be monitored when the problem occurs and improving the accuracy of health-status prediction for each checkpoint corresponding to the problem in the cluster.
Further, the method and equipment for checking cluster health status provided by the embodiments of the present application also create a problem rule base comprising at least one problem and its corresponding inspection rule, and update the problems and their corresponding inspection rules in the problem rule base. This ensures that inspection rules are created for each checkpoint in the online distributed file system that is likely to develop a problem, and that the problems and inspection rules in the rule base are updated based on the monitoring data of each checkpoint, so that the created rule base reflects the abnormal checkpoints in the distributed file system more comprehensively and accurately, enables the health status of the multiple checkpoints corresponding to a problem to be monitored when the problem occurs, and improves the accuracy and timeliness of health-status prediction for each checkpoint corresponding to the problem in the cluster.
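A problem rule base of the kind described above, at least one problem, each with an inspection rule naming its checkpoints and their outlier thresholds, might be represented as follows. This is a non-authoritative sketch; the problem name, checkpoint names, and field names are invented for illustration:

```python
# Illustrative problem rule base: each problem maps to an inspection
# rule listing its checkpoints and the outlier threshold of each
# checkpoint's monitoring data.
problem_rule_base = {
    "datanode-slow-write": {
        "checkpoints": ["disk_util", "net_tx_errors"],
        "thresholds": {"disk_util": 0.95, "net_tx_errors": 10},
    },
}

def add_rule(rule_base, problem, checkpoints, thresholds):
    """Create or replace the inspection rule for a problem."""
    rule_base[problem] = {"checkpoints": list(checkpoints),
                          "thresholds": dict(thresholds)}
```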
Further, updating the problems and their corresponding inspection rules in the problem rule base includes: obtaining related information of the cluster to be checked, a problem to be updated and its initial monitoring threshold; based on the initial monitoring threshold, obtaining from the related information of the cluster the time of occurrence of the problem to be updated and the monitoring data of all checkpoints in a set time period before that time of occurrence, and determining and recording the abnormal checkpoints based on the monitoring data; each time the problem to be updated occurs in a set time period, updating the probability of occurrence of each checkpoint when the problem occurs, based on the abnormal checkpoints recorded in the current set time period and those in the historical record; and updating the inspection rule of the problem to be updated based on the checkpoints whose updated probability of occurrence is higher than a set probability, together with their related information. In this way, the monitoring data of all checkpoints in the set time period before the time of occurrence is evaluated against the initial monitoring threshold; the probabilities of occurrence are updated whenever the problem to be updated occurs; and the checkpoints whose updated probability of occurrence exceeds the set probability, together with their related information, are used to update the inspection rule of the problem to be updated and hence the problem rule base, so that the rule base reflects the abnormal checkpoints in the distributed file system more comprehensively and accurately, enables the health status of the multiple checkpoints corresponding to a problem to be monitored when the problem occurs, and improves the accuracy and timeliness of health-status prediction for each checkpoint corresponding to the problem in the cluster.
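One minimal way to realize the probability update described above, assuming each occurrence of the problem contributes one binary abnormal/normal observation per checkpoint and the probability of occurrence is the running fraction of occurrences in which the checkpoint was abnormal. All names are assumptions, and the weight-based variant of claims 9 and 13 is not modeled here:

```python
def update_rule(history, current_abnormal, all_checkpoints, set_probability):
    """Update per-checkpoint occurrence probabilities after one occurrence
    of the problem to be updated.

    history: mutable dict {checkpoint: abnormal_count}; the key
    "occurrences" tracks how many times the problem has occurred.
    Returns the checkpoints whose updated probability of occurrence
    is higher than the set probability, with their probabilities.
    """
    history["occurrences"] = history.get("occurrences", 0) + 1
    for cp in current_abnormal:
        history[cp] = history.get(cp, 0) + 1
    n = history["occurrences"]
    probs = {cp: history.get(cp, 0) / n for cp in all_checkpoints}
    # Keep only checkpoints exceeding the set probability: these, with
    # their related information, would update the inspection rule.
    return {cp: p for cp, p in probs.items() if p > set_probability}
```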
It should be noted that the present application may be implemented in software and/or a combination of software and hardware; for example, it may be implemented using an application-specific integrated circuit (ASIC), a general-purpose computer, or any other similar hardware device. In one embodiment, the software program of the present application may be executed by a processor to realize the steps or functions described above. Likewise, the software program of the present application (including related data structures) may be stored in a computer-readable recording medium such as a RAM, a magnetic or optical drive, a floppy disk, or a similar device. In addition, some steps or functions of the present application may be implemented in hardware, for example, as a circuit that cooperates with a processor to perform each step or function.
In addition, part of the present application may be embodied as a computer program product, such as computer program instructions which, when executed by a computer, can invoke or provide the method and/or technical solution according to the present application through the operation of that computer. The program instructions that invoke the method of the present application may be stored in a fixed or removable recording medium, and/or transmitted via broadcast or other data streams in signal-bearing media, and/or stored in a working memory of a computer device that runs according to the program instructions. Here, an embodiment of the present application includes a device comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein, when the computer program instructions are executed by the processor, the device is triggered to run the method and/or technical solution based on the foregoing embodiments of the present application.
It is obvious to a person skilled in the art that the present application is not limited to the details of the above exemplary embodiments, and that the present application can be realized in other specific forms without departing from its spirit or essential characteristics. The embodiments should therefore be regarded in all respects as exemplary rather than restrictive, and the scope of the present application is defined by the appended claims rather than by the foregoing description; all changes that fall within the meaning and range of equivalency of the claims are therefore intended to be embraced in the present application. No reference sign in a claim should be construed as limiting the claim concerned. Furthermore, the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or devices recited in a device claim may also be implemented by a single unit or device through software or hardware. Words such as "first" and "second" are used to denote names and do not indicate any particular order.
Claims (29)
1. A method for checking cluster health status, wherein the method comprises:
obtaining related information of a cluster to be checked;
obtaining at least one problem to be checked and its corresponding inspection rule;
based on the related information of the cluster, obtaining from the cluster the monitoring data of checkpoints related to the inspection rule, and aggregating the monitoring data to obtain a processing result;
retrieving the corresponding problem based on the processing result, and generating and feeding back health warning information based on related information of the problem.
2. The method according to claim 1, wherein obtaining at least one problem to be checked and its corresponding inspection rule comprises:
obtaining the at least one problem to be checked and its corresponding inspection rule from a problem rule base.
3. The method according to claim 1, wherein obtaining related information of a cluster to be checked comprises:
obtaining the related information of the cluster to be checked based on a request submitted by a user, wherein the related information comprises: cluster position information and a check time period.
4. The method according to claim 3, wherein obtaining from the cluster the monitoring data of the checkpoints related to the inspection rule comprises:
finding the cluster based on the cluster position information, and obtaining the checkpoints in the cluster related to the inspection rule;
obtaining the monitoring data of the related checkpoints within the check time period from a monitoring module of the cluster.
5. The method according to claim 1, wherein aggregating the monitoring data to obtain a processing result comprises:
processing the monitoring data of each checkpoint separately based on the inspection rule corresponding to the problem to be checked, so as to obtain at least one checkpoint with abnormal monitoring data and feed back the processing result.
6. The method according to claim 1, wherein the related information of the problem comprises at least any one of the following:
the time of occurrence of the problem, the monitoring data of each related checkpoint when the problem occurs, and the checkpoints whose monitoring data is abnormal.
7. The method according to claim 1, wherein the method further comprises:
creating a problem rule base, the problem rule base comprising at least one problem and its corresponding inspection rule;
updating the problems and their corresponding inspection rules in the problem rule base.
8. The method according to claim 7, wherein updating the problems and their corresponding inspection rules in the problem rule base comprises:
obtaining related information of the cluster to be checked, a problem to be updated and its initial monitoring threshold;
based on the initial monitoring threshold, obtaining from the related information of the cluster the time of occurrence of the problem to be updated and the monitoring data of all checkpoints in a set time period before the time of occurrence, and determining and recording the abnormal checkpoints based on the monitoring data;
when the problem to be updated occurs in each set time period, updating the probability of occurrence of each checkpoint when the problem to be updated occurs, based on the abnormal checkpoints recorded in the current set time period and the abnormal checkpoints in the historical record;
updating the inspection rule of the problem to be updated based on the checkpoints whose updated probability of occurrence is higher than a set probability and their related information.
9. The method according to claim 8, wherein the initial monitoring threshold comprises: the outlier threshold of the monitoring data of all checkpoints and a weight threshold for abnormal checkpoints;
and wherein, based on the initial monitoring threshold, obtaining from the related information of the cluster the time of occurrence of the problem to be updated and the monitoring data of all checkpoints in the set time period before the time of occurrence, and determining and recording the abnormal checkpoints based on the monitoring data comprises:
based on the outlier threshold of the monitoring data of all checkpoints, obtaining from the related information of the cluster the time of occurrence of the problem to be updated and the monitoring data of all checkpoints in the set time period before the time of occurrence, and recording a corresponding checkpoint when the weight of an abnormal checkpoint exceeds the weight threshold, wherein the weight of a checkpoint is determined based on the probability of occurrence of the abnormal checkpoint.
10. The method according to claim 8, wherein determining and recording the abnormal checkpoints based on the monitoring data comprises:
judging whether the monitoring data of a checkpoint exceeds its outlier threshold;
if so, determining and recording the corresponding abnormal checkpoint.
11. The method according to claim 7, wherein updating the problems and their corresponding inspection rules in the problem rule base comprises:
obtaining a problem to be updated, and obtaining the time of occurrence of the problem to be updated from at least one piece of inspection result information;
obtaining the monitoring data of all checkpoints in a set time period before the time of occurrence, and determining and recording the abnormal checkpoints based on the monitoring data;
updating the probability of occurrence of each checkpoint when the problem to be updated occurs, based on the abnormal checkpoints recorded in the current set time period and the abnormal checkpoints in the historical record;
updating the inspection rule of the problem to be updated based on the checkpoints whose updated probability of occurrence is higher than a set probability and their related information.
12. The method according to any one of claims 8 to 11, wherein the related information of a checkpoint comprises at least any one of the following:
the outlier threshold of the monitoring data of the checkpoint, and the weight of the checkpoint, wherein the weight of the checkpoint is determined based on the probability of occurrence of the checkpoint.
13. The method according to any one of claims 8 to 12, wherein, when the problem to be updated occurs in each set time period, updating the probability of occurrence of each checkpoint when the problem to be updated occurs based on the abnormal checkpoints recorded in the current set time period and the abnormal checkpoints in the historical record comprises:
when the problem to be updated occurs in each set time period, determining the current weight of a checkpoint in the current set time period based on the probability of occurrence of the abnormal checkpoints recorded in the current set time period;
updating the probability of occurrence of each checkpoint when the problem to be updated occurs, based on the current weight of the checkpoint and the history weight of the abnormal checkpoints in the historical record.
14. The method according to any one of claims 1 to 13, wherein the checkpoints comprise at least any one of the following:
local modules of hardware devices in the cluster, and software devices in the cluster.
15. Equipment for checking cluster health status, wherein the equipment comprises:
an information acquisition device for obtaining related information of a cluster to be checked;
a rule device for obtaining at least one problem to be checked and its corresponding inspection rule;
a monitoring processing device for obtaining, based on the related information of the cluster, the monitoring data of checkpoints in the cluster related to the inspection rule, and aggregating the monitoring data to obtain a processing result;
a warning feedback device for retrieving the corresponding problem based on the processing result, and generating and feeding back health warning information based on related information of the problem.
16. The equipment according to claim 15, wherein the rule device is configured to:
obtain the at least one problem to be checked and its corresponding inspection rule from a problem rule base.
17. The equipment according to claim 15, wherein the information acquisition device is configured to:
obtain the related information of the cluster to be checked based on a request submitted by a user, wherein the related information comprises: cluster position information and a check time period.
18. The equipment according to claim 17, wherein the monitoring processing device comprises:
a searching unit for finding the cluster based on the cluster position information, and obtaining the checkpoints in the cluster related to the inspection rule;
a data acquisition unit for obtaining the monitoring data of the related checkpoints within the check time period from a monitoring module of the cluster.
19. The equipment according to claim 15, wherein the monitoring processing device comprises:
a data processing unit for processing the monitoring data of each checkpoint separately based on the inspection rule corresponding to the problem to be checked, so as to obtain at least one checkpoint with abnormal monitoring data and feed back the processing result.
20. The equipment according to claim 15, wherein the related information of the problem comprises at least any one of the following:
the time of occurrence of the problem, the monitoring data of each related checkpoint when the problem occurs, and the checkpoints whose monitoring data is abnormal.
21. The equipment according to claim 15, wherein the equipment further comprises:
a rule creating device for creating a problem rule base, the problem rule base comprising at least one problem and its corresponding inspection rule;
a rule updating device for updating the problems and their corresponding inspection rules in the problem rule base.
22. The equipment according to claim 21, wherein the rule updating device comprises:
a first information acquisition unit for obtaining related information of the cluster to be checked, a problem to be updated and its initial monitoring threshold;
a first recording unit for obtaining, based on the initial monitoring threshold, the time of occurrence of the problem to be updated and the monitoring data of all checkpoints in a set time period before the time of occurrence from the related information of the cluster, and determining and recording the abnormal checkpoints based on the monitoring data;
a first probability updating unit for updating, when the problem to be updated occurs in each set time period, the probability of occurrence of each checkpoint when the problem to be updated occurs, based on the abnormal checkpoints recorded in the current set time period and the abnormal checkpoints in the historical record;
a first rule updating unit for updating the inspection rule of the problem to be updated based on the checkpoints whose updated probability of occurrence is higher than a set probability and their related information.
23. The equipment according to claim 22, wherein the initial monitoring threshold comprises: the outlier threshold of the monitoring data of all checkpoints and a weight threshold for abnormal checkpoints;
and the first recording unit is configured to:
based on the outlier threshold of the monitoring data of all checkpoints, obtain from the related information of the cluster the time of occurrence of the problem to be updated and the monitoring data of all checkpoints in the set time period before the time of occurrence, and record a corresponding checkpoint when the weight of an abnormal checkpoint exceeds the weight threshold, wherein the weight of a checkpoint is determined based on the probability of occurrence of the abnormal checkpoint.
24. The equipment according to claim 22, wherein the first recording unit comprises:
a judgment subunit for judging whether the monitoring data of a checkpoint exceeds its outlier threshold;
a recording subunit for, if so, determining and recording the corresponding abnormal checkpoint.
25. The equipment according to claim 21, wherein the rule updating device comprises:
a second information acquisition unit for obtaining a problem to be updated, and obtaining the time of occurrence of the problem to be updated from at least one piece of inspection result information;
a second recording unit for obtaining the monitoring data of all checkpoints in a set time period before the time of occurrence, and determining and recording the abnormal checkpoints based on the monitoring data;
a second probability updating unit for updating the probability of occurrence of each checkpoint when the problem to be updated occurs, based on the abnormal checkpoints recorded in the current set time period and the abnormal checkpoints in the historical record;
a second rule updating unit for updating the inspection rule of the problem to be updated based on the checkpoints whose updated probability of occurrence is higher than a set probability and their related information.
26. The equipment according to any one of claims 22 to 25, wherein the related information of a checkpoint comprises at least any one of the following:
the outlier threshold of the monitoring data of the checkpoint, and the weight of the checkpoint, wherein the weight of the checkpoint is determined based on the probability of occurrence of the checkpoint.
27. The equipment according to any one of claims 22 to 26, wherein the first probability updating unit comprises:
a weight determination subunit for determining, when the problem to be updated occurs in each set time period, the current weight of a checkpoint in the current set time period based on the probability of occurrence of the abnormal checkpoints recorded in the current set time period;
a probability updating subunit for updating the probability of occurrence of each checkpoint when the problem to be updated occurs, based on the current weight of the checkpoint and the history weight of the abnormal checkpoints in the historical record.
28. The equipment according to any one of claims 15 to 27, wherein the checkpoints comprise at least any one of the following:
local modules of hardware devices in the cluster, and software devices in the cluster.
29. Equipment for checking cluster health status, comprising:
a processor;
and a memory arranged to store computer-executable instructions which, when executed, cause the processor to:
obtain related information of a cluster to be checked;
obtain at least one problem to be checked and its corresponding inspection rule;
based on the related information of the cluster, obtain from the cluster the monitoring data of checkpoints related to the inspection rule, and aggregate the monitoring data to obtain a processing result;
retrieve the corresponding problem based on the processing result, and generate and feed back health warning information based on related information of the problem.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610194499 | 2016-03-31 | ||
CN2016101944993 | 2016-03-31 |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107391335A true CN107391335A (en) | 2017-11-24 |
CN107391335B CN107391335B (en) | 2021-09-03 |
Family
ID=60338371
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710205541.1A Active CN107391335B (en) | 2016-03-31 | 2017-03-31 | Method and equipment for checking health state of cluster |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107391335B (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108255676A (en) * | 2018-01-15 | 2018-07-06 | 南京市城市规划编制研究中心 | A kind of monitoring method of software systems client health degree |
CN108874640A (en) * | 2018-05-07 | 2018-11-23 | 北京京东尚科信息技术有限公司 | A kind of appraisal procedure and device of clustering performance |
CN109376043A (en) * | 2018-10-18 | 2019-02-22 | 郑州云海信息技术有限公司 | A kind of method and apparatus of equipment monitoring |
CN110069393A (en) * | 2019-03-11 | 2019-07-30 | 北京互金新融科技有限公司 | Detection method, device, storage medium and the processor of software environment |
CN110278133A (en) * | 2019-07-31 | 2019-09-24 | 中国工商银行股份有限公司 | Inspection method, device, calculating equipment and the medium executed by server |
CN113645525A (en) * | 2021-08-09 | 2021-11-12 | 中国工商银行股份有限公司 | Method, device, equipment and storage medium for checking running state of optical fiber switch |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101123521A (en) * | 2006-08-07 | 2008-02-13 | 华为技术有限公司 | A management method for check points in cluster |
US20090132864A1 (en) * | 2005-10-28 | 2009-05-21 | Garbow Zachary A | Clustering process for software server failure prediction |
US20120254669A1 (en) * | 2011-04-04 | 2012-10-04 | Microsoft Corporation | Proactive failure handling in database services |
CN102957563A (en) * | 2011-08-16 | 2013-03-06 | 中国石油化工股份有限公司 | Linux cluster fault automatic recovery method and Linux cluster fault automatic recovery system |
CN104917627A (en) * | 2015-01-20 | 2015-09-16 | 杭州安恒信息技术有限公司 | Log cluster scanning and analysis method used for large-scale server cluster |
CN104954181A (en) * | 2015-06-08 | 2015-09-30 | 北京集奥聚合网络技术有限公司 | Method for warning faults of distributed cluster devices |
- 2017-03-31 CN CN201710205541.1A patent/CN107391335B/en active Active
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090132864A1 (en) * | 2005-10-28 | 2009-05-21 | Garbow Zachary A | Clustering process for software server failure prediction |
CN101123521A (en) * | 2006-08-07 | 2008-02-13 | 华为技术有限公司 | A management method for check points in cluster |
US20120254669A1 (en) * | 2011-04-04 | 2012-10-04 | Microsoft Corporation | Proactive failure handling in database services |
CN102957563A (en) * | 2011-08-16 | 2013-03-06 | 中国石油化工股份有限公司 | Linux cluster fault automatic recovery method and Linux cluster fault automatic recovery system |
CN104917627A (en) * | 2015-01-20 | 2015-09-16 | 杭州安恒信息技术有限公司 | Log cluster scanning and analysis method used for large-scale server cluster |
CN104954181A (en) * | 2015-06-08 | 2015-09-30 | 北京集奥聚合网络技术有限公司 | Method for warning faults of distributed cluster devices |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108255676A (en) * | 2018-01-15 | 2018-07-06 | 南京市城市规划编制研究中心 | A kind of monitoring method of software systems client health degree |
CN108874640A (en) * | 2018-05-07 | 2018-11-23 | 北京京东尚科信息技术有限公司 | A kind of appraisal procedure and device of clustering performance |
CN109376043A (en) * | 2018-10-18 | 2019-02-22 | 郑州云海信息技术有限公司 | A kind of method and apparatus of equipment monitoring |
CN110069393A (en) * | 2019-03-11 | 2019-07-30 | 北京互金新融科技有限公司 | Detection method, device, storage medium and the processor of software environment |
CN110278133A (en) * | 2019-07-31 | 2019-09-24 | 中国工商银行股份有限公司 | Inspection method, device, calculating equipment and the medium executed by server |
CN110278133B (en) * | 2019-07-31 | 2021-08-13 | 中国工商银行股份有限公司 | Checking method, device, computing equipment and medium executed by server |
CN113645525A (en) * | 2021-08-09 | 2021-11-12 | 中国工商银行股份有限公司 | Method, device, equipment and storage medium for checking running state of optical fiber switch |
Also Published As
Publication number | Publication date |
---|---|
CN107391335B (en) | 2021-09-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107391335A (en) | Method and apparatus for checking cluster health status | |
US9275353B2 (en) | Event-processing operators | |
US9547970B2 (en) | Context-aware wearable safety system | |
US11392469B2 (en) | Framework for testing machine learning workflows | |
CN109905269A (en) | Method and apparatus for determining network failures |
CN110069551A (en) | Spark-based medical device O&M information mining and analysis system and method of use |
JP2009070071A (en) | Learning process abnormality diagnostic device and operator's judgement estimation result collecting device | |
JP2012009064A (en) | Learning-based process abnormality diagnosis device and operator judgement estimation result collection device |
US20160048805A1 (en) | Method of collaborative software development | |
Goel et al. | A data-driven alarm and event management framework | |
Jafarian-Namin et al. | An integrated quality, maintenance and production model based on the delayed monitoring under the ARMA control chart | |
DE112019005467T5 (en) | SYSTEM AND METHOD OF DETECTING AND PREDICTING PATTERNS OF ANOMALY SENSOR BEHAVIOR OF A MACHINE | |
Pan et al. | Google trends analysis of covid-19 pandemic | |
CN114118507A (en) | Risk assessment early warning method and device based on multi-dimensional information fusion | |
Arakelian et al. | Creation of predictive analytics system for power energy objects | |
Swiecki et al. | Does order matter? investigating sequential and cotemporal models of collaboration | |
US11887465B2 (en) | Methods, systems, and computer programs for alarm handling | |
US20180285758A1 (en) | Methods for creating and analyzing dynamic trail networks | |
David et al. | Toward the incorporation of temporal interaction analysis techniques in modeling and understanding sociotechnical systems | |
Borissova et al. | A concept of intelligent e-maintenance decision making system | |
CN116545867A (en) | Method and device for monitoring abnormal performance index of network element of communication network | |
Pegoraro | Process mining on uncertain event data | |
TWM592123U (en) | Intelligent system for inferring system or product quality abnormality | |
CN114676021A (en) | Job log monitoring method and device, computer equipment and storage medium | |
JP2013182471A (en) | Load evaluation device for plant operation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |