CN107391335A - Method and apparatus for checking cluster health status - Google Patents

Method and apparatus for checking cluster health status

Info

Publication number
CN107391335A
Authority
CN
China
Prior art keywords
checkpoint
cluster
occurrence
updated
monitoring data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710205541.1A
Other languages
Chinese (zh)
Other versions
CN107391335B (en)
Inventor
曹锋
林江彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Publication of CN107391335A
Application granted
Publication of CN107391335B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/3003Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06Management of faults, events, alarms or notifications
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06Management of faults, events, alarms or notifications
    • H04L41/0677Localisation of faults
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14Network analysis or design
    • H04L41/147Network analysis or design for predicting network behaviour

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computing Systems (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Quality & Reliability (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The purpose of the present application is to provide a method and apparatus for checking cluster health status. Relevant information of a cluster to be checked is obtained; at least one problem to be checked and its corresponding check rule are obtained; based on the relevant information of the cluster, the monitoring data of the checkpoints related to the check rule is obtained from the cluster, and the monitoring data is aggregated to obtain a processing result; the corresponding problem is retrieved based on the processing result, and health early-warning information is generated and fed back based on the relevant information of the problem. In this way, the health status of the multiple checkpoints associated with a problem is monitored when the problem occurs, the accuracy of predicting the health status of each checkpoint corresponding to a problem in the cluster is improved, the real-time performance of multi-checkpoint monitoring of the online distributed file system is improved, and multiple checkpoints can be alarmed on in advance.

Description

Method and apparatus for checking cluster health status
Technical field
The present application relates to the field of computers, and in particular to a technique for checking cluster health status.
Background art
In distributed-cluster alarm systems, as the mass data generated by user devices explodes, the scale of distributed file systems (Distributed File System) keeps growing. As the clusters hosting a distributed file system age and the business keeps growing, all kinds of problems emerge endlessly, and a single-point problem on an individual server in a cluster node can easily accumulate into a major failure. Raising an alarm through the alarm platform only when a problem erupts, so that maintenance personnel are woken to investigate and fix it, may miss the best time to solve the problem and thereby trigger a failure.
In the prior art, a distributed-cluster alarm system raises single-point alarms separately on the hardware (for example, memory or hard disk) and the operating system, or local modules of software entities, of each individual service device under each cluster node: an alarm is raised when a single point goes wrong, and large numbers of simple abnormal-alarm messages collected from the service devices are forwarded to maintenance personnel as a unified alarm. Because such a system only alarms after a single point has gone wrong, a loosely set alarm threshold may allow failures to be triggered before any alarm is raised, while a strictly set threshold causes large numbers of false alarms. Moreover, because the prior-art system alarms mainly on single points of the hardware and operating system of service devices and does not judge the availability, performance or quality of service of the distributed file system, its alarms for the distributed file system as a whole are one-sided, and alarm accuracy is low. Finally, because the system merely collects the large volume of abnormal alarm information and forwards it to maintenance personnel for investigation and resolution, alarm accuracy is low and real-time performance is poor.
Therefore, in the prior art, using a distributed-cluster alarm system to raise single-point alarms on problems of the hardware and operating system of individual service devices under each cluster node of a distributed file system results in low alarm accuracy and poor real-time performance.
Summary of the invention
The purpose of the present application is to provide a method and apparatus for checking cluster health status, so as to solve the prior-art problem that single-point alarms on the hardware and operating system of individual service devices under each cluster node of a distributed file system lead to low alarm accuracy and poor real-time performance.
According to one aspect of the present application, a method for checking cluster health status is provided, comprising:
obtaining relevant information of a cluster to be checked;
obtaining at least one problem to be checked and its corresponding check rule;
based on the relevant information of the cluster, obtaining from the cluster the monitoring data of the checkpoints related to the check rule, and aggregating the monitoring data to obtain a processing result;
retrieving the corresponding problem based on the processing result, and generating and feeding back health early-warning information based on the relevant information of the problem.
Further, aggregating the monitoring data to obtain a processing result comprises:
based on the check rule corresponding to the problem to be checked, processing the monitoring data of each checkpoint separately, so as to obtain at least one checkpoint whose monitoring data is abnormal, and feeding back the processing result.
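Purely by way of illustration, and not as part of the claimed subject matter, the following minimal Python sketch outlines the check flow described above; the function names, data shapes and the fetch_monitoring_data helper are assumptions introduced here and are not prescribed by the application:

def check_cluster_health(cluster_info, problems, fetch_monitoring_data):
    # cluster_info: assumed shape, e.g. {"position": "Shanghai", "period": (...)}.
    # problems: {problem_name: [(checkpoint, abnormal_threshold), ...]} (the check rules).
    # fetch_monitoring_data(cluster_info, checkpoints) -> {checkpoint: value}; assumed helper.
    warnings = []
    for problem, rule in problems.items():
        checkpoints = [cp for cp, _ in rule]
        monitoring = fetch_monitoring_data(cluster_info, checkpoints)
        # Aggregation: a checkpoint is abnormal when its monitoring data exceeds its threshold.
        abnormal = [(cp, monitoring[cp]) for cp, threshold in rule
                    if cp in monitoring and monitoring[cp] > threshold]
        if abnormal:
            # Retrieve the problem and include its relevant information in the warning.
            warnings.append({"problem": problem, "abnormal_checkpoints": abnormal})
    return warnings  # health early-warning information fed back to maintenance personnel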
According to one aspect of the present application, the method for checking cluster health status further comprises:
creating a problem rule base, the problem rule base comprising at least one problem and its corresponding check rule;
updating the problems and their corresponding check rules in the problem rule base.
Further, updating the problems and their corresponding check rules in the problem rule base comprises:
obtaining relevant information of the cluster to be checked, a problem to be updated and its initial monitoring threshold;
based on the initial monitoring threshold, obtaining from the relevant information of the cluster the monitoring data of all checkpoints within a set time period before the occurrence time point of the problem to be updated and at the occurrence time point, and determining and recording the abnormal checkpoints based on the monitoring data;
each time the problem to be updated occurs within a set time period, updating the occurrence probability of each checkpoint when the problem to be updated occurs, based on the abnormal checkpoints recorded in the current set time period and the abnormal checkpoints recorded historically;
updating the check rule of the problem to be updated based on the checkpoints whose updated occurrence probability is higher than a set probability, and their relevant information.
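As an illustrative sketch only, the recording step above could be realized as follows; the data shapes and the per-sample comparison are assumptions made for clarity, under the assumption that a checkpoint counts as abnormal when any sample in the window exceeds its initial threshold:

def record_abnormal_checkpoints(monitoring_series, occurrence_time, window, initial_thresholds):
    # monitoring_series: {checkpoint: [(timestamp, value), ...]} for the cluster; assumed shape.
    # initial_thresholds: the initial monitoring thresholds, {checkpoint: abnormal_threshold}.
    start = occurrence_time - window
    abnormal = []
    for checkpoint, samples in monitoring_series.items():
        in_window = [value for ts, value in samples if start <= ts <= occurrence_time]
        # Record the checkpoint as abnormal if any sample in the set time period before
        # (and at) the occurrence time point exceeds its abnormal threshold.
        if any(value > initial_thresholds.get(checkpoint, float("inf")) for value in in_window):
            abnormal.append(checkpoint)
    return abnormal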
According to another aspect of the present application, an apparatus for checking cluster health status is also provided, comprising:
an information obtaining device, configured to obtain relevant information of a cluster to be checked;
a rule obtaining device, configured to obtain at least one problem to be checked and its corresponding check rule;
a monitoring processing device, configured to obtain from the cluster, based on the relevant information of the cluster, the monitoring data of the checkpoints related to the check rule, and to aggregate the monitoring data to obtain a processing result;
an early-warning feedback device, configured to retrieve the corresponding problem based on the processing result, and to generate and feed back health early-warning information based on the relevant information of the problem.
Further, the monitoring processing device comprises:
a data processing unit, configured to process the monitoring data of each checkpoint separately based on the check rule corresponding to the problem to be checked, so as to obtain at least one checkpoint whose monitoring data is abnormal, and to feed back the processing result.
According to one aspect of the present application, the apparatus for checking cluster health status further comprises:
a rule creating device, configured to create a problem rule base, the problem rule base comprising at least one problem and its corresponding check rule;
a rule updating device, configured to update the problems and their corresponding check rules in the problem rule base.
Further, the rule updating device comprises:
a first information obtaining unit, configured to obtain relevant information of the cluster to be checked, a problem to be updated and its initial monitoring threshold;
a first recording unit, configured to obtain from the relevant information of the cluster, based on the initial monitoring threshold, the monitoring data of all checkpoints within a set time period before the occurrence time point of the problem to be updated and at the occurrence time point, and to determine and record the abnormal checkpoints based on the monitoring data;
a first probability updating unit, configured to, each time the problem to be updated occurs within a set time period, update the occurrence probability of each checkpoint when the problem to be updated occurs, based on the abnormal checkpoints recorded in the current set time period and the abnormal checkpoints recorded historically;
a first rule updating unit, configured to update the check rule of the problem to be updated based on the checkpoints whose updated occurrence probability is higher than a set probability, and their relevant information.
In addition, the present application also provides an apparatus for checking cluster health status, comprising:
a processor;
and a memory arranged to store computer-executable instructions which, when executed, cause the processor to:
obtain relevant information of a cluster to be checked;
obtain at least one problem to be checked and its corresponding check rule;
based on the relevant information of the cluster, obtain from the cluster the monitoring data of the checkpoints related to the check rule, and aggregate the monitoring data to obtain a processing result;
retrieve the corresponding problem based on the processing result, and generate and feed back health early-warning information based on the relevant information of the problem.
Compared with the prior art, the method and apparatus for checking cluster health status provided by the embodiments of the present application obtain relevant information of a cluster to be checked; obtain at least one problem to be checked and its corresponding check rule; based on the relevant information of the cluster, obtain from the cluster the monitoring data of the checkpoints related to the check rule, and aggregate the monitoring data to obtain a processing result; retrieve the corresponding problem based on the processing result, and generate and feed back health early-warning information based on the relevant information of the problem. Because, before the health status of the online distributed file system is predicted, the problems that may become abnormal on the online distributed file system are turned into rules as far as possible so as to obtain the check rules corresponding to those problems to be checked, the monitoring data of each checkpoint can be obtained directly when the health status of the online distributed file system is predicted, and the monitoring data of the checkpoints is aggregated according to the check rules to obtain the processing result. This improves the accuracy of health-status monitoring of the multiple checkpoints under each cluster node. The corresponding problem is then retrieved based on the processing result, and health early-warning information is generated and fed back based on the relevant information of the problem, so that maintenance personnel can, based on the fed-back health early-warning information, give early warning for each problematic checkpoint under each cluster node and handle the relevant health warning information. This improves the real-time performance of multi-checkpoint monitoring of the online distributed file system and achieves the purpose of multi-point early alarming. Further, aggregating the monitoring data to obtain a processing result comprises: based on the check rule corresponding to the problem to be checked, processing the monitoring data of each checkpoint separately, so as to obtain at least one checkpoint whose monitoring data is abnormal, and feeding back the processing result. This realizes monitoring of the health status of the multiple checkpoints corresponding to a problem when the problem occurs, and improves the accuracy of predicting the health status of each checkpoint corresponding to the problem in the cluster.
Further, the method and apparatus for checking cluster health status provided by the embodiments of the present application also create a problem rule base comprising at least one problem and its corresponding check rule, and update the problems and their corresponding check rules in the problem rule base. This ensures that check rules are created for each checkpoint that is most likely to develop problems in the online distributed file system, and that the problems and their corresponding check rules in the problem rule base are updated based on the monitoring data of each checkpoint, so that the created problem rule base reflects the abnormal checkpoints in the distributed file system more completely and more accurately. It also realizes monitoring of the health status of the multiple checkpoints corresponding to a problem when the problem occurs, and improves the accuracy and real-time performance of predicting the health status of each checkpoint corresponding to the problem in the cluster.
Further, updating the problems and their corresponding check rules in the problem rule base comprises: obtaining relevant information of the cluster to be checked, a problem to be updated and its initial monitoring threshold; based on the initial monitoring threshold, obtaining from the relevant information of the cluster the monitoring data of all checkpoints within a set time period before the occurrence time point of the problem to be updated and at the occurrence time point, and determining and recording the abnormal checkpoints based on the monitoring data; each time the problem to be updated occurs within a set time period, updating the occurrence probability of each checkpoint when the problem to be updated occurs, based on the abnormal checkpoints recorded in the current set time period and the abnormal checkpoints recorded historically; and updating the check rule of the problem to be updated based on the checkpoints whose updated occurrence probability is higher than a set probability, and their relevant information. In this way, the monitoring data of all checkpoints within the set time period before the occurrence time point of the problem to be updated is pre-judged against the initial monitoring threshold; each time the problem to be updated occurs within a set time period, the occurrence probability of each checkpoint when the problem to be updated occurs is updated based on the abnormal checkpoints recorded in the current set time period and the abnormal checkpoints recorded historically; and the check rule of the problem to be updated is rebuilt from the checkpoints whose updated occurrence probability exceeds the set probability and their relevant information. The problem rule base is thus updated by updating the check rules of the problems to be updated, so that it reflects the abnormal checkpoints in the distributed file system more completely and more accurately, realizes monitoring of the health status of the multiple checkpoints corresponding to a problem when the problem occurs, and improves the accuracy and real-time performance of predicting the health status of each checkpoint corresponding to the problem in the cluster.
Brief description of the drawings
Other features, objects and advantages of the present application will become more apparent by reading the detailed description of the non-limiting embodiments made with reference to the following drawings:
Fig. 1 shows a schematic flowchart of a method for checking cluster health status according to one aspect of the present application;
Fig. 2 shows a schematic flowchart of creating a problem rule base in a method for checking cluster health status according to another aspect of the present application;
Fig. 3 shows a schematic flowchart of step S16 of creating the problem rule base in a method for checking cluster health status provided in an embodiment of the present application;
Fig. 4 shows a schematic flowchart of step S16 of creating the problem rule base in a method for checking cluster health status provided in another embodiment of the present application;
Fig. 5 shows a schematic structural diagram of an apparatus for checking cluster health status according to one aspect of the present application;
Fig. 6 shows a schematic structural diagram of creating a problem rule base in an apparatus for checking cluster health status according to another aspect of the present application;
Fig. 7 shows a schematic structural diagram of the rule updating device 16 in an apparatus for checking cluster health status provided in an embodiment of the present application;
Fig. 8 shows a schematic structural diagram of the rule updating device 16 in an apparatus for checking cluster health status provided in another embodiment of the present application.
The same or similar reference numerals in the drawings denote the same or similar parts.
Detailed description of the embodiments
The present application is described in further detail below with reference to the accompanying drawings.
Fig. 1 shows a schematic flowchart of a method for checking cluster health status according to one aspect of the present application. The method comprises step S11, step S12, step S13 and step S14.
In step S11, relevant information of a cluster to be checked is obtained; in step S12, at least one problem to be checked and its corresponding check rule are obtained; in step S13, based on the relevant information of the cluster, the monitoring data of the checkpoints related to the check rule is obtained from the cluster, and the monitoring data is aggregated to obtain a processing result; in step S14, the corresponding problem is retrieved based on the processing result, and health early-warning information is generated and fed back based on the relevant information of the problem.
In the embodiments of the present application, the cluster to be checked in step S11 is one or more cluster nodes of a distributed file system, where a distributed file system is a file system whose managed physical storage resources are not necessarily attached to the local node but are connected to the nodes through a computer network. In the following, the present application is explained in detail through specific embodiments that take a distributed file system as an example. Of course, the distributed file system is used here only by way of example; the embodiments of the present application are not limited to it, and the following embodiments can equally be realized in other distributed cluster systems.
Further, a checkpoint includes at least any one of the following: a hardware device in the cluster, a local module of a software device in the cluster.
It should be noted that the checkpoints in step S13 may include, but are not limited to, the hardware devices of the individual servers under each cluster node of the distributed file system and the local modules of the software devices in the distributed file system. The hardware devices of a server include the central processing unit, memory, hard disk, chipset, input/output bus, input/output devices, power supply, chassis and so on; the local modules of a software device include, but are not limited to, the system setup program module, the fault diagnosis program module, the exception handling program module and so on. Of course, other checkpoints that exist now or may appear in the future, if applicable to the present application, should also be included within the protection scope of the present application and are incorporated herein by reference.
Further, step S11 comprises: obtaining relevant information of the cluster to be checked. Specifically, step S11 comprises: obtaining the relevant information of the cluster to be checked based on a request submitted by a user, where the relevant information includes cluster position information and a check time period.
In the embodiments of the present application, when the health status of the online distributed file system needs to be monitored, the request submitted by the user is obtained, and the cluster position information of the cluster to be checked and the check time period for monitoring the cluster to be checked are obtained based on the request, where the cluster position information and the check time period belong to the relevant information of the cluster to be checked.
For example, when the health status of the online distributed file system needs to be monitored, a request submitted by the user for monitoring each checkpoint in the cluster is obtained; based on the request, the cluster position information of the corresponding cluster to be checked and the one or more check time periods for which the monitoring data of the multiple checkpoints is to be obtained are determined. The cluster position information may cover the actual geographic area of cluster nodes distributed across different regions, or the actual geographic area of cluster nodes within the same region.
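As a purely illustrative sketch, and under the assumption of field names and values that the application does not prescribe, such a request might carry the relevant information as follows:

check_request = {
    "cluster_position": "Shanghai",                                # cluster position information
    "check_periods": [("2017-03-01 00:00", "2017-03-08 00:00")],  # check time period(s)
}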
Further, step S12 comprises: obtaining at least one problem to be checked and its corresponding check rule. Specifically, step S12 comprises: obtaining the at least one problem to be checked and its corresponding check rule from a problem rule base.
It should be noted that the problem rule base in step S12 mainly comprises the problems that have already been established and the multiple check rules corresponding to them. The problems include memory leaks, read/write long tails, data loss, system performance problems, system availability problems, quality-of-service problems and so on; a check rule includes checkpoints and the abnormal thresholds of their corresponding monitoring data. Of course, other problem rule bases that exist now or may appear in the future, if applicable to the present application, should also be included within the protection scope of the present application and are incorporated herein by reference.
For example, if a problem in the problem rule base is a memory leak, its corresponding check rule includes: the checkpoint "rate of change of traffic pressure over the last week" and its abnormal threshold, the checkpoint "total number of created files" and its abnormal threshold, and the checkpoint "memory usage growth slope" and its abnormal threshold. If a problem in the problem rule base is a read/write long tail, its corresponding check rule includes: the checkpoint "read/write call frequency over the last week" and its abnormal threshold, the checkpoint "network retransmission rate in the cluster" and its abnormal threshold, and the checkpoint "disk health score in the cluster" and its abnormal threshold.
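A minimal sketch of how such a problem rule base could be represented is given below; the dictionary layout and checkpoint names are assumptions, and the numeric thresholds are taken from the worked example later in this description:

problem_rule_base = {
    "memory_leak": [                       # problem: memory leak
        ("traffic_pressure_change_rate_last_week", 0.14),
        ("created_file_total", 30),
        ("memory_usage_growth_slope", 0.20),
    ],
    "read_write_long_tail": [              # problem: read/write long tail
        ("read_write_call_frequency_last_week", 0.30),
        ("cluster_network_retransmission_rate", 0.10),
        ("cluster_disk_health_score", 60),
    ],
}
# Each entry pairs a checkpoint with the abnormal threshold of its monitoring data.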
Further, in step S13, obtaining from the cluster the monitoring data of the checkpoints related to the check rule comprises: finding the cluster based on the cluster position information, and obtaining from the cluster the checkpoints related to the check rule; and obtaining, from the monitoring module of the cluster, the monitoring data of the related checkpoints within the check time period.
It should be noted that the monitoring module of the cluster is mainly responsible for collecting, from the monitoring system in the cluster, the monitoring data of each checkpoint related to each hardware device and software device. Of course, other cluster monitoring modules that exist now or may appear in the future, if applicable to the present application, should also be included within the protection scope of the present application and are incorporated herein by reference.
In the above embodiment of the present application, if the cluster position information in step S13 is the geographical location information of Shanghai, the Shanghai cluster is found based on that information, and the checkpoints related to the check rules are obtained from the Shanghai cluster; then the monitoring data of the related checkpoints within the check time period is obtained from the monitoring module of the Shanghai cluster. Suppose the monitoring data obtained is: total number of created files, 34; memory usage growth slope, 48%; rate of change of traffic pressure over the last week, 1%; read/write call frequency over the last week, 75.6%; network retransmission rate in the cluster, 5.3%; disk health score in the cluster, 15.
Further, in step S13, aggregating the monitoring data to obtain a processing result comprises: based on the check rule corresponding to the problem to be checked, processing the monitoring data of each checkpoint separately, so as to obtain at least one checkpoint whose monitoring data is abnormal, and feeding back the processing result.
In the above embodiment of the present application, in step S13, the monitoring data of the multiple checkpoints can be compared separately, based on the check rule corresponding to the problem to be checked, to judge whether the problem to be checked exists. To pre-judge whether the online distributed file system has a memory-leak problem, the monitoring data of the three checkpoints "rate of change of traffic pressure over the last week", "total number of created files" and "memory usage growth slope" is matched against the corresponding check rule to obtain a processing result; to pre-judge whether the online distributed file system has a read/write long-tail problem, the monitoring data of the three checkpoints "read/write call frequency over the last week", "network retransmission rate in the cluster" and "disk health score in the cluster" is matched against the corresponding check rule to obtain a processing result.
For example, suppose that, in the check rule of the memory-leak problem to be checked, the abnormal threshold of the checkpoint "total number of created files" is 30; since the monitoring data of that checkpoint is 34 and exceeds the abnormal threshold 30, that checkpoint is abnormal. The abnormal threshold of the checkpoint "memory usage growth slope" is 20%; since its monitoring data is 48% and exceeds 20%, that checkpoint is abnormal. The abnormal threshold of the checkpoint "rate of change of traffic pressure over the last week" is 14%; since its monitoring data is 1% and is below 14%, that checkpoint is normal. Suppose further that, in the check rule of the read/write long-tail problem to be checked, the abnormal threshold of the checkpoint "read/write call frequency over the last week" is 30%; since its monitoring data is 75.6% and exceeds 30%, that checkpoint is abnormal. The abnormal threshold of the checkpoint "network retransmission rate in the cluster" is 10%; since its monitoring data is 5.3% and is below 10%, that checkpoint is normal. The abnormal threshold of the checkpoint "disk health score in the cluster" is 60; since its monitoring data is 15 and is below 60, that checkpoint is normal. The processing result obtained is therefore: for the memory-leak problem to be checked, the checkpoints "total number of created files" and "memory usage growth slope" in the corresponding check rule are abnormal; for the read/write long-tail problem to be checked, the checkpoint "read/write call frequency over the last week" in the corresponding check rule is abnormal.
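Using only the numbers from the preceding example, a minimal sketch of this matching step could look as follows; the checkpoint names and data shapes are assumptions made for illustration:

rules = {
    "memory_leak": [("created_file_total", 30), ("memory_usage_growth_slope", 0.20),
                    ("traffic_pressure_change_rate_last_week", 0.14)],
    "read_write_long_tail": [("read_write_call_frequency_last_week", 0.30),
                             ("cluster_network_retransmission_rate", 0.10),
                             ("cluster_disk_health_score", 60)],
}
monitoring = {
    "created_file_total": 34,
    "memory_usage_growth_slope": 0.48,
    "traffic_pressure_change_rate_last_week": 0.01,
    "read_write_call_frequency_last_week": 0.756,
    "cluster_network_retransmission_rate": 0.053,
    "cluster_disk_health_score": 15,
}

def match_rule(rule, monitoring):
    # A checkpoint is abnormal when its monitoring data exceeds its abnormal threshold.
    return [(cp, monitoring[cp]) for cp, threshold in rule if monitoring[cp] > threshold]

processing_result = {problem: match_rule(rule, monitoring) for problem, rule in rules.items()}
# -> {"memory_leak": [("created_file_total", 34), ("memory_usage_growth_slope", 0.48)],
#     "read_write_long_tail": [("read_write_call_frequency_last_week", 0.756)]}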
In the above embodiment of the present application, after the monitoring data of each checkpoint is processed separately in step S13 based on the check rule corresponding to the problem to be checked, the corresponding processing result is obtained. Then, in step S14, the corresponding problem is retrieved based on the processing result. Since the processing result is that, for the memory-leak problem to be checked, the checkpoints "total number of created files" and "memory usage growth slope" in the corresponding check rule are abnormal, and for the read/write long-tail problem to be checked, the checkpoint "read/write call frequency over the last week" in the corresponding check rule is abnormal, the corresponding problems retrieved are memory leak and read/write long tail. In step S14, health early-warning information is then generated and fed back based on the relevant information of the problems.
Further, the relevant information of a problem includes at least any one of the following: the occurrence time of the problem, the monitoring data of each related checkpoint when the problem occurs, and the checkpoints whose monitoring data is abnormal.
Following the above embodiment, the health early-warning information generated based on the relevant information of the problems therefore includes the problems, their corresponding occurrence times, and each checkpoint whose monitoring data is abnormal when the problem occurs, together with its monitoring data.
For example, based on the processing result (for the memory-leak problem to be checked, the checkpoints "total number of created files" and "memory usage growth slope" in the corresponding check rule are abnormal; for the read/write long-tail problem to be checked, the checkpoint "read/write call frequency over the last week" in the corresponding check rule is abnormal), the corresponding problems retrieved are memory leak and read/write long tail; and, based on the relevant information of the problems, the health early-warning information is generated and fed back according to the early-warning report template of the distributed file system. The generated health early-warning information is {{memory leak: at t1 the total number of created files, 34, is abnormal; at t2 the memory usage growth slope, 48%, is abnormal}; {read/write long tail: at t3 the read/write call frequency over the last week, 75.6%, is abnormal}}, which is fed back to system maintenance personnel, so that they can, based on the fed-back health early-warning information, give early warning for each problematic checkpoint of the cluster and handle the relevant health warning information. This improves the real-time performance of multi-checkpoint monitoring of the online distributed file system, achieves the purpose of giving early warning for multiple checkpoints in the cluster and handling the health early-warning information in advance, and improves the accuracy of predicting the health status of each checkpoint corresponding to a problem in the cluster.
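The following sketch illustrates, under assumed names and report fields (the application's early-warning report template is not specified here), how such health early-warning information might be assembled from a processing result:

def build_health_warning(processing_result, occurrence_times):
    # processing_result: {problem: [(checkpoint, value), ...]}, abnormal checkpoints only.
    # occurrence_times: {(problem, checkpoint): time}; both shapes are assumptions.
    report = {}
    for problem, abnormal in processing_result.items():
        if not abnormal:
            continue  # no abnormal checkpoint for this problem
        report[problem] = [{"time": occurrence_times.get((problem, checkpoint)),
                            "checkpoint": checkpoint,
                            "monitoring_data": value}
                           for checkpoint, value in abnormal]
    # An empty report means no checkpoint exceeded its threshold: the system is healthy.
    return report if report else {"status": "healthy"}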
In step S14, if in all the processing results the monitoring data of every checkpoint does not exceed its abnormal threshold, health status information is generated instead, so that the maintenance personnel of the distributed file system know that the entire distributed file system is healthy and no health early-warning handling is needed.
In the embodiments of the present application, while the problem rule base is used to aggregate the monitoring data of the checkpoints of each cluster in the distributed system, the problem rule base also needs to be created and continuously updated, as shown in Fig. 2.
Fig. 2 shows a schematic flowchart of creating a problem rule base in a method for checking cluster health status according to another aspect of the present application. The method comprises step S15 and step S16.
Step S15 comprises: creating a problem rule base, the problem rule base comprising at least one problem and its corresponding check rule; step S16 comprises: updating the problems and their corresponding check rules in the problem rule base.
In the embodiments of the present application, before the monitoring data of each checkpoint in the distributed file system is aggregated, the problem rule base needs to be created, where the problem rule base includes at least one problem and the check rule corresponding to each problem, and a check rule includes at least one checkpoint and the abnormal threshold of each checkpoint. That is, before a problem occurs, multiple checkpoints have already become abnormal, and the corresponding problem is predicted based on those abnormal checkpoints.
For example, the problem rule base includes problem 1, problem 2 and problem 3, where the check rule corresponding to problem 1 is {problem 1: the abnormal threshold of checkpoint A is A1, the abnormal threshold of checkpoint B is B1, and the abnormal threshold of checkpoint C is C1}; the check rule corresponding to problem 2 is {problem 2: the abnormal threshold of checkpoint D is D1, the abnormal threshold of checkpoint E is E1, and the abnormal threshold of checkpoint F is F1}; and the check rule corresponding to problem 3 is {problem 3: the abnormal threshold of checkpoint G is G1 and the abnormal threshold of checkpoint H is H1}.
As the mass data of users explodes, the scale of the distributed file system keeps growing. When the health status of a distributed file system of ever-growing scale is pre-judged, multiple checkpoints actually become abnormal in advance before a problem occurs. It is therefore necessary to iterate over the abnormal monitoring data of each checkpoint within the fixed time period before the problem occurs, in order to find the check rule that best reflects the problem when it becomes abnormal, as shown in Fig. 3.
Fig. 3 shows a schematic flowchart of step S16 of creating the problem rule base in a method for checking cluster health status provided in an embodiment of the present application. The method comprises step S161, step S162, step S163 and step S164.
In step S161, relevant information of the cluster to be checked, a problem to be updated and its initial monitoring threshold are obtained; in step S162, based on the initial monitoring threshold, the monitoring data of all checkpoints within the set time period before the occurrence time point of the problem to be updated and at the occurrence time point is obtained from the relevant information of the cluster, and the abnormal checkpoints are determined and recorded based on the monitoring data; in step S163, each time the problem to be updated occurs within a set time period, the occurrence probability of each checkpoint when the problem to be updated occurs is updated based on the abnormal checkpoints recorded in the current set time period and the abnormal checkpoints recorded historically; in step S164, the check rule of the problem to be updated is updated based on the checkpoints whose updated occurrence probability is higher than a set probability, and their relevant information.
In the embodiments of the present application, when a problem in the problem rule base needs to be updated, first, in step S161, the cluster position information and check time period of the cluster to be checked, the problem to be updated that is being trained, and its corresponding initial monitoring threshold are obtained. Then, in step S162, based on the initial monitoring threshold, the monitoring data of all checkpoints at the occurrence time point of the problem to be updated within the check time period, and within the set time period before that occurrence time point, is obtained from the cluster corresponding to the cluster position information, and the checkpoints whose monitoring data is abnormal are recorded. Then, in step S163, each time the problem to be updated occurs within a set time period, the occurrence probability of each checkpoint when the problem to be updated occurs is updated based on the abnormal checkpoints recorded in the current set time period and the abnormal checkpoints recorded historically. Finally, in step S164, the check rule of the problem to be updated is updated based on the checkpoints whose updated occurrence probability is higher than the set probability, and their relevant information. The problem rule base is thus updated by updating the check rule of the problem to be updated, so that it reflects the abnormal checkpoints in the distributed file system more completely and more accurately, realizes monitoring of the health status of the multiple checkpoints corresponding to a problem when the problem occurs, and improves the accuracy and real-time performance of predicting the health status of each checkpoint corresponding to the problem in the cluster.
Further, the initial monitoring threshold includes: the abnormal thresholds of the monitoring data of all checkpoints and the weight threshold of the abnormal checkpoints. Step S162 comprises: based on the initial monitoring threshold, obtaining from the relevant information of the cluster the monitoring data of all checkpoints within the set time period before the occurrence time point of the problem to be updated and at the occurrence time point, and determining and recording the abnormal checkpoints based on the monitoring data. Specifically, step S162 comprises: based on the abnormal thresholds of the monitoring data of all checkpoints, obtaining from the relevant information of the cluster the monitoring data of all checkpoints within the set time period before the occurrence time point of the problem to be updated and at the occurrence time point, and recording the corresponding abnormal checkpoints whose weight exceeds the weight threshold, where the weight of a checkpoint is determined based on the occurrence probability of the abnormal checkpoint.
It should be noted that the occurrence probability and weight of a checkpoint when the problem to be updated occurs are computed as follows. If the problem to be updated, problem 1, occurs 1000 times within the set time period, and checkpoint A is abnormal 654 times, checkpoint B 252 times and checkpoint C 94 times, then the occurrence probability of checkpoint A is 65.4%, that of checkpoint B is 25.2% and that of checkpoint C is 9.4%; the weight of checkpoint A is 65.4% / (65.4% + 25.2% + 9.4%) = 65.4%, the weight of checkpoint B is 25.2% / (65.4% + 25.2% + 9.4%) = 25.2%, and the weight of checkpoint C is 9.4% / (65.4% + 25.2% + 9.4%) = 9.4%. Of course, other methods for computing the occurrence probability and weight of a checkpoint that exist now or may appear in the future, if applicable to the present application, should also be included within the protection scope of the present application and are incorporated herein by reference.
Preferably, in step S162, determining and recording the abnormal checkpoints based on the monitoring data comprises: judging whether the monitoring data of a checkpoint exceeds its abnormal threshold; if so, determining and recording the corresponding abnormal checkpoint.
For example, the check rule corresponding to the problem to be updated, problem 1, is obtained from the problem database: {problem 1: the abnormal threshold of checkpoint A is A1, the abnormal threshold of checkpoint B is B1 and the abnormal threshold of checkpoint C is C1}, and the weight threshold of the checkpoints is 10%. Based on the cluster position information to be checked and the check time period, the monitoring data of all checkpoints in the two set time periods (t + Δt) and (t + 2Δt) before the occurrence time point t of problem 1 to be updated is obtained, and the abnormal checkpoints are recorded according to whether their monitoring data exceeds the abnormal thresholds. Suppose that in the set time period (t + Δt) before the occurrence time point t, the checkpoints whose monitoring data exceeds the corresponding abnormal thresholds are checkpoint A, checkpoint B and checkpoint C, where within (t + Δt) the occurrence probability of checkpoint A is 65.4%, that of checkpoint B is 25.2% and that of checkpoint C is 9.4%. The weights are computed from the occurrence probabilities: the weight of a checkpoint is the ratio of its occurrence probability to the sum of the occurrence probabilities of all checkpoints, so within (t + Δt) the weight of checkpoint A is 65.4%, that of checkpoint B is 25.2% and that of checkpoint C is 9.4%. Since the weight threshold of the checkpoints is 10%, the abnormal checkpoints recorded in (t + Δt) whose weight exceeds the weight threshold are checkpoint A with weight 65.4% and checkpoint B with weight 25.2%. Suppose further that in the set time period (t + 2Δt) before the occurrence time point t, the checkpoints whose monitoring data exceeds the corresponding abnormal thresholds are checkpoint A, checkpoint B and checkpoint D, where within (t + 2Δt) the occurrence probability of checkpoint A is 50.5%, that of checkpoint B is 1.4% and that of checkpoint D is 48.1%; the weights computed from these probabilities are 50.5% for checkpoint A, 1.4% for checkpoint B and 48.1% for checkpoint D. Since the weight threshold is 10%, the abnormal checkpoints recorded in (t + 2Δt) whose weight exceeds the weight threshold are checkpoint A with weight 50.5% and checkpoint D with weight 48.1%.
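A minimal sketch of this per-period computation, reusing the worked numbers above, is given below; the function name, the data shapes and the use of per-checkpoint abnormality counts are assumptions introduced for illustration:

def record_weighted_checkpoints(abnormal_counts, problem_occurrences, weight_threshold):
    # abnormal_counts: {checkpoint: times the checkpoint was abnormal when the problem occurred}.
    probability = {cp: n / problem_occurrences for cp, n in abnormal_counts.items()}
    prob_sum = sum(probability.values())
    # Weight = occurrence probability / sum of the occurrence probabilities of all checkpoints.
    weight = {cp: p / prob_sum for cp, p in probability.items()}
    # Record only the abnormal checkpoints whose weight exceeds the weight threshold.
    return {cp: w for cp, w in weight.items() if w > weight_threshold}

# With the numbers above: problem 1 occurs 1000 times; A abnormal 654 times, B 252, C 94.
recorded = record_weighted_checkpoints({"A": 654, "B": 252, "C": 94}, 1000, 0.10)
# -> {"A": 0.654, "B": 0.252}; checkpoint C (weight 9.4%) falls below the 10% weight threshold.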
Further, step S163 comprises: each time the problem to be updated occurs within a set time period, updating the occurrence probability of each checkpoint when the problem to be updated occurs, based on the abnormal checkpoints recorded in the current set time period and the abnormal checkpoints recorded historically. Specifically, step S163 comprises: each time the problem to be updated occurs within a set time period, determining the current weight of each checkpoint in the current set time period based on the occurrence probabilities of the abnormal checkpoints recorded in the current set time period; and updating the occurrence probability of each checkpoint when the problem to be updated occurs, based on the current weight of the checkpoint and the historical weights of the abnormal checkpoints recorded historically.
Following the above embodiment of the present application, suppose that for problem 1 to be updated, which becomes abnormal at time t, the abnormal checkpoints recorded in the current set time period (t + Δt) whose weight exceeds the weight threshold are checkpoint A with current weight 65.4% and checkpoint B with current weight 25.2%, where the current weight of a checkpoint is determined from its occurrence probability; and that the abnormal checkpoints recorded in the historical set time period (t + 2Δt) whose weight exceeds the weight threshold are checkpoint A with historical weight 50.5% and checkpoint D with historical weight 48.1%. Then, based on the current weight and historical weight of each checkpoint of problem 1 to be updated, the occurrence probability of each checkpoint when the problem to be updated occurs is updated, i.e., the comprehensive weight of each checkpoint when the problem to be updated occurs is updated, where the comprehensive weight of a checkpoint is the average of its current weight and its historical weight. The comprehensive weight of checkpoint A for the problem to be updated is therefore (65.4% + 50.5%) / 2 = 57.95%, the comprehensive weight of checkpoint B is (25.2% + 1.4%) / 2 = 13.3%, and the comprehensive weight of checkpoint D is (0 + 48.1%) / 2 = 24.05%. Thus, based on the current weights of the checkpoints and the historical weights of the checkpoints recorded historically, the occurrence probability of each checkpoint when problem 1 to be updated occurs is updated: the updated occurrence probability of checkpoint A is 57.95%, that of checkpoint B is 13.3% and that of checkpoint D is 24.05%.
Following the above embodiment of the present application, the set probability in step S164 is numerically equal to the weight threshold of the checkpoints, i.e., the set probability is 10%. Since the updated occurrence probability of checkpoint A, 57.95%, is higher than the set probability 10%, the updated occurrence probability of checkpoint B, 13.3%, is higher than the set probability 10%, and the updated occurrence probability of checkpoint D, 24.05%, is higher than the set probability 10%, checkpoint C is discarded from the check rule of problem 1 to be updated in the problem rule base, checkpoint D and its corresponding abnormal threshold are added to the check rule of the problem to be updated in the problem rule base, and the check rule of the problem to be updated is updated based on checkpoint A, checkpoint B and checkpoint D, whose updated occurrence probabilities are higher than the set probability, and their relevant information.
Further, the relevant information of a checkpoint includes at least any one of the following: the abnormal threshold of the monitoring data of the checkpoint and the weight of the checkpoint, where the weight of the checkpoint is determined based on its occurrence probability.
Following the above embodiment of the present application, checkpoint A with the abnormal threshold A1 of its monitoring data and weight 57.95%, checkpoint B with the abnormal threshold B1 of its monitoring data and weight 13.3%, and checkpoint D with the abnormal threshold D1 of its monitoring data and weight 24.05% are used to update the check rule of problem 1 to be updated.
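The following sketch illustrates the rule update with the example weights above, under the assumptions that the comprehensive weight is the average of the current and historical weights and that the placeholder thresholds A1/B1/C1/D1 stand in for real values:

def update_rule(current_weights, history_weights, abnormal_thresholds, set_probability):
    checkpoints = set(current_weights) | set(history_weights)
    # Comprehensive weight = average of the current weight and the historical weight.
    comprehensive = {cp: (current_weights.get(cp, 0.0) + history_weights.get(cp, 0.0)) / 2
                     for cp in checkpoints}
    # Only checkpoints whose updated occurrence probability exceeds the set probability remain.
    return {cp: {"threshold": abnormal_thresholds.get(cp), "weight": w}
            for cp, w in comprehensive.items() if w > set_probability}

new_rule = update_rule(
    current_weights={"A": 0.654, "B": 0.252},
    history_weights={"A": 0.505, "B": 0.014, "D": 0.481},
    abnormal_thresholds={"A": "A1", "B": "B1", "C": "C1", "D": "D1"},
    set_probability=0.10,
)
# -> A: 57.95%, B: 13.3%, D: 24.05%; checkpoint C drops out of the updated check rule.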
With carrying out constantly health examination and obtaining healthy warning information to the distributed file system, and it is based on institute State healthy early warning information carry out advanced processing during, user can get more than one to the distributed file system Inspection result information after being handled, multiple checkpoints are actually had before a problem occurs, and appearance is abnormal in advance, Then need the inspection result information based on acquisition, each checkpoint in the fixed time period before being occurred according to described problem Abnormal monitoring data is iterated calculating, can most react corresponding inspection rule when described problem occurs abnormal to find, such as Shown in Fig. 4.
Fig. 4 is shown according to right in a kind of method for checking cluster health status provided in the application another embodiment That answers creates step S16 corresponding to problem base rule method flow schematic diagram.This method includes:Step S165, step S166, Step S167 and step S168.
Wherein, the step S165 obtains problem to be updated, and obtains institute from least one inspection result information State the time of occurrence point of problem to be updated;All institutes in setting time section before the step S166 acquisitions time of occurrence point The monitoring data of checkpoint is stated, the abnormal checkpoint is determined and recorded based on the monitoring data;The step S167 bases In the abnormal checkpoint recorded in presently described setting time section and the abnormal checkpoint of historical record, more Probability of occurrence of the new each checkpoint when the problem to be updated occurs;The step S168 is based on described in after renewal The checkpoint of the probability of occurrence higher than setting probability and its relevant information, update the inspection rule of the problem to be updated.
It should be noted that the inspection result information is the result information, related to the health early warning information, that is obtained while checking the distributed file system. The inspection result information includes at least any one of the following: the problem in which an abnormality occurs, the time of occurrence point of the problem, and the checkpoints that become abnormal when the problem occurs together with their outlier thresholds. Of course, other inspection result information that exists now or may appear in the future, if applicable to the application, should also be included within the protection scope of the application and is incorporated herein by reference.
In the embodiment of the application, when a problem in the problem rule base needs to be updated, first the step S165 obtains the problem to be updated and obtains the time of occurrence point of the problem to be updated from at least one piece of inspection result information; then the step S166 obtains the monitoring data of all the checkpoints in the setting time section before the time of occurrence point, and determines and records the abnormal checkpoints based on the monitoring data of the checkpoints and their outlier thresholds; next, the step S167 updates the probability of occurrence of each checkpoint when the problem to be updated occurs, based on the abnormal checkpoints recorded in the present setting time section and the abnormal checkpoints of the historical record; finally, the step S168 updates the inspection rule of the problem to be updated based on the checkpoints whose updated probabilities of occurrence are higher than the setting probability and their relevant information. The problem rule base is thus updated by updating the inspection rule of the problem to be updated with the at least one piece of acquired inspection result information, so that the problem rule base can reflect the abnormal checkpoints in the distributed file system more comprehensively and more accurately, the health status of the multiple checkpoints involved when the problem occurs is monitored, and the accuracy and real-time performance of the health status prediction for each checkpoint corresponding to the problem in the cluster are improved.
Fig. 5 shows a schematic structural diagram of an equipment for checking cluster health status according to one aspect of the application. The equipment includes an information acquisition device 11, a Rule device 12, a monitoring processing unit 13 and an early warning feedback device 14.
Wherein, the information acquisition device 11 obtains the relevant information of the cluster to be checked; the Rule device 12 obtains at least one problem to be checked and its corresponding inspection rule; the monitoring processing unit 13 obtains, based on the relevant information of the cluster, the monitoring data of the checkpoints related to the inspection rule from the cluster, and performs aggregation processing on the monitoring data to obtain a processing result; the early warning feedback device 14 transfers the corresponding problem based on the processing result, and generates and feeds back health early warning information based on the relevant information of the problem.
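For illustration only, the following is a minimal Python sketch of how the four devices could cooperate. The class and parameter names (`HealthChecker`, `monitor`, the rule-base layout) are editorial assumptions, not the patent's implementation, and the "aggregation" is reduced to a simple threshold comparison.

```python
from typing import Callable, Dict, List


class HealthChecker:
    """Sketch of the pipeline: rule lookup, monitoring-data aggregation, early warning."""

    def __init__(self, rule_base: Dict[str, Dict[str, float]],
                 monitor: Callable[[str, str], float]):
        self.rule_base = rule_base    # problem -> {checkpoint: outlier threshold}
        self.monitor = monitor        # (cluster, checkpoint) -> latest monitoring value

    def check(self, cluster: str) -> List[str]:
        warnings = []
        for problem, rule in self.rule_base.items():          # Rule device 12
            abnormal = []
            for checkpoint, threshold in rule.items():        # monitoring processing unit 13
                value = self.monitor(cluster, checkpoint)
                if value > threshold:                         # monitoring data exceeds outlier threshold
                    abnormal.append((checkpoint, value))
            if abnormal:                                      # early warning feedback device 14
                warnings.append(f"{problem}: abnormal checkpoints {abnormal}")
        return warnings
```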
Here, the equipment 1 includes, but is not limited to, a user equipment, or an equipment formed by integrating a user equipment with a network equipment through a network. The user equipment includes, but is not limited to, any mobile electronic product that can interact with the user through a touch pad, such as a smart phone or a PDA, and the mobile electronic product may use any operating system, such as the Android operating system or the iOS operating system. The network equipment includes an electronic device that can automatically perform numerical computation and information processing according to instructions set or stored in advance, and its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like. The network includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a VPN, a wireless ad hoc network, and the like. Preferably, the equipment 1 may also be a shell script running on an equipment formed by integrating the user equipment with the network equipment through a network. Of course, those skilled in the art will understand that the above equipment is only an example, and other equipment that exists now or may appear in the future, if applicable to the application, should also be included within the protection scope of the application and is incorporated herein by reference.
The above devices work with one another continuously. Here, those skilled in the art should understand that "continuously" means that the above devices operate respectively in real time, or according to set or real-time-adjusted working modes.
In the embodiment of the application, the cluster to be checked in the information acquisition device 11 is located on one or more clustered nodes of a distributed file system, where a distributed file system means that the physical storage resources managed by the file system are not necessarily attached directly to the local node but are connected to the nodes through a computer network. The application is explained in detail below with specific embodiments that take a distributed file system as an example. Of course, this is only for illustration; the embodiments of the application are not limited thereto, and the following embodiments can equally be realized in other distributed cluster systems.
Further, the checkpoint includes at least any one of the following: a hardware device in the cluster, a local module of a software equipment in the cluster.
It should be noted that the checkpoints in the monitoring processing unit 13 may include, but are not limited to, the hardware devices of the individual servers under each clustered node in the distributed file system and the local modules of the software equipment in the distributed file system. The hardware devices of a server include the central processing unit, memory, hard disk, chipset, input/output bus, input/output devices, power supply, chassis, and the like; the local modules of the software equipment include, but are not limited to, the system setting program module, the fault diagnosis program module, the exception handling module, and the like. Of course, other checkpoints that exist now or may appear in the future, if applicable to the application, should also be included within the protection scope of the application and are incorporated herein by reference.
Further, the information acquisition device 11 obtains the relevant information of the cluster to be checked based on a request submitted by a user, wherein the relevant information includes: cluster position information and a review time section.
In the embodiment of the application, when the health status of the distributed file system on line needs to be monitored, the request submitted by the user is obtained, and the cluster position information of the cluster to be checked and the review time section for monitoring the cluster to be checked are obtained based on the request, where the cluster position information and the review time section belong to the relevant information of the cluster to be checked.
For example, when the health status of the distributed file system on line needs to be monitored, a request submitted by the user for monitoring each checkpoint in the cluster is obtained; based on that request, the cluster position information of the corresponding cluster to be checked is obtained, together with one or more review time sections for which the monitoring data of the multiple checkpoints is to be collected. The cluster position information may be the actual geographic range over which clustered nodes of different regions are distributed, or the actual geographic range of the clustered nodes within a single region.
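As a small illustration of the relevant information carried by such a request, the following sketch uses a hypothetical `CheckRequest` structure; the field names and time representation are assumptions for readability only.

```python
from dataclasses import dataclass


@dataclass
class CheckRequest:
    # Relevant information of the cluster to be checked, as submitted by the user.
    cluster_position: str   # e.g. "Shanghai" - geographic range of the clustered nodes
    review_start: float     # start of the review time section (e.g. a unix timestamp)
    review_end: float       # end of the review time section
```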
Further, the Rule device 12 obtains at least one problem to be checked and its corresponding inspection rule from the problem rule base.
It should be noted that the problem rule base in the Rule device 12 mainly includes the problems that have already been established and their corresponding inspection rules. The problems include memory leak, read-write long tail, data loss, system performance problems, system availability problems, service quality problems, and the like; an inspection rule includes checkpoints and the outlier thresholds of their corresponding monitoring data. Of course, other problem rule bases that exist now or may appear in the future, if applicable to the application, should also be included within the protection scope of the application and are incorporated herein by reference.
For example, if the problem present in the problem rule base is a memory leak, its corresponding inspection rule includes: the checkpoint being the change rate of the traffic pressure over the last week and its corresponding outlier threshold, the checkpoint being the total number of created files and its corresponding outlier threshold, and the checkpoint being the memory usage growth slope and its corresponding outlier threshold; if the problem present in the problem rule base is a read-write long tail, its corresponding inspection rule includes: the checkpoint being the read-write calling frequency over the last week and its corresponding outlier threshold, the checkpoint being the retransmission rate of the network in the cluster and its corresponding outlier threshold, and the checkpoint being the disk health status score information in the cluster and its corresponding outlier threshold.
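A possible in-memory layout for such a rule base is sketched below. The structure is an editorial assumption; the threshold values are taken from the worked example later in this description and are not prescribed by the patent.

```python
# problem -> {checkpoint: outlier threshold of its monitoring data}
problem_rule_base = {
    "memory leak": {
        "change rate of traffic pressure over the last week": 0.14,
        "total number of created files": 30,
        "memory usage growth slope": 0.20,
    },
    "read-write long tail": {
        "read-write calling frequency over the last week": 0.30,
        "network retransmission rate in the cluster": 0.10,
        "disk health status score in the cluster": 60,
    },
}
```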
Further, the monitoring processing unit 13 includes: a searching unit (not shown) and a data capture unit (not shown), wherein the searching unit (not shown) is used for finding the cluster based on the cluster position information and obtaining, in the cluster, the checkpoints related to the inspection rule; the data capture unit (not shown) is used for obtaining, from the monitoring module of the cluster, the monitoring data of the related checkpoints in the review time section.
It should be noted that the monitoring module of the cluster is mainly responsible for collecting, from the monitoring system in the cluster, the monitoring data of each checkpoint related to each hardware device and software equipment. Of course, other monitoring modules of the cluster that exist now or may appear in the future, if applicable to the application, should also be included within the protection scope of the application and are incorporated herein by reference.
In the above-described embodiment of the application, if the cluster position information in the searching unit (not shown) is the Shanghai geographical location information, the Shanghai cluster is found based on the Shanghai geographical location information, and each checkpoint related to the inspection rule is obtained from the Shanghai cluster; the data capture unit (not shown) then obtains, from the monitoring module of the Shanghai cluster, the monitoring data of each related checkpoint in the review time section. The monitoring data obtained is: 34 for the checkpoint being the total number of created files, 48% for the checkpoint being the memory usage growth slope, 1% for the checkpoint being the change rate of the traffic pressure over the last week, 75.6% for the checkpoint being the read-write calling frequency over the last week, 5.3% for the checkpoint being the retransmission rate of the network in the cluster, and 100 for the checkpoint being the disk health status score information in the cluster.
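The lookup-then-fetch behaviour of the searching unit and the data capture unit could be sketched as follows. The `monitoring_module.read` call is an assumed interface used only to make the flow concrete; the patent does not define this API.

```python
def collect_monitoring_data(clusters, position_info, checkpoints, review_window):
    """Find the cluster by its position information and read the monitoring data
    of the rule-related checkpoints from its monitoring module (illustrative API)."""
    cluster = clusters[position_info]            # e.g. clusters["Shanghai"]
    return {
        cp: cluster.monitoring_module.read(cp, review_window)  # assumed read(checkpoint, window)
        for cp in checkpoints
    }
```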
The monitoring processing unit 13 includes: a data processing unit (not shown), wherein the data processing unit (not shown) is used for processing the monitoring data of each checkpoint respectively, based on the inspection rule corresponding to the problem to be checked, so as to obtain at least one checkpoint whose monitoring data becomes abnormal and to feed back the processing result.
In the above-described embodiment of the application, the data processing unit can judge whether the problem to be checked occurs by respectively comparing the monitoring data of the multiple checkpoints against the inspection rule corresponding to the problem to be checked. If it is to be predicted whether the distributed file system on line has a memory leak, the monitoring data corresponding to the three checkpoints, namely the change rate of the traffic pressure over the last week, the total number of created files and the memory usage growth slope, can be matched against the corresponding inspection rule respectively to obtain the processing result to be predicted; if it is to be predicted whether the distributed file system on line has a read-write long tail, the monitoring data corresponding to the three checkpoints, namely the read-write calling frequency over the last week, the retransmission rate of the network in the cluster and the disk health status score information in the cluster, can be matched against the corresponding inspection rule respectively to obtain the processing result to be predicted.
For example, suppose that in the inspection rule of the problem to be checked, memory leak, the outlier threshold of the checkpoint being the total number of created files is 30; since the monitoring data 34 of that checkpoint exceeds the outlier threshold 30, the checkpoint of the total number of created files is abnormal. The outlier threshold of the checkpoint being the memory usage growth slope is 20%; since the monitoring data 48% of that checkpoint exceeds the outlier threshold 20%, the checkpoint of the memory usage growth slope is abnormal. The outlier threshold of the checkpoint being the change rate of the traffic pressure over the last week is 14%; since the monitoring data 1% of that checkpoint is lower than the outlier threshold 14%, the checkpoint of the change rate of the traffic pressure over the last week is normal. Suppose further that in the inspection rule of the problem to be checked, read-write long tail, the outlier threshold of the checkpoint being the read-write calling frequency over the last week is 30%; since the monitoring data 75.6% of that checkpoint exceeds the outlier threshold 30%, the checkpoint of the read-write calling frequency over the last week is abnormal. The outlier threshold of the checkpoint being the retransmission rate of the network in the cluster is 10%; since the monitoring data 5.3% of that checkpoint is lower than the outlier threshold 10%, the checkpoint of the retransmission rate of the network in the cluster is normal. The outlier threshold of the checkpoint being the disk health status score information in the cluster is 60; since the monitoring data 15 of that checkpoint is lower than the outlier threshold 60, the checkpoint of the disk health status score information in the cluster is normal. The processing result obtained for the problem to be checked is therefore that, in the inspection rule corresponding to the memory leak, the checkpoint of the total number of created files is abnormal and the checkpoint of the memory usage growth slope is abnormal, and that, in the inspection rule corresponding to the read-write long tail, the checkpoint of the read-write calling frequency over the last week is abnormal.
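The comparison carried out by the data processing unit amounts to matching each checkpoint's monitoring data against its outlier threshold. A minimal sketch, using the memory-leak figures from the example above (function and key names are illustrative):

```python
def match_problem(rule, monitoring_data):
    """Return the checkpoints of one inspection rule whose monitoring data exceeds
    the outlier threshold, i.e. the abnormal checkpoints for that problem."""
    return {
        checkpoint: value
        for checkpoint, value in monitoring_data.items()
        if checkpoint in rule and value > rule[checkpoint]
    }


memory_leak_rule = {"total number of created files": 30,
                    "memory usage growth slope": 0.20,
                    "change rate of traffic pressure": 0.14}
data = {"total number of created files": 34,
        "memory usage growth slope": 0.48,
        "change rate of traffic pressure": 0.01}

print(match_problem(memory_leak_rule, data))
# created-file total (34 > 30) and growth slope (0.48 > 0.20) are abnormal;
# the traffic-pressure change rate (0.01 < 0.14) is normal.
```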
In the above-described embodiment of the application, after the monitoring processing unit 13 processes the monitoring data of each checkpoint respectively based on the inspection rule corresponding to the problem to be checked, the corresponding processing result is obtained. Then, in the early warning feedback device 14, the corresponding problem is transferred based on the processing result. Since the processing result is that, in the inspection rule corresponding to the memory leak, the checkpoint of the total number of created files is abnormal and the checkpoint of the memory usage growth slope is abnormal, and that, in the inspection rule corresponding to the read-write long tail, the checkpoint of the read-write calling frequency over the last week is abnormal, the corresponding problems transferred are memory leak and read-write long tail; in the early warning feedback device 14, health early warning information is generated and fed back based on the relevant information of those problems.
Further, the relevant information of the problem includes at least any one of the following: the time of occurrence of the problem, the monitoring data of each related checkpoint, and the checkpoints whose monitoring data becomes abnormal when the problem occurs.
In the above-described embodiment, the health early warning information is generated based on the relevant information of the problem, so the health early warning information includes the problem, its corresponding time of occurrence, and each checkpoint whose monitoring data becomes abnormal when the problem occurs, together with its monitoring data.
For example, based on the processing result that the checkpoint of the total number of created files and the checkpoint of the memory usage growth slope in the inspection rule corresponding to the memory leak are abnormal, and that the checkpoint of the read-write calling frequency over the last week in the inspection rule corresponding to the read-write long tail is abnormal, the corresponding problems transferred are memory leak and read-write long tail. Health early warning information is then generated and fed back based on the relevant information of those problems according to the early warning report template in the distributed file system, where the generated health early warning information is {{memory leak: at t1, the total number of created files 34 is abnormal; at t2, the memory usage growth slope 48% is abnormal}; {read-write long tail: at t3, the read-write calling frequency 75.6% over the last week is abnormal}}. This information is fed back to the system maintenance personnel so that, based on the fed-back health early warning information, they can give early warnings for each checkpoint of the cluster in which a problem is found and handle the relevant health warning information. This improves the real-time performance of multi-checkpoint monitoring of the distributed file system on line, achieves the purpose of warning about multiple checkpoints in the cluster in advance and handling health early warning information, and also improves the accuracy of the health status prediction for each checkpoint corresponding to the problem in the cluster.
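For readability, the report assembly could be sketched as below. The exact early warning report template is not specified by the patent, so the formatting, input structure and function name are assumptions; only the example figures come from the text above.

```python
def build_health_warning(abnormal_by_problem):
    """Format health early warning information: problem, time of occurrence,
    abnormal checkpoint and its monitoring data."""
    parts = []
    for problem, events in abnormal_by_problem.items():
        details = "; ".join(f"at {t}, {cp} = {val} is abnormal" for t, cp, val in events)
        parts.append(f"{{{problem}: {details}}}")
    return "{" + "; ".join(parts) + "}"


warning = build_health_warning({
    "memory leak": [("t1", "total number of created files", 34),
                    ("t2", "memory usage growth slope", "48%")],
    "read-write long tail": [("t3", "read-write calling frequency", "75.6%")],
})
print(warning)
```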
In the early warning feedback device 14, if the monitoring data of every checkpoint in all the processing results does not exceed its outlier threshold, health status information is generated, so that the maintenance personnel of the distributed file system know that the whole distributed file system is in a healthy state and no health early warning processing is needed.
In the embodiment of the application, while the problem rule base is used to perform aggregation calculation on the monitoring data of the checkpoints of each cluster in the distributed file system, the problem rule base also needs to be created and continuously updated, as shown in Fig. 6.
Fig. 6 shows a schematic structural diagram of the devices for creating the problem rule base in an equipment for checking cluster health status according to another aspect of the application. The equipment 1 further includes a creating rules device 15 and a Policy Updates device 16.
Wherein, the creating rules device 15 creates the problem rule base, and the problem rule base includes at least one problem and its corresponding inspection rule; the Policy Updates device 16 updates the problems in the problem rule base and their corresponding inspection rules.
In the embodiment of the application, before aggregation calculation is performed on the monitoring data of each checkpoint in the distributed file system, the problem rule base needs to be created, where the problem rule base includes at least one problem and the inspection rule corresponding to each problem, and the inspection rule includes at least one checkpoint and the outlier threshold of each checkpoint. That is, multiple checkpoints become abnormal before a problem occurs, and the corresponding problem is predicted based on the multiple checkpoints that have become abnormal.
For example, the problem rule base includes problem 1, problem 2 and problem 3, where the inspection rule corresponding to problem 1 is {problem 1: the outlier threshold of checkpoint A is A1, the outlier threshold of checkpoint B is B1, and the outlier threshold of checkpoint C is C1}; the inspection rule corresponding to problem 2 is {problem 2: the outlier threshold of checkpoint D is D1, the outlier threshold of checkpoint E is E1, and the outlier threshold of checkpoint F is F1}; the inspection rule corresponding to problem 3 is {problem 3: the outlier threshold of checkpoint G is G1 and the outlier threshold of checkpoint H is H1}.
With the explosion of user data, the scale of the distributed file system keeps growing. When the health status of the distributed file system on this ever-growing scale is predicted, multiple checkpoints in fact become abnormal in advance before a problem occurs. It is therefore necessary to iterate over the abnormal monitoring data of each checkpoint in the fixed time period before the problem occurs, in order to find the inspection rule that best reflects the problem when it becomes abnormal, as shown in Fig. 7.
Fig. 7 shows a schematic structural diagram of the Policy Updates device 16 in an equipment for checking cluster health status provided in one embodiment of the application. The Policy Updates device 16 includes: a first information acquiring unit 161, a first recording unit 162, a first probability updating unit 163 and a first Policy Updates unit 164.
Wherein, the first information acquiring unit 161 obtains the relevant information of the cluster to be checked, the problem to be updated and its initial monitoring threshold; the first recording unit 162 obtains, based on the initial monitoring threshold, the time of occurrence point of the problem to be updated and the monitoring data of all the checkpoints in the setting time section before the time of occurrence point from the relevant information of the cluster, and determines and records the abnormal checkpoints based on the monitoring data; the first probability updating unit 163, whenever the problem to be updated occurs in a setting time section, updates the probability of occurrence of each checkpoint when the problem to be updated occurs, based on the abnormal checkpoints recorded in the present setting time section and the abnormal checkpoints of the historical record; the first Policy Updates unit 164 updates the inspection rule of the problem to be updated based on the checkpoints whose updated probabilities of occurrence are higher than the setting probability and their relevant information.
In the embodiment of the application, when a problem in the problem rule base needs to be updated, first the first information acquiring unit 161 obtains the cluster position information and the review time section of the cluster to be checked, as well as the trained problem to be updated and its corresponding initial monitoring threshold; then the first recording unit 162, based on the initial monitoring threshold, obtains from the cluster corresponding to the cluster position information, within the review time section, the time of occurrence point of the problem to be updated and the monitoring data of all the checkpoints in the setting time section before the time of occurrence point, and records the checkpoints whose monitoring data becomes abnormal; next, the first probability updating unit 163, whenever the problem to be updated occurs in a setting time section, updates the probability of occurrence of each checkpoint when the problem to be updated occurs, based on the abnormal checkpoints recorded in the present setting time section and the abnormal checkpoints of the historical record; finally, the first Policy Updates unit 164 updates the inspection rule of the problem to be updated based on the checkpoints whose updated probabilities of occurrence are higher than the setting probability and their relevant information. The problem rule base is thus updated by updating the inspection rule of the problem to be updated, so that the problem rule base can reflect the abnormal checkpoints in the distributed file system more comprehensively and more accurately, the health status of the multiple checkpoints involved when the problem occurs is monitored, and the accuracy and real-time performance of the health status prediction for each checkpoint corresponding to the problem in the cluster are improved.
Further, the initial monitoring threshold includes: the outlier thresholds of the monitoring data of all the checkpoints and the weight threshold of the checkpoints that become abnormal. The first recording unit 162 is used for: obtaining, based on the outlier thresholds of the monitoring data of all the checkpoints, the time of occurrence point of the problem to be updated and the monitoring data of all the checkpoints in the setting time section before the time of occurrence point from the relevant information of the cluster, and recording the corresponding checkpoints when the weight of an abnormal checkpoint exceeds the weight threshold, wherein the weight of a checkpoint is determined based on the probability of occurrence of the abnormal checkpoint.
It should be noted that the probability of occurrence of a checkpoint and the weight of a checkpoint when the problem to be updated occurs are computed as follows. Suppose the problem 1 to be updated occurs 1000 times in the setting time section, the checkpoint A becomes abnormal 654 times, the checkpoint B becomes abnormal 252 times, and the checkpoint C becomes abnormal 94 times; then the probability of occurrence of the checkpoint A is 65.4%, the probability of occurrence of the checkpoint B is 25.2%, and the probability of occurrence of the checkpoint C is 9.4%, and the weight of the checkpoint A is 65.4%/(65.4%+25.2%+9.4%) = 65.4%, the weight of the checkpoint B is 25.2%/(65.4%+25.2%+9.4%) = 25.2%, and the weight of the checkpoint C is 9.4%/(65.4%+25.2%+9.4%) = 9.4%. Of course, other methods for computing the probability of occurrence and the weight of a checkpoint that exist now or may appear in the future, if applicable to the application, should also be included within the protection scope of the application and are incorporated herein by reference.
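The arithmetic of this worked example can be reproduced directly; the snippet below is only an illustration of the ratios stated above, with variable names chosen by the editor.

```python
occurrences = {"A": 654, "B": 252, "C": 94}   # times each checkpoint was abnormal
total_problem_occurrences = 1000              # times problem 1 occurred in the setting time section

# Probability of occurrence of each checkpoint when the problem occurs.
prob = {cp: n / total_problem_occurrences for cp, n in occurrences.items()}
# -> {'A': 0.654, 'B': 0.252, 'C': 0.094}

# Weight of each checkpoint: its probability divided by the sum of all probabilities.
prob_sum = sum(prob.values())                 # essentially 1.0 here
weight = {cp: p / prob_sum for cp, p in prob.items()}
# -> weights equal the probabilities: 65.4%, 25.2%, 9.4%
```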
Preferably, the first recording unit 162 includes: a judgment sub-unit (not shown) and a record sub-unit (not shown), wherein the judgment sub-unit (not shown) is used for judging whether the monitoring data of a checkpoint exceeds its outlier threshold, and the record sub-unit (not shown) is used for determining and recording the corresponding abnormal checkpoint if it does.
For example, suppose that the inspection rule corresponding to the problem 1 to be updated in the problem rule base is {problem 1: the outlier threshold of checkpoint A is A1, the outlier threshold of checkpoint B is B1, and the outlier threshold of checkpoint C is C1}, and that the weight threshold of the checkpoints is 10%. Based on the cluster position information to be checked and the review time section, the time of occurrence point t of the problem 1 to be updated is obtained, together with the monitoring data of all the checkpoints in the setting time sections (t+Δt) and (t+2Δt) before the time of occurrence point t, and the corresponding abnormal checkpoints are recorded according to whether the monitoring data of each checkpoint exceeds its outlier threshold. Suppose that in the setting time section (t+Δt) before the time of occurrence point t, the checkpoints whose monitoring data exceeds the corresponding outlier threshold are checkpoint A, checkpoint B and checkpoint C, where the probability of occurrence of the checkpoint A is 65.4%, that of the checkpoint B is 25.2%, and that of the checkpoint C is 9.4%. Weights are calculated from these probabilities of occurrence: the weight of a checkpoint is the ratio of its probability of occurrence to the sum of the probabilities of occurrence of all the checkpoints, so in the setting time section (t+Δt) the weight of the checkpoint A is 65.4%, the weight of the checkpoint B is 25.2%, and the weight of the checkpoint C is 9.4%. Since the weight threshold of the checkpoints is 10%, the checkpoints recorded in the setting time section (t+Δt) whose weights exceed the weight threshold are the checkpoint A with its weight 65.4% and the checkpoint B with its weight 25.2%. Suppose that in the setting time section (t+2Δt) before the time of occurrence point t, the checkpoints whose monitoring data exceeds the corresponding outlier threshold are checkpoint A, checkpoint B and checkpoint D, where the probability of occurrence of the checkpoint A is 50.5%, that of the checkpoint B is 1.4%, and that of the checkpoint D is 48.1%. Weights are calculated from these probabilities of occurrence: in the setting time section (t+2Δt) the weight of the checkpoint A is 50.5%, the weight of the checkpoint B is 1.4%, and the weight of the checkpoint D is 48.1%. Since the weight threshold of the checkpoints is 10%, the checkpoints recorded in the setting time section (t+2Δt) whose weights exceed the weight threshold are the checkpoint A with its weight 50.5% and the checkpoint D with its weight 48.1%.
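The per-window recording step, filtering abnormal checkpoints by the weight threshold, can be sketched as follows; the function name and input layout are editorial choices, and the numbers are those of the example above.

```python
def record_abnormal(window_probs, weight_threshold=0.10):
    """Given the probabilities of occurrence of the abnormal checkpoints in one
    setting time section, keep only those whose weight exceeds the weight threshold."""
    total = sum(window_probs.values())
    weights = {cp: p / total for cp, p in window_probs.items()}
    return {cp: w for cp, w in weights.items() if w > weight_threshold}


# Setting time section (t+dt):   A, B, C abnormal -> C (9.4%) is below 10% and is not recorded.
print(record_abnormal({"A": 0.654, "B": 0.252, "C": 0.094}))
# Setting time section (t+2*dt): A, B, D abnormal -> B (1.4%) is below 10% and is not recorded.
print(record_abnormal({"A": 0.505, "B": 0.014, "D": 0.481}))
```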
Further, the first probability updating unit 163 includes: a weight determination sub-unit (not shown) and a probability updating sub-unit (not shown), wherein the weight determination sub-unit (not shown) is used for determining, whenever the problem to be updated occurs in a setting time section, the present weight of each checkpoint in the present setting time section based on the probabilities of occurrence of the abnormal checkpoints recorded in the present setting time section; the probability updating sub-unit (not shown) is used for updating the probability of occurrence of each checkpoint when the problem to be updated occurs, based on the present weight of the checkpoint and the history weights of the abnormal checkpoints of the historical record.
In the above-described embodiment of the application, suppose that when the problem 1 to be updated becomes abnormal at moment t, in the preceding present setting time section (t+Δt) the recorded checkpoints whose weights exceed the weight threshold are the checkpoint A with its present weight 65.4% and the checkpoint B with its present weight 25.2%, where the present weight of a checkpoint is determined based on its probability of occurrence; and that when the problem 1 to be updated became abnormal at moment t, in the preceding historical setting time section (t+2Δt) the recorded checkpoints whose weights exceed the weight threshold are the checkpoint A with its history weight 50.5% and the checkpoint D with its history weight 48.1%. Then, based on the present weight and the history weight of each checkpoint of the problem 1 to be updated, the probability of occurrence of each checkpoint when the problem to be updated occurs is updated, i.e. the comprehensive weight of each checkpoint when the problem to be updated occurs is updated, where the comprehensive weight of a checkpoint is the average of its present weight and its history weight. The comprehensive weight of the checkpoint A corresponding to the problem to be updated is therefore (65.4%+50.5%)/2 = 57.95%, the comprehensive weight of the checkpoint B is (25.2%+1.4%)/2 = 13.3%, and the comprehensive weight of the checkpoint D is (0+48.1%)/2 = 24.05%. Based on the present weight of each checkpoint and the history weights of the checkpoints of the historical record, the probability of occurrence of each checkpoint when the problem 1 to be updated occurs is thus updated: the updated probability of occurrence of the checkpoint A is 57.95%, that of the checkpoint B is 13.3%, and that of the checkpoint D is 24.05%.
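The comprehensive-weight update can be reproduced as below. Following the worked example, the present weights come from the current setting time section and the history weights from the historical one (including the checkpoint B value 1.4% used in the example); missing entries count as zero. Variable names are editorial.

```python
present = {"A": 0.654, "B": 0.252}            # present weights (D did not appear)
history = {"A": 0.505, "B": 0.014, "D": 0.481}  # history weights from the earlier window

checkpoints = set(present) | set(history)
# Comprehensive weight = average of present weight and history weight.
updated_prob = {cp: (present.get(cp, 0.0) + history.get(cp, 0.0)) / 2 for cp in checkpoints}
# -> A: 0.5795, B: 0.133, D: 0.2405, i.e. the updated probabilities 57.95%, 13.3%, 24.05%
```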
In the above-described embodiment of the application, the setting probability is consistent in value with the weight threshold of the checkpoint in the step S164, i.e. the setting probability is 10%. Since the updated probability of occurrence 57.95% of the checkpoint A corresponding to the problem 1 to be updated is higher than the setting probability 10%, the updated probability of occurrence 13.3% of the checkpoint B is higher than the setting probability 10%, and the updated probability of occurrence 24.05% of the checkpoint D is higher than the setting probability 10%, the checkpoint C is discarded from the inspection rule of the problem 1 to be updated in the problem rule base, the checkpoint D and its corresponding outlier threshold are added to the inspection rule of the problem to be updated in the problem rule base, and the inspection rule of the problem to be updated is updated based on the checkpoint A, the checkpoint B and the checkpoint D, whose updated probabilities of occurrence are higher than the setting probability, and their relevant information.
Further, the relevant information of the checkpoint includes at least any one of the following: the outlier threshold of the monitoring data of the checkpoint and the weight of the checkpoint, wherein the weight of the checkpoint is determined based on the probability of occurrence of the checkpoint.
In the above-described embodiment of the application, the inspection rule of the problem 1 to be updated is therefore updated with the checkpoint A together with the outlier threshold A1 of its corresponding monitoring data and its weight 57.95%, the checkpoint B together with the outlier threshold B1 of its corresponding monitoring data and its weight 13.3%, and the checkpoint D together with the outlier threshold D1 of its corresponding monitoring data and its weight 24.05%.
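The resulting rule update could be sketched as below. The outlier thresholds A1/B1/D1 are kept symbolic as in the text, and the function name and output layout are editorial assumptions.

```python
def update_inspection_rule(updated_prob, outlier_thresholds, setting_probability=0.10):
    """Rebuild the inspection rule of the problem to be updated: keep only checkpoints
    whose updated probability of occurrence exceeds the setting probability,
    together with their outlier thresholds and weights."""
    return {
        cp: {"outlier_threshold": outlier_thresholds[cp], "weight": p}
        for cp, p in updated_prob.items()
        if p > setting_probability
    }


new_rule = update_inspection_rule(
    updated_prob={"A": 0.5795, "B": 0.133, "D": 0.2405},
    outlier_thresholds={"A": "A1", "B": "B1", "C": "C1", "D": "D1"},
)
# Checkpoint C is no longer referenced, so it is effectively discarded from the rule;
# checkpoint D with its outlier threshold D1 and weight 24.05% is added.
print(new_rule)
```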
As health examinations are continuously performed on the distributed file system, health warning information is obtained, and advanced processing is carried out based on that health early warning information, the user can obtain more than one piece of processed inspection result information for the distributed file system. In practice, multiple checkpoints become abnormal in advance before a problem occurs. It is therefore necessary, based on the obtained inspection result information, to iterate over the abnormal monitoring data of each checkpoint in the fixed time period before the problem occurs, in order to find the inspection rule that best reflects the problem when it becomes abnormal, as shown in Fig. 8.
Fig. 8 shows a schematic structural diagram of the Policy Updates device 16 in an equipment for checking cluster health status provided in another embodiment of the application. The Policy Updates device 16 includes: a second information acquisition unit 165, a second recording unit 166, a second probability updating unit 167 and a Second Rule updating block 168.
Wherein, the second information acquisition unit 165 obtains a problem to be updated and obtains the time of occurrence point of the problem to be updated from at least one piece of inspection result information; the second recording unit 166 obtains the monitoring data of all the checkpoints in the setting time section before the time of occurrence point, and determines and records the abnormal checkpoints based on the monitoring data; the second probability updating unit 167 updates the probability of occurrence of each checkpoint when the problem to be updated occurs, based on the abnormal checkpoints recorded in the present setting time section and the abnormal checkpoints of the historical record; the Second Rule updating block 168 updates the inspection rule of the problem to be updated based on the checkpoints whose updated probabilities of occurrence are higher than the setting probability and their relevant information.
It should be noted that the inspection result information includes at least any one of the following: the problem in which an abnormality occurs, the time of occurrence point of the problem, and the checkpoints that become abnormal when the problem occurs together with their outlier thresholds. Of course, other inspection result information that exists now or may appear in the future, if applicable to the application, should also be included within the protection scope of the application and is incorporated herein by reference.
In the embodiment of the application, when a problem in the problem rule base needs to be updated, first the second information acquisition unit 165 obtains the problem to be updated and obtains the time of occurrence point of the problem to be updated from at least one piece of inspection result information; then the second recording unit 166 obtains the monitoring data of all the checkpoints in the setting time section before the time of occurrence point, and determines and records the abnormal checkpoints based on the monitoring data of the checkpoints and their outlier thresholds; next, the second probability updating unit 167 updates the probability of occurrence of each checkpoint when the problem to be updated occurs, based on the abnormal checkpoints recorded in the present setting time section and the abnormal checkpoints of the historical record; finally, the Second Rule updating block 168 updates the inspection rule of the problem to be updated based on the checkpoints whose updated probabilities of occurrence are higher than the setting probability and their relevant information. The problem rule base is thus updated by updating the inspection rule of the problem to be updated with the at least one piece of acquired inspection result information, so that the problem rule base can reflect the abnormal checkpoints in the distributed file system more comprehensively and more accurately, the health status of the multiple checkpoints involved when the problem occurs is monitored, and the accuracy and real-time performance of the health status prediction for each checkpoint corresponding to the problem in the cluster are improved.
In addition, the application also provides an equipment for checking cluster health status, comprising:
a processor;
and a memory arranged to store computer-executable instructions which, when executed, cause the processor to:
obtain the relevant information of a cluster to be checked;
obtain at least one problem to be checked and its corresponding inspection rule;
based on the relevant information of the cluster, obtain from the cluster the monitoring data of the checkpoints related to the inspection rule, and perform aggregation processing on the monitoring data to obtain a processing result;
transfer the corresponding problem based on the processing result, and generate and feed back health early warning information based on the relevant information of the problem.
Compared with the prior art, the method and equipment for checking cluster health status provided by the embodiments of the application obtain the relevant information of the cluster to be checked; obtain at least one problem to be checked and its corresponding inspection rule; based on the relevant information of the cluster, obtain from the cluster the monitoring data of the checkpoints related to the inspection rule and perform aggregation processing on the monitoring data to obtain a processing result; and transfer the corresponding problem based on the processing result and generate and feed back health early warning information based on the relevant information of the problem. Because, before the health status of the distributed file system on line is predicted, the problems to be checked that may become abnormal on the distributed file system on line are, as far as possible, turned into corresponding rules so as to obtain the inspection rules corresponding to the problems to be checked, the monitoring data corresponding to each checkpoint can be obtained directly when the health status of the distributed file system on line is predicted, and aggregation processing can be performed on the monitoring data of the checkpoints using the inspection rules to obtain the processing result, which improves the accuracy of health status monitoring for the multiple checkpoints under each clustered node. The corresponding problem is transferred based on the processing result, and health early warning information is generated and fed back based on the relevant information of the problem, so that the maintenance personnel, based on the fed-back health early warning information, can give early warnings for each checkpoint under each clustered node in which a problem is found and handle the relevant health warning information, thereby improving the real-time performance of multi-checkpoint monitoring of the distributed file system on line and achieving the purpose of multi-point early warning in advance. Further, performing aggregation processing on the monitoring data to obtain the processing result includes: processing the monitoring data of each checkpoint respectively based on the inspection rule corresponding to the problem to be checked, so as to obtain at least one checkpoint whose monitoring data becomes abnormal and to feed back the processing result, which realizes the monitoring of the health status of the multiple checkpoints involved when the problem occurs and improves the accuracy of the health status prediction for each checkpoint corresponding to the problem in the cluster.
Further, the method and equipment for checking cluster health status provided by the embodiments of the application also create a problem rule base, the problem rule base including at least one problem and its corresponding inspection rule, and update the problems in the problem rule base and their corresponding inspection rules. This ensures that inspection rules are created for each checkpoint in the distributed file system on line that is most likely to develop a problem, and that the problems in the problem rule base and their corresponding inspection rules are updated based on the monitoring data of each checkpoint, so that the created problem rule base can reflect the abnormal checkpoints in the distributed file system more comprehensively and more accurately, the health status of the multiple checkpoints involved when the problem occurs is monitored, and the accuracy and real-time performance of the health status prediction for each checkpoint corresponding to the problem in the cluster are improved.
Further, updating the problems in the problem rule base and their corresponding inspection rules includes: obtaining the relevant information of the cluster to be checked, the problem to be updated and its initial monitoring threshold; based on the initial monitoring threshold, obtaining from the relevant information of the cluster the time of occurrence point of the problem to be updated and the monitoring data of all the checkpoints in the setting time section before the time of occurrence point, and determining and recording the abnormal checkpoints based on the monitoring data; whenever the problem to be updated occurs in a setting time section, updating the probability of occurrence of each checkpoint when the problem to be updated occurs, based on the abnormal checkpoints recorded in the present setting time section and the abnormal checkpoints of the historical record; and updating the inspection rule of the problem to be updated based on the checkpoints whose updated probabilities of occurrence are higher than the setting probability and their relevant information. In this way, the monitoring data of all the checkpoints in the setting time section before the time of occurrence point of the problem to be updated is evaluated against the initial monitoring threshold, and whenever the problem to be updated occurs in a setting time section, the probability of occurrence of each checkpoint when the problem to be updated occurs is updated based on the abnormal checkpoints recorded in the present setting time section and the abnormal checkpoints of the historical record, so that the inspection rule of the problem to be updated is updated with the checkpoints whose updated probabilities of occurrence are higher than the setting probability and their relevant information. The problem rule base is thus updated by updating the inspection rule of the problem to be updated, so that the problem rule base reflects the abnormal checkpoints in the distributed file system more comprehensively and more accurately, the health status of the multiple checkpoints involved when the problem occurs is monitored, and the accuracy and real-time performance of the health status prediction for each checkpoint corresponding to the problem in the cluster are improved.
It should be noted that the application can be implemented in software and/or in a combination of software and hardware; for example, it can be realized using an application-specific integrated circuit (ASIC), a general-purpose computer, or any other similar hardware device. In one embodiment, the software program of the application can be executed by a processor to realize the steps or functions described above. Likewise, the software program of the application (including related data structures) can be stored in a computer-readable recording medium, for example a RAM memory, a magnetic or optical drive, a floppy disk, or a similar device. In addition, some steps or functions of the application can be realized in hardware, for example as a circuit that cooperates with the processor to perform each step or function.
In addition, part of the application can be applied as a computer program product, for example computer program instructions which, when executed by a computer, can invoke or provide the method and/or technical solution according to the application through the operation of the computer. The program instructions that invoke the method of the application may be stored in a fixed or removable recording medium, and/or be transmitted through broadcast or a data stream in other signal-bearing media, and/or be stored in the working memory of a computer device that runs according to the program instructions. Here, an embodiment according to the application includes a device, the device including a memory for storing computer program instructions and a processor for executing the program instructions, wherein, when the computer program instructions are executed by the processor, the device is triggered to run the methods and/or technical solutions based on the multiple embodiments of the application described above.
It is obvious to those skilled in the art that the application is not limited to the details of the above exemplary embodiments and that the application can be realized in other specific forms without departing from the spirit or essential characteristics of the application. Therefore, the embodiments should be regarded as exemplary and non-restrictive in every respect, and the scope of the application is defined by the appended claims rather than by the above description; it is intended that all changes falling within the meaning and scope of equivalency of the claims be included in the application. No reference sign in a claim should be construed as limiting the claim concerned. In addition, it is clear that the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or devices stated in a device claim can also be realized by one unit or device through software or hardware. Words such as "first" and "second" are used to denote names and do not denote any particular order.

Claims (29)

1. A method for checking cluster health status, wherein the method comprises:
obtaining the relevant information of a cluster to be checked;
obtaining at least one problem to be checked and its corresponding inspection rule;
based on the relevant information of the cluster, obtaining from the cluster the monitoring data of checkpoints related to the inspection rule, and performing aggregation processing on the monitoring data to obtain a processing result;
transferring the corresponding problem based on the processing result, and generating and feeding back health early warning information based on the relevant information of the problem.
2. The method according to claim 1, wherein obtaining at least one problem to be checked and its corresponding inspection rule comprises:
obtaining at least one problem to be checked and its corresponding inspection rule from a problem rule base.
3. The method according to claim 1, wherein obtaining the relevant information of a cluster to be checked comprises:
obtaining the relevant information of the cluster to be checked based on a request submitted by a user, wherein the relevant information comprises: cluster position information and a review time section.
4. The method according to claim 3, wherein obtaining from the cluster the monitoring data of checkpoints related to the inspection rule comprises:
finding the cluster based on the cluster position information, and obtaining in the cluster the checkpoints related to the inspection rule;
obtaining, from the monitoring module of the cluster, the monitoring data of the related checkpoints in the review time section.
5. The method according to claim 1, wherein performing aggregation processing on the monitoring data to obtain a processing result comprises:
processing the monitoring data of each checkpoint respectively based on the inspection rule corresponding to the problem to be checked, so as to obtain at least one checkpoint whose monitoring data becomes abnormal and to feed back the processing result.
6. The method according to claim 1, wherein the relevant information of the problem comprises at least any one of the following:
the time of occurrence of the problem, the monitoring data of each related checkpoint, and the checkpoints whose monitoring data becomes abnormal when the problem occurs.
7. The method according to claim 1, wherein the method further comprises:
creating a problem rule base, the problem rule base comprising at least one problem and its corresponding inspection rule;
updating the problems in the problem rule base and their corresponding inspection rules.
8. The method according to claim 7, wherein updating the problems in the problem rule base and their corresponding inspection rules comprises:
obtaining the relevant information of a cluster to be checked, a problem to be updated and its initial monitoring threshold;
based on the initial monitoring threshold, obtaining from the relevant information of the cluster the time of occurrence point of the problem to be updated and the monitoring data of all the checkpoints in the setting time section before the time of occurrence point, and determining and recording the abnormal checkpoints based on the monitoring data;
whenever the problem to be updated occurs in a setting time section, updating the probability of occurrence of each checkpoint when the problem to be updated occurs, based on the abnormal checkpoints recorded in the present setting time section and the abnormal checkpoints of the historical record;
updating the inspection rule of the problem to be updated based on the checkpoints whose updated probabilities of occurrence are higher than a setting probability and their relevant information.
9. The method according to claim 8, wherein the initial monitoring threshold comprises: the outlier thresholds of the monitoring data of all the checkpoints and the weight threshold of the checkpoints that become abnormal;
and wherein obtaining, based on the initial monitoring threshold, from the relevant information of the cluster the time of occurrence point of the problem to be updated and the monitoring data of all the checkpoints in the setting time section before the time of occurrence point, and determining and recording the abnormal checkpoints based on the monitoring data comprises:
obtaining, based on the outlier thresholds of the monitoring data of all the checkpoints, from the relevant information of the cluster the time of occurrence point of the problem to be updated and the monitoring data of all the checkpoints in the setting time section before the time of occurrence point, and recording the corresponding checkpoints when the weight of an abnormal checkpoint exceeds the weight threshold, wherein the weight of a checkpoint is determined based on the probability of occurrence of the abnormal checkpoint.
10. The method according to claim 8, wherein determining and recording the abnormal checkpoints based on the monitoring data comprises:
judging whether the monitoring data of a checkpoint exceeds its outlier threshold;
if so, determining and recording the corresponding abnormal checkpoint.
11. The method according to claim 7, wherein updating the problems in the problem rule base and their corresponding inspection rules comprises:
obtaining a problem to be updated, and obtaining the time of occurrence point of the problem to be updated from at least one piece of inspection result information;
obtaining the monitoring data of all the checkpoints in the setting time section before the time of occurrence point, and determining and recording the abnormal checkpoints based on the monitoring data;
updating the probability of occurrence of each checkpoint when the problem to be updated occurs, based on the abnormal checkpoints recorded in the present setting time section and the abnormal checkpoints of the historical record;
updating the inspection rule of the problem to be updated based on the checkpoints whose updated probabilities of occurrence are higher than a setting probability and their relevant information.
12. The method according to any one of claims 8 to 11, wherein the relevant information of a checkpoint includes at least one of the following:
the outlier threshold of the monitoring data of the checkpoint and the weight of the checkpoint, wherein the weight of the checkpoint is determined based on the probability of occurrence of the checkpoint.
13. The method according to any one of claims 8 to 12, wherein, each time the problem to be updated occurs within a set time period, updating the probability of occurrence of each checkpoint when the problem to be updated occurs based on the abnormal checkpoints recorded in the current set time period and the abnormal checkpoints in the historical records comprises:
each time the problem to be updated occurs within a set time period, determining the current weight of each checkpoint in the current set time period based on the probability of occurrence of the abnormal checkpoints recorded in the current set time period;
updating the probability of occurrence of each checkpoint when the problem to be updated occurs, based on the current weight of the checkpoint and the historical weights of the abnormal checkpoints in the historical records.
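Claim 13 does not specify how the current weight and the historical weights are combined; an exponentially weighted update is one plausible reading, sketched below in Python with an assumed smoothing factor:

    from typing import Dict, Set

    def update_occurrence_probabilities(current_abnormal: Set[str],
                                        history_weights: Dict[str, float],
                                        smoothing: float = 0.3) -> Dict[str, float]:
        """Blend the abnormal checkpoints recorded in the current set time
        period with the historically recorded weight of each checkpoint."""
        updated = {}
        for checkpoint in set(history_weights) | current_abnormal:
            current_weight = 1.0 if checkpoint in current_abnormal else 0.0
            history_weight = history_weights.get(checkpoint, 0.0)
            updated[checkpoint] = smoothing * current_weight + (1.0 - smoothing) * history_weight
        return updated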
14. The method according to any one of claims 1 to 13, wherein a checkpoint includes at least one of the following:
a local module of a hardware device in the cluster, or a software device in the cluster.
15. A device for checking the health status of a cluster, wherein the device comprises:
an information acquisition device, configured to obtain the relevant information of a cluster to be checked;
a rule device, configured to obtain at least one problem to be checked and its corresponding check rule;
a monitoring processing device, configured to obtain, from the cluster and based on the relevant information of the cluster, the monitoring data of the checkpoints related to the check rule, and to aggregate the monitoring data to obtain a processing result;
an early-warning feedback device, configured to retrieve the corresponding problem based on the processing result, and to generate and feed back health warning information based on the relevant information of the problem.
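A skeletal Python composition of the four components in claim 15; the collaborator objects and their method names (get_cluster_info, get_problems_and_rules, collect, aggregate, warn) are assumptions used only to show how such components could interact:

    class ClusterHealthChecker:
        """Wires together the information acquisition, rule, monitoring
        processing, and early-warning feedback components."""

        def __init__(self, info_acquirer, rule_source, monitor, warner):
            self.info_acquirer = info_acquirer   # information acquisition device
            self.rule_source = rule_source       # rule device
            self.monitor = monitor               # monitoring processing device
            self.warner = warner                 # early-warning feedback device

        def check(self, cluster_id):
            cluster_info = self.info_acquirer.get_cluster_info(cluster_id)
            warnings = []
            for problem, rule in self.rule_source.get_problems_and_rules():
                data = self.monitor.collect(cluster_info, rule)
                result = self.monitor.aggregate(data)
                warnings.append(self.warner.warn(problem, result))
            return warnings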
16. The device according to claim 15, wherein the rule device is configured to:
obtain at least one problem to be checked and its corresponding check rule from a problem rule base.
17. The device according to claim 15, wherein the information acquisition device is configured to:
obtain the relevant information of the cluster to be checked based on a request submitted by a user, wherein the relevant information includes the position information of the cluster and a check time period.
18. The device according to claim 17, wherein the monitoring processing device comprises:
a searching unit, configured to find the cluster based on the cluster position information, and to obtain the checkpoints in the cluster that are related to the check rule;
a data acquisition unit, configured to obtain, from a monitoring module of the cluster, the monitoring data of the related checkpoints within the check time period.
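A hypothetical Python sketch of the searching and data acquisition units in claim 18; the monitoring_api object and its find_cluster/query methods are invented placeholders for whatever monitoring module the cluster exposes:

    from datetime import datetime
    from typing import Dict, Iterable, List

    def fetch_checkpoint_data(monitoring_api,
                              cluster_position: str,
                              checkpoints: Iterable[str],
                              review_start: datetime,
                              review_end: datetime) -> Dict[str, List[float]]:
        """Resolve the cluster from its position information, then pull each
        related checkpoint's monitoring data for the check time period."""
        cluster = monitoring_api.find_cluster(position=cluster_position)
        data = {}
        for checkpoint in checkpoints:
            data[checkpoint] = monitoring_api.query(cluster=cluster,
                                                    checkpoint=checkpoint,
                                                    start=review_start,
                                                    end=review_end)
        return data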
19. The device according to claim 15, wherein the monitoring processing device comprises:
a data processing unit, configured to process the monitoring data of each checkpoint respectively based on the check rule corresponding to the problem to be checked, so as to obtain at least one checkpoint with abnormal monitoring data and to feed back the processing result.
20. The device according to claim 15, wherein the relevant information of the problem includes at least one of the following:
the time of occurrence of the problem, the monitoring data of each related checkpoint when the problem occurs, and the checkpoint whose monitoring data is abnormal.
21. The device according to claim 15, wherein the device further comprises:
a rule creation device, configured to create a problem rule base, the problem rule base including at least one problem and its corresponding check rule;
a rule updating device, configured to update the problems in the problem rule base and their corresponding check rules.
22. The device according to claim 21, wherein the rule updating device comprises:
a first information acquisition unit, configured to obtain the relevant information of the cluster to be checked, a problem to be updated, and its initial monitoring thresholds;
a first recording unit, configured to obtain, based on the initial monitoring thresholds and from the relevant information of the cluster, the monitoring data of all checkpoints at the time of occurrence of the problem to be updated and within a set time period before that time of occurrence, and to determine and record the abnormal checkpoints based on the monitoring data;
a first probability updating unit, configured to update, each time the problem to be updated occurs within a set time period, the probability of occurrence of each checkpoint when the problem to be updated occurs, based on the abnormal checkpoints recorded in the current set time period and the abnormal checkpoints in the historical records;
a first rule updating unit, configured to update the check rule of the problem to be updated based on the checkpoints whose updated probability of occurrence is higher than a set probability and their relevant information.
23. The device according to claim 22, wherein the initial monitoring thresholds include: the outlier thresholds of the monitoring data of all checkpoints and a weight threshold for abnormal checkpoints;
and the first recording unit is configured to:
based on the outlier thresholds of the monitoring data of all checkpoints, obtain from the relevant information of the cluster the monitoring data of all checkpoints at the time of occurrence of the problem to be updated and within the set time period before that time of occurrence, and record the corresponding checkpoint when the weight of an abnormal checkpoint exceeds the weight threshold, wherein the weight of the checkpoint is determined based on the probability of occurrence of the abnormal checkpoint.
24. The device according to claim 22, wherein the first recording unit comprises:
a judging subunit, configured to judge whether the monitoring data of a checkpoint exceeds its outlier threshold;
a recording subunit, configured to determine and record the corresponding checkpoint as abnormal if so.
25. The device according to claim 21, wherein the rule updating device comprises:
a second information acquisition unit, configured to obtain a problem to be updated, and to obtain the time of occurrence of the problem to be updated from at least one piece of check result information;
a second recording unit, configured to obtain the monitoring data of all checkpoints within a set time period before that time of occurrence, and to determine and record the abnormal checkpoints based on the monitoring data;
a second probability updating unit, configured to update the probability of occurrence of each checkpoint when the problem to be updated occurs, based on the abnormal checkpoints recorded in the current set time period and the abnormal checkpoints in the historical records;
a second rule updating unit, configured to update the check rule of the problem to be updated based on the checkpoints whose updated probability of occurrence is higher than a set probability and their relevant information.
26. The device according to any one of claims 22 to 25, wherein the relevant information of a checkpoint includes at least one of the following:
the outlier threshold of the monitoring data of the checkpoint and the weight of the checkpoint, wherein the weight of the checkpoint is determined based on the probability of occurrence of the checkpoint.
27. The device according to any one of claims 22 to 26, wherein the first probability updating unit comprises:
a weight determination subunit, configured to determine, each time the problem to be updated occurs within a set time period, the current weight of each checkpoint in the current set time period based on the probability of occurrence of the abnormal checkpoints recorded in the current set time period;
a probability updating subunit, configured to update the probability of occurrence of each checkpoint when the problem to be updated occurs, based on the current weight of the checkpoint and the historical weights of the abnormal checkpoints in the historical records.
28. The device according to any one of claims 15 to 27, wherein a checkpoint includes at least one of the following:
a local module of a hardware device in the cluster, or a software device in the cluster.
29. A device for checking the health status of a cluster, comprising:
a processor;
and a memory arranged to store computer-executable instructions which, when executed, cause the processor to:
obtain the relevant information of a cluster to be checked;
obtain at least one problem to be checked and its corresponding check rule;
obtain, from the cluster and based on the relevant information of the cluster, the monitoring data of the checkpoints related to the check rule, and aggregate the monitoring data to obtain a processing result;
retrieve the corresponding problem based on the processing result, and generate and feed back health warning information based on the relevant information of the problem.
CN201710205541.1A 2016-03-31 2017-03-31 Method and equipment for checking health state of cluster Active CN107391335B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610194499 2016-03-31
CN2016101944993 2016-03-31

Publications (2)

Publication Number Publication Date
CN107391335A true CN107391335A (en) 2017-11-24
CN107391335B CN107391335B (en) 2021-09-03

Family

ID=60338371

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710205541.1A Active CN107391335B (en) 2016-03-31 2017-03-31 Method and equipment for checking health state of cluster

Country Status (1)

Country Link
CN (1) CN107391335B (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090132864A1 (en) * 2005-10-28 2009-05-21 Garbow Zachary A Clustering process for software server failure prediction
CN101123521A (en) * 2006-08-07 2008-02-13 华为技术有限公司 A management method for check points in cluster
US20120254669A1 (en) * 2011-04-04 2012-10-04 Microsoft Corporation Proactive failure handling in database services
CN102957563A (en) * 2011-08-16 2013-03-06 中国石油化工股份有限公司 Linux cluster fault automatic recovery method and Linux cluster fault automatic recovery system
CN104917627A (en) * 2015-01-20 2015-09-16 杭州安恒信息技术有限公司 Log cluster scanning and analysis method used for large-scale server cluster
CN104954181A (en) * 2015-06-08 2015-09-30 北京集奥聚合网络技术有限公司 Method for warning faults of distributed cluster devices

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108255676A (en) * 2018-01-15 2018-07-06 南京市城市规划编制研究中心 A kind of monitoring method of software systems client health degree
CN108874640A (en) * 2018-05-07 2018-11-23 北京京东尚科信息技术有限公司 A kind of appraisal procedure and device of clustering performance
CN109376043A (en) * 2018-10-18 2019-02-22 郑州云海信息技术有限公司 A kind of method and apparatus of equipment monitoring
CN110069393A (en) * 2019-03-11 2019-07-30 北京互金新融科技有限公司 Detection method, device, storage medium and the processor of software environment
CN110278133A (en) * 2019-07-31 2019-09-24 中国工商银行股份有限公司 Inspection method, device, calculating equipment and the medium executed by server
CN110278133B (en) * 2019-07-31 2021-08-13 中国工商银行股份有限公司 Checking method, device, computing equipment and medium executed by server
CN113645525A (en) * 2021-08-09 2021-11-12 中国工商银行股份有限公司 Method, device, equipment and storage medium for checking running state of optical fiber switch

Also Published As

Publication number Publication date
CN107391335B (en) 2021-09-03

Similar Documents

Publication Publication Date Title
CN107391335A (en) A kind of method and apparatus for checking cluster health status
US9275353B2 (en) Event-processing operators
US9547970B2 (en) Context-aware wearable safety system
US11392469B2 (en) Framework for testing machine learning workflows
CN109905269A (en) The method and apparatus for determining network failure
CN110069551A (en) Medical Devices O&M information excavating analysis system and its application method based on Spark
JP2009070071A (en) Learning process abnormality diagnostic device and operator's judgement estimation result collecting device
JP2012009064A (en) Learning type process abnormality diagnosis device and operator determination assumption result collection device
US20160048805A1 (en) Method of collaborative software development
Goel et al. A data-driven alarm and event management framework
Jafarian-Namin et al. An integrated quality, maintenance and production model based on the delayed monitoring under the ARMA control chart
DE112019005467T5 (en) SYSTEM AND METHOD OF DETECTING AND PREDICTING PATTERNS OF ANOMALY SENSOR BEHAVIOR OF A MACHINE
Pan et al. Google trends analysis of covid-19 pandemic
CN114118507A (en) Risk assessment early warning method and device based on multi-dimensional information fusion
Arakelian et al. Creation of predictive analytics system for power energy objects
Swiecki et al. Does order matter? investigating sequential and cotemporal models of collaboration
US11887465B2 (en) Methods, systems, and computer programs for alarm handling
US20180285758A1 (en) Methods for creating and analyzing dynamic trail networks
David et al. Toward the incorporation of temporal interaction analysis techniques in modeling and understanding sociotechnical systems
Borissova et al. A concept of intelligent e-maintenance decision making system
CN116545867A (en) Method and device for monitoring abnormal performance index of network element of communication network
Pegoraro Process mining on uncertain event data
TWM592123U (en) Intelligent system for inferring system or product quality abnormality
CN114676021A (en) Job log monitoring method and device, computer equipment and storage medium
JP2013182471A (en) Load evaluation device for plant operation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant