CN108599977A - System and method for monitoring system availability based on statistical methods - Google Patents

System and method for monitoring system availability based on statistical methods

Info

Publication number
CN108599977A
Authority
CN
China
Prior art keywords
error
abnormal
fnum
alarm
monitoring
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810150782.5A
Other languages
Chinese (zh)
Other versions
CN108599977B (en)
Inventor
梅存兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Tu Niu Science And Technology Ltd
Original Assignee
Nanjing Tu Niu Science And Technology Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Tu Niu Science And Technology Ltd
Priority to CN201810150782.5A
Priority claimed from CN201810150782.5A
Publication of CN108599977A
Application granted
Publication of CN108599977B
Legal status: Active
Anticipated expiration

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00: Arrangements for maintenance or administration or management of packet switching networks
    • H04L41/06: Arrangements for maintenance or administration or management of packet switching networks involving management of faults or events or alarms
    • H04L41/0631: Alarm or event or notifications correlation; Root cause analysis
    • H04L41/065: Alarm or event or notifications correlation; Root cause analysis involving logical or physical relationship, e.g. grouping and hierarchies
    • H04L41/0681: Arrangements for maintenance or administration or management of packet switching networks involving management of faults or events or alarms involving configuration of triggering conditions
    • H04L41/069: Arrangements for maintenance or administration or management of packet switching networks involving management of faults or events or alarms involving storage or log of alarms or notifications or post-processing thereof
    • H04L41/22: Arrangements for maintenance or administration or management of packet switching networks using GUI [Graphical User Interface]
    • H04L43/00: Arrangements for monitoring or testing packet switching networks
    • H04L43/08: Monitoring based on specific metrics
    • H04L43/0805: Availability
    • H04L43/0817: Availability functioning
    • H04L43/16: Arrangements for monitoring or testing packet switching networks using threshold monitoring

Abstract

The invention provides a system and method for monitoring system availability based on statistical methods. The system comprises an inter-system service call log module, an alarm threshold analysis module, an alarm analysis module, and a monitoring alarm display module. Call logs between systems are collected, and historical data is periodically analyzed to learn the typical behavior of each system. Data from the most recent unit time t is then analyzed to determine whether each system's current error count is abnormal, whether the error rate of calls between systems is abnormal, and whether the availability of each service and each instance of a system is abnormal. Abnormal systems and the call relationships between abnormal systems are marked as alarms on the system topology graph. When displaying alarm information, the invention shows the state of each system, of the calls between systems, and of system services and instances on the topology graph, so that the faulty system can be located quickly when problems occur across a large number of systems.

Description

System and method for monitoring system availability based on statistical methods
Technical field
The invention belongs to the technical field of software system monitoring and relates to a system and method for monitoring system availability based on statistical methods.
Background art
Internet companies generally run a large number of application systems. Besides public-facing websites, apps and the like, many internal application systems support the operation and management of the business. Internal application systems usually have fairly complex call relationships, and a function that one system exposes for another system to call is referred to as a service. The industry generally monitors application availability by the following means:
Method one: use tools such as zabbix to monitor certain metrics of the system's servers, for example the number of web system processes/threads, CPU load, free memory, the number of abnormal HTTP status codes, and request response time. An alarm is raised when a metric exceeds its configured threshold.
Method two: a simulated client makes periodic calls and checks whether metrics such as the server's response content and speed meet the configured thresholds. An alarm is raised when a metric exceeds its configured threshold.
However, the existing monitoring approaches have a number of defects:
1. The thresholds in methods one and two must be set manually. Thresholds differ widely between systems, and the thresholds of the same system also differ completely between periods, so setting and maintaining thresholds is a heavy workload. In practice trial and error is generally used: the threshold is relaxed after a false alarm and tightened after a missed alarm, so both the false alarm rate and the missed alarm rate are high.
2. The monitoring in method one reflects availability only partially and cannot serve as an actual availability metric: a detected anomaly does not necessarily mean that system availability has dropped, and an unavailable system does not always show up in these metrics.
3. The monitoring in method two reflects availability directly, but as a spot-check technique its sample size is small and its coverage narrow; it can generally monitor only read operations and is rarely used for write operations.
4. When there are many systems or the systems are complex, the two monitoring methods above produce too many metrics, too many alarms, and too much alarm noise, which hinders diagnosing and locating problems.
5. When a new system or a new service goes online, or when the deployment of systems and services changes, both monitoring methods require the monitored items to be maintained manually, which is unsuitable for systems with automatic failover or dynamically scalable service capacity.
6. In error-rate alarming, a threshold method often causes false alarms. For example, when the error rate is required not to exceed 1%, a single operation that happens to fail (error rate 100%) triggers an alarm, although in most cases no alarm is needed.
7. When several systems in a complex system cluster fail at the same time, it is hard to quickly locate the system that has actually failed; everything has to be tackled at once without priority, wasting valuable time.
Summary of the invention
To solve the above problems, the invention provides a system and method for monitoring system availability based on statistical methods. Call logs between systems are collected, and historical data is periodically analyzed to learn the typical behavior of each system. Data from the most recent unit time t is analyzed to determine whether each system's current error count is abnormal, whether the error rate of calls between systems is abnormal, and whether the availability of each service and each instance of a system is abnormal. Abnormal systems and the call relationships between abnormal systems are marked as alarms on the system topology graph.
To achieve the above object, the invention provides the following technical solutions:
A system for monitoring system availability based on statistical methods, comprising: an inter-system service call log module, an alarm threshold analysis module, an alarm analysis module, and a monitoring alarm display module;
The inter-system service call log module collects and records log information for all calls between systems: call time, caller IP and port number, callee IP and port number, identifier of the called service, and whether the call succeeded;
The alarm threshold analysis module periodically learns from historical data to find the typical behavior of each system. It obtains n samples for the current system, each sample describing the error count within one unit time t, and removes the outliers from the sample set. Removing the outliers includes:
i. Calculate the mean u and the standard deviation std of the current sample set;
ii. Find all sample points in the sample set that exceed u+3*std, count them as n1, remove them from the sample set, and count the size n2 of the new sample set that remains;
iii. If the termination condition is met, outlier removal is complete and the following steps are executed; otherwise, return to step i;
Calculate the error count alarm line of the system: alertNum=u+std*3;
The alarm analysis module periodically collects the logs of the most recent period t and analyzes, in turn, whether the error count of each system is abnormal, whether the error rate of each service of each system is abnormal, whether the error rate of each instance is abnormal, and the error counts between any two systems. Specifically, it traverses the system list and makes the following judgments:
a) If the cumulative error count of the system exceeds the alarm threshold of the system, mark the system as abnormal;
b) Traverse each service of the system and use the abnormality judgment method to decide whether its error rate is abnormal;
c) Traverse each instance of the system and use the abnormality judgment method to decide whether its error rate is abnormal;
The abnormality judgment method includes:
a) Denote the number of successful calls as tNum and the number of failed calls as fNum; the total number of calls is num=tNum+fNum;
b) If fNum < the first threshold, return normal; otherwise go to the next step;
c) If num >= the second threshold, go to the next step; otherwise return normal if fNum = the first threshold, and return abnormal if not;
d) If fNum/num < the third threshold, return normal; otherwise go to the next step;
e) If fNum < tNum, k = fNum + the fourth threshold; otherwise k = fNum - the fifth threshold;
f) Calculate $z = \frac{k - num \cdot p_0}{\sqrt{num \cdot p_0 (1 - p_0)}}$, where $p_0$ denotes the third threshold (the highest acceptable error rate);
If z > the sixth threshold, return abnormal; otherwise return normal;
After the judgments are complete, organize the data and calculate the error count for each pair in which a clientInstance calls a serverInstance;
Look up in the system topology graph the systems client and server that correspond to clientInstance and serverInstance, and count the cumulative error count for each pair in which a client system calls a server system;
The alarm display module is based on the system topology graph and displays the results on the topology graph once the alarm data analysis is complete.
Further, the alarm threshold analysis module also sets a minimum alarm threshold: after the error count alarm line of the system is calculated, if alertNum is below the minimum alarm threshold, alertNum is set to the minimum alarm threshold.
Further, the termination condition in the outlier removal process is:
n1 = 0, or (n-n2) > 30, or (n-n2) > n/3.
Further, the alarm display module is also used to:
1. When a system is abnormal, add a warning mark to the system icon;
2. When a service or instance of a system is abnormal, show the error information in a pop-up layer when the system icon is clicked;
3. When the number of call errors between systems is not 0, draw a connecting line with a directional arrow.
Further, the width of the connecting line is related to the error count.
Based on the method for statistical method monitoring system availability, include the following steps:
Step 1, the log information of all calling, allocating time, called side IP and port numbers, quilt between acquisition and recording system Called side IP and port numbers, the service identifiers of calling, success or not;
Step 2 periodically learns historical data, finds out the performance under the general situation of each system, is worked as N part samples of preceding system, error number in each pattern representation unit interval t, and the abnormal point in sample set is removed, it removes The process of abnormal point includes:
I. the mean value of current sample set is calculatedAnd standard deviation
Ii. all sample points more than u+3*std in sample set are found, calculate its number n1, and by these data from sample This concentration removes, and calculates the number n2 of the new samples collection after taking out above-mentioned sample point;
Iii. if meeting condition, the removal of abnormal point is completed, following steps are continued to execute;Otherwise, step i is executed;
Calculate the error number alarm line alertNum=u+std*3 of the system;
Whether step 3, the daily record in a timing acquiring nearest t period, the error number of each system of sequential analysis are different Often, whether each service error rate of each system is abnormal, whether example error rate is abnormal, error number between any two system Situation, and specifically made the following judgment after Ergodic Theory list:
If a) the cumulative error number of the system is more than the alarm threshold of the system, the system exception is marked;
B) each service for traversing the system judges whether its error rate is abnormal using abnormality judgment method;
C) each example for traversing the system, using abnormality judgment method, to judge whether its error rate is abnormal;
The abnormality judgment method includes:
A) correct number is denoted as tNum, the number of mistake is denoted as fNum, total call number num=tNum+fNum;
If b) fNum<First threshold returns normal;Otherwise in next step;
If c) num>=second threshold then in next step, is otherwise returned normally when fNum=first thresholds, is otherwise returned It is abnormal;
If d) fNum/num<Third threshold value returns normal;Otherwise in next step;
E) work as fNum<When tNum, the 4th threshold values of k=fNum+, otherwise the 5th threshold values of k=fNum-;
F) it calculates
If z>6th threshold value then returns to exception, otherwise returns normal;
Judge to complete final finishing data, calculates the mistake that each group of clientInstance calls serverInstance Number;
From system topological figure it is counter look into the corresponding system client of clientInstance and serverInstance and Server counts the cumulative error number that each group of client system calls server systems;
Step 4 is based on system topological figure, is illustrated in after the completion of alarm data analysis on system topological figure.
Further, step 1 further includes:
Setting a minimum alarm threshold: after the error count alarm line of the system is calculated, if alertNum is below the minimum alarm threshold, alertNum is set to the minimum alarm threshold.
Further, the termination condition in the outlier removal process of step 2 is:
n1 = 0, or (n-n2) > 30, or (n-n2) > n/3.
Further, step 4 further includes the following steps:
1. When a system is abnormal, add a warning mark to the system icon;
2. When a service or instance of a system is abnormal, show the error information in a pop-up layer when the system icon is clicked;
3. When the number of call errors between systems is not 0, draw a connecting line with a directional arrow.
Further, the width of the connecting line is related to the error count.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. By analyzing the service call logs between instances, the invention can monitor whether systems, their services, and their instances are abnormal, and it displays alarm information on the system topology graph. When displaying alarm information, the invention shows the state of each system, of the calls between systems, and of system services and instances on the topology graph, so that the faulty system can be located quickly when problems occur across a large number of systems.
2. The invention derives the alarm threshold from the normal behavior of a system over a past period; during alarm analysis, an alarm is raised only when the error count exceeds that threshold. For count-type alarms, it provides a way to set the alarm threshold automatically, which reduces manual work, improves alarm accuracy, and greatly reduces both false alarms and missed alarms. After a new system has been running online for some time, the invention can set an alarm threshold for it automatically.
3. The invention can analyze whether the services and instances of a system are abnormal; for ratio-type alarms, it improves alarm accuracy and reduces false alarms and missed alarms.
4. The monitoring method samples real traffic data, which gives broader coverage than periodic probing.
Description of the drawings
Fig. 1 is normal distribution schematic diagram.
Fig. 2 is journal format exemplary plot.
Fig. 3 is alarm threshold value analysis process figure.
Fig. 4 is a schematic diagram of a system's sample error counts obtained by calling the Logstash interface.
Fig. 5 is a diagram of the call data between instances.
Fig. 6 is the system topological figure for adding caution sign.
Fig. 7 is the system topological figure that bullet layer shows error message.
Detailed description
The technical solution provided by the invention is described in detail below with reference to specific embodiments. It should be understood that the following embodiments only illustrate the invention and do not limit its scope. In addition, the steps shown in the flowcharts of the drawings can be executed in a computer system, for example as a set of computer-executable instructions, and although a logical order is shown in the flowcharts, in some cases the steps shown or described can be performed in a different order.
We may assume that the error count of a system within a unit time t is influenced by many independent random factors, each of which normally has only a small effect, so the error count can be studied as a normally distributed random variable. The density function of the normal distribution with mean $\mu$ and standard deviation $\sigma$ is: $$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
By collecting performance data for a past period during which the system behaved normally, we can calculate the mean u and standard deviation std of the error count within a unit time t. Denote the error count in the most recent unit time t as failNum. As shown in Fig. 1, it is easy to calculate that the probability P(failNum>=u+3*std) is much smaller than 0.01, i.e. this is an extremely improbable event. So when we observe the error count of the system in the most recent unit time t and its value exceeds the mean plus three standard deviations, this must be an extreme case that needs human attention, and an alarm should be raised.
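The size of this tail probability can be checked numerically. The following is a minimal sketch, assuming an exactly normal distribution; the function name `normal_upper_tail` is hypothetical and not part of the patent.

```python
import math


def normal_upper_tail(k: float) -> float:
    """Return P(X >= u + k*std) for a normally distributed X.

    Uses the identity P(Z >= k) = 0.5 * erfc(k / sqrt(2)) for a
    standard normal Z.
    """
    return 0.5 * math.erfc(k / math.sqrt(2))


if __name__ == "__main__":
    # Probability of exceeding the mean by three standard deviations.
    print(normal_upper_tail(3))
```

For k = 3 the result is roughly 0.00135, which is consistent with the statement that the probability is much smaller than 0.01.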
When we study the error rate of systems and services through observed error counts, it is easy to see the following: even if the highest acceptable error rate p0 is 0.01, observing more than 1 failed call among 100 actual calls does not show that the system has a problem, because this is an event with a fairly high probability.
When the number of calls is small (here we take fewer than 40), we calculate the conditional probability p1 that the error count failNum observed in n calls exceeds failLevel, given that the system's intrinsic error rate p is no higher than p0. This probability is largest when p = p0, which gives: $$p_1 = P(failNum > failLevel \mid p = p_0) = \sum_{i=failLevel+1}^{n} \binom{n}{i}\, p_0^{\,i}\, (1-p_0)^{\,n-i}$$
We call an event with a probability below 0.05 a small-probability event. In a small, finite number of trials a small-probability event should not occur; that is, when such an event does occur, we can no longer believe that p is no higher than p0 and must conclude that p is higher than p0. The system error rate is then too high and an alarm should be raised. By numerical calculation we found the critical values of failLevel for which p1<0.05: when n<=5, the critical value of failLevel is 0; when 5<n<=35, the critical value is 1; when 35<n<40, the critical value is 2. In other words, when n calls are observed and the error count is higher than the corresponding failLevel, a small-probability event has occurred and needs attention; if it is not higher, the system is considered normal. For ease of processing, we uniformly set failLevel to 1 for n<40; in practice the resulting error is within an acceptable range.
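The critical values quoted above can be reproduced with a short calculation. The following is a sketch that assumes failNum follows a binomial distribution with parameters n and p0 = 0.01; the function names are hypothetical and not taken from the patent.

```python
from math import comb


def tail_probability(n: int, fail_level: int, p0: float = 0.01) -> float:
    """P(failNum > fail_level) for n calls with error rate p0 (binomial tail)."""
    return sum(
        comb(n, i) * p0**i * (1 - p0) ** (n - i)
        for i in range(fail_level + 1, n + 1)
    )


def critical_fail_level(n: int, p0: float = 0.01, alpha: float = 0.05) -> int:
    """Smallest fail_level whose tail probability drops below alpha."""
    level = 0
    while tail_probability(n, level, p0) >= alpha:
        level += 1
    return level


if __name__ == "__main__":
    for n in (5, 6, 35, 36, 39):
        print(n, critical_fail_level(n))
```

Under these assumptions the boundaries fall at n = 5 and n = 35, matching the ranges given in the text.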
When the number of calls is large (here, no fewer than 40), let p1 be the observed error rate and p the intrinsic error rate of the system, which is no higher than p0 under normal conditions. By the central limit theorem, p1 approximately follows a normal distribution with mean p and variance p(1-p)/n, i.e. the statistic $\frac{p_1 - p}{\sqrt{p(1-p)/n}}$ follows the standard normal distribution. When p<=p0, $\frac{p_1 - p_0}{\sqrt{p_0(1-p_0)/n}}$ approximately follows the standard normal distribution. From the standard normal table, the probability that $\frac{p_1 - p_0}{\sqrt{p_0(1-p_0)/n}} > 1.645$ is less than 0.05, so this is a small-probability event that needs attention and should raise an alarm. For convenience of application, we rewrite the statistic as $\frac{n p_1 - n p_0}{\sqrt{n p_0 (1-p_0)}}$, where n*p1 is exactly the observed error count.
The corresponding abnormality judgment method includes the following steps:
a) Denote the number of successful calls as tNum and the number of failed calls as fNum; the total number of calls is num=tNum+fNum;
b) If fNum<1, return normal; otherwise go to the next step;
c) If num>=40, go to the next step; otherwise return normal if fNum=1, and return abnormal if not;
d) If fNum/num<0.01, return normal; otherwise go to the next step;
e) If fNum<tNum, k=fNum+0.5; otherwise k=fNum-0.5 (because the normal distribution is only an approximation here, this correction brings the statistic closer to a normal distribution);
f) Calculate $z = \frac{k - num \cdot 0.01}{\sqrt{num \cdot 0.01 \cdot (1 - 0.01)}}$;
If z>1.645, return abnormal; otherwise return normal.
Each value used in the abnormality judgment method can be adjusted as needed.
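Steps a) to f) can be sketched as a single function. This is an illustrative sketch rather than the patented implementation: the function and parameter names are hypothetical, and the z statistic is written as (k - num*p0)/sqrt(num*p0*(1-p0)), a form inferred from the derivation above with p0 = 0.01.

```python
import math


def is_abnormal(
    t_num: int,
    f_num: int,
    min_errors: int = 1,       # first threshold
    min_calls: int = 40,       # second threshold
    p0: float = 0.01,          # third threshold (highest acceptable error rate)
    correction: float = 0.5,   # fourth and fifth thresholds
    z_limit: float = 1.645,    # sixth threshold
) -> bool:
    """Return True when the error rate is judged abnormal."""
    num = t_num + f_num                      # step a
    if f_num < min_errors:                   # step b
        return False
    if num < min_calls:                      # step c: small-sample rule
        return f_num != min_errors
    if f_num / num < p0:                     # step d
        return False
    # step e: correction applied to the observed error count
    k = f_num + correction if f_num < t_num else f_num - correction
    # step f: normal approximation of the binomial error count
    z = (k - num * p0) / math.sqrt(num * p0 * (1 - p0))
    return z > z_limit


if __name__ == "__main__":
    print(is_abnormal(10, 1))     # few calls, single failure
    print(is_abnormal(10, 2))     # few calls, two failures
    print(is_abnormal(1000, 20))  # many calls, about 2% failures
```

Keeping every threshold as a keyword argument reflects the remark that each value in the method can be adjusted as needed.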
The system for monitoring system availability based on statistical methods provided by the invention comprises: an inter-system service call log module, an alarm threshold analysis module, an alarm analysis module, and a monitoring alarm display module. The service list and instance list of each system can be obtained directly from the system topology graph. The applicant's patent application No. 2017109039551, entitled "System and method for automatically drawing system deployment and dependency relationships", describes the service list, the instance list, and the logs related to services and instances in more detail.
The inter-system service call log module collects and records log information for all calls between systems. Specifically, we call a concrete deployment of a system (Application) on a server an instance (Instance); an instance is uniquely identified by the IP of its server and the port number it occupies. After one instance calls a service of another instance, the caller records a call log (as shown in the figure) that contains: call time (startTime), caller IP (consumerIp) and port number (consumerPort), callee IP (serviceIp) and port number (servicePort), identifier of the called service (serviceName), and whether the call succeeded (success). The inter-system service call log module uses the open-source tool Logstash to store these logs, and the data can be saved within 2 seconds after a call takes place. The stored logs are shown in Fig. 2.
The alarm threshold analysis module periodically learns from historical data to find the typical behavior of each system. The specific process is shown in Fig. 3 and includes the following steps:
1. Traverse the system list:
a) Obtain the list of all services of the current system;
b) Call the Logstash interface to fetch the cumulative error counts of all services of the system over the most recent time range n*t, and divide them into n parts in units of t. This gives n samples for the current system, each describing the error count within one unit time t, as shown in Fig. 4;
c) Remove the outliers from the sample set:
i. Calculate the mean u and the standard deviation std of the current sample set;
ii. Find all sample points in the sample set that exceed u+3*std, count them as n1, remove them from the sample set, and count the size n2 of the new sample set that remains. This step finds and rejects historical abnormal situations so that they do not distort the assessment of the system's typical behavior;
iii. If n1=0, or (n-n2)>30, or (n-n2)>n/3, outlier removal is complete and step d) is executed; otherwise, return to step i;
d) Calculate the mean u and standard deviation std of the new sample set;
Calculate the error count alarm line of the system: alertNum=u+std*3. If alertNum<100, set alertNum to 100.
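Steps c) and d) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `alert_line` is hypothetical, and the population standard deviation is an assumption, since the text does not say whether the population or sample formula is used.

```python
from statistics import fmean, pstdev


def alert_line(samples: list[float], floor: float = 100.0) -> float:
    """Iteratively trim 3-sigma outliers, then return max(u + 3*std, floor)."""
    if not samples:
        raise ValueError("at least one sample is required")
    n = len(samples)                 # original sample count
    current = list(samples)
    while True:
        u, std = fmean(current), pstdev(current)               # step i
        kept = [x for x in current if x <= u + 3 * std]        # step ii
        n1 = len(current) - len(kept)    # points removed in this pass
        n2 = len(kept)                   # size of the new sample set
        current = kept
        # step iii: stop when nothing was removed or too much has been removed
        if n1 == 0 or (n - n2) > 30 or (n - n2) > n / 3:
            break
    u, std = fmean(current), pstdev(current)                   # step d
    return max(u + 3 * std, floor)


if __name__ == "__main__":
    history = [10] * 50 + [1000]     # one historical incident among quiet periods
    print(alert_line(history))
    print(alert_line(history, floor=0))
```

In this example the single spike of 1000 is removed in the first pass, so the alarm line is determined by the quiet periods and then raised to the floor of 100.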
The alarm analysis module periodically collects (for example once per minute; the interval can be adjusted as needed) the logs of the most recent period t and analyzes, in turn, whether the error count of each system is abnormal, whether the error rate of each service of each system is abnormal, whether the error rate of each instance is abnormal, and the error counts between any two systems. The specific method is as follows:
1. Extract the logs of the most recent time t from Logstash, i.e. enumerate all existing relationships of the form "instance a called service c of instance b, with a number of successes and a number of failures", as shown in Fig. 5; denote the result as data;
2. Organize data: merge the fields consumerIp and consumerPort into the field clientInstance, and merge the fields serviceIp and servicePort into the field serverInstance;
3. Organize data: calculate the cumulative success count and cumulative error count for each serverInstance and each serviceName;
4. Traverse the system list:
a) Count the cumulative error count of each system, i.e. the sum of the error counts of all services of the system;
b) If the cumulative error count of the system exceeds the system's alarm threshold alertNum, mark the system as abnormal;
c) Traverse each service (serviceName) of the system and apply the abnormality judgment method described above to the cumulative success count and cumulative error count of the serviceName to decide whether it is abnormal;
d) Traverse each instance (serverInstance) of the system and apply the abnormality judgment method described above to the cumulative success count and cumulative error count of the serverInstance to decide whether it is abnormal;
5. Organize data: calculate the error count for each pair in which a clientInstance calls a serverInstance;
Look up in the system topology graph the systems client and server that correspond to clientInstance and serverInstance, and count the cumulative error count for each pair in which a client system calls a server system.
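The data organization in steps 2, 3 and 5 can be sketched as below. Several details are assumptions made for illustration: each record is a dictionary, the count fields are named `successCount` and `failCount`, an instance key is formatted as `ip:port`, and the topology lookup is a plain mapping from instance key to system name. None of these names come from the patent apart from the consumer/service fields and serviceName.

```python
from collections import defaultdict


def aggregate_calls(records, instance_to_system):
    """Aggregate call records per service, per instance, and per caller/callee pair."""
    per_service = defaultdict(lambda: [0, 0])    # serviceName -> [ok, failed]
    per_instance = defaultdict(lambda: [0, 0])   # serverInstance -> [ok, failed]
    instance_errors = defaultdict(int)           # (clientInstance, serverInstance) -> failed
    system_errors = defaultdict(int)             # (client system, server system) -> failed

    for r in records:
        # Step 2: merge IP and port into instance identifiers.
        client = f"{r['consumerIp']}:{r['consumerPort']}"
        server = f"{r['serviceIp']}:{r['servicePort']}"
        ok, failed = r["successCount"], r["failCount"]

        # Step 3: cumulative counts per service and per server instance.
        per_service[r["serviceName"]][0] += ok
        per_service[r["serviceName"]][1] += failed
        per_instance[server][0] += ok
        per_instance[server][1] += failed

        # Step 5: errors per instance pair, then per system pair via the topology.
        instance_errors[(client, server)] += failed
        pair = (instance_to_system[client], instance_to_system[server])
        system_errors[pair] += failed

    return {
        "per_service": dict(per_service),
        "per_instance": dict(per_instance),
        "instance_errors": dict(instance_errors),
        "system_errors": dict(system_errors),
    }
```

The per-service and per-instance count pairs are the inputs the abnormality judgment method expects, while the system-pair totals drive the connecting lines drawn on the topology graph.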
The alarm display module is based on the system topology graph and displays the results on the topology graph once the alarm data analysis is complete:
1. When a system is abnormal, a warning mark is added to the system icon, as shown in Fig. 6;
2. When a service or instance of a system is abnormal, clicking the system icon shows the error information in a pop-up layer, as shown in Fig. 7;
3. When the number of call errors between systems is not 0, a connecting line with a directional arrow is drawn, and the width of the line is the logarithm of the error count. Other general formulas that take the error count as input can also be used to calculate the line width; any scheme in which the width or color of the line is related to the error count meets the requirements of the invention.
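One possible width mapping is sketched below. The base-10 logarithm and the added base width are assumptions chosen so that a single error still produces a visible line; the text only states that the width is the logarithm of the error count, and the function name is hypothetical.

```python
import math


def edge_width(error_count: int, base_width: float = 1.0) -> float:
    """Line width for a call edge; 0 means no line is drawn."""
    if error_count <= 0:
        return 0.0
    # Logarithmic scaling keeps very large error counts from dominating the graph.
    return base_width + math.log10(error_count)


if __name__ == "__main__":
    for errors in (0, 1, 10, 1000):
        print(errors, edge_width(errors))
```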
When system breaks down, we can easily find out from figure:Which system there is a problem, influence Which system, which example of system and service there is a problem.
The present invention also provides the methods based on statistical method monitoring system availability, including service call daily record between system Step;Alarm threshold value analytical procedure;Alert analysis step;Monitoring alarm shows step;Service call daily record step is held between system The content that service call journal module is realized between row system, alarm threshold value analytical procedure execute what alarm threshold value analysis module was realized Content, alert analysis step execute the content that alert analysis module is realized, monitoring alarm shows that step executes monitoring alarm displaying The content that module is realized.
The technical means disclosed in the embodiments of the present invention are not limited to those disclosed in the above embodiments, and also include technical solutions formed by any combination of the above technical features. It should be noted that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications are also regarded as falling within the protection scope of the present invention.

Claims (10)

1. A system for monitoring system availability based on a statistical method, characterized by comprising: an inter-system service call log module, an alarm threshold analysis module, an alert analysis module, and a monitoring alarm display module;
the inter-system service call log module is used to collect and record log information for all calls between systems: call time, caller IP and port number, callee IP and port number, identifier of the called service, and whether the call succeeded;
the alarm threshold analysis module is used to periodically learn from historical data and find the behavior of each system under normal conditions: it obtains n samples for the current system, each sample describing the number of errors in a unit time t, and removes the abnormal points from the sample set, the process of removing abnormal points comprising:
i. calculating the mean u and the standard deviation std of the current sample set;
ii. finding all sample points in the sample set that are greater than u+3*std, counting their number n1, removing these points from the sample set, and counting the number n2 of samples in the new sample set after removal;
iii. if the condition is met, the removal of abnormal points is complete and the following steps are executed; otherwise, step i is executed again;
calculating the error-count alarm line of the system: alertNum=u+std*3;
the alert analysis module is used to collect, at regular intervals, the logs of the most recent period t and to analyze in turn whether the error count of each system is abnormal, whether the error rate of each service of each system is abnormal, whether the error rate of each instance is abnormal, and the error counts between any two systems; specifically, it traverses the system list and makes the following judgments for each system:
a) if the cumulative error count of the system exceeds the alarm threshold of the system, marking the system as abnormal;
b) traversing each service of the system and using the anomaly judgment method to judge whether its error rate is abnormal;
c) traversing each instance of the system and using the anomaly judgment method to judge whether its error rate is abnormal;
the anomaly judgment method comprises:
a) denoting the number of correct calls as tNum and the number of erroneous calls as fNum, the total number of calls being num=tNum+fNum;
b) if fNum < a first threshold, returning normal; otherwise proceeding to the next step;
c) if num >= a second threshold, proceeding to the next step; otherwise, returning normal if fNum equals the first threshold and returning abnormal if it does not;
d) if fNum/num < a third threshold, returning normal; otherwise proceeding to the next step;
e) when fNum<tNum, k = fNum + a fourth threshold; otherwise k = fNum - a fifth threshold;
f) calculating z;
if z > a sixth threshold, returning abnormal; otherwise returning normal;
after the judgments are complete, collating the data and calculating the number of errors for each group of clientInstance calling serverInstance;
looking up in reverse, from the system topology graph, the systems client and server to which clientInstance and serverInstance respectively belong, and counting the cumulative number of errors for each group of client system calling server system;
the alarm display module is used to display the results on the system topology graph after the alarm data analysis is complete.
2. The system for monitoring system availability based on a statistical method according to claim 1, characterized in that: the alarm threshold analysis module is further used to set an alarm threshold; after the error-count alarm line of the system is calculated, if alertNum < the alarm threshold, alertNum is set to the alarm threshold.
3. The system for monitoring system availability based on a statistical method according to claim 1, characterized in that the condition in the process of removing abnormal points is as follows:
n1=0 or (n-n2)>30 or (n-n2)>n/3.
4. The system for monitoring system availability based on a statistical method according to claim 1, characterized in that the alarm display module is further used to:
1. when a system is abnormal, add a warning mark to the system icon;
2. when a service or instance of a system is abnormal, show the error information in a pop-up layer when the system icon is clicked;
3. when the number of call errors between systems is not 0, draw a connecting line and a directional arrow.
5. The system for monitoring system availability based on a statistical method according to claim 4, characterized in that: the width of the line is related to the error count.
6. A method for monitoring system availability based on a statistical method, characterized by comprising the following steps:
Step 1: collecting and recording log information for all calls between systems: call time, caller IP and port number, callee IP and port number, identifier of the called service, and whether the call succeeded;
Step 2: periodically learning from historical data to find the behavior of each system under normal conditions; obtaining n samples for the current system, each sample describing the number of errors in a unit time t, and removing the abnormal points from the sample set, the process of removing abnormal points comprising:
i. calculating the mean u and the standard deviation std of the current sample set;
ii. finding all sample points in the sample set that are greater than u+3*std, counting their number n1, removing these points from the sample set, and counting the number n2 of samples in the new sample set after removal;
iii. if the condition is met, the removal of abnormal points is complete and the following steps are executed; otherwise, step i is executed again;
calculating the error-count alarm line of the system: alertNum=u+std*3;
Step 3: collecting, at regular intervals, the logs of the most recent period t, and analyzing in turn whether the error count of each system is abnormal, whether the error rate of each service of each system is abnormal, whether the error rate of each instance is abnormal, and the error counts between any two systems; specifically, traversing the system list and making the following judgments for each system:
a) if the cumulative error count of the system exceeds the alarm threshold of the system, marking the system as abnormal;
b) traversing each service of the system and using the anomaly judgment method to judge whether its error rate is abnormal;
c) traversing each instance of the system and using the anomaly judgment method to judge whether its error rate is abnormal;
the anomaly judgment method comprises:
a) denoting the number of correct calls as tNum and the number of erroneous calls as fNum, the total number of calls being num=tNum+fNum;
b) if fNum < a first threshold, returning normal; otherwise proceeding to the next step;
c) if num >= a second threshold, proceeding to the next step; otherwise, returning normal if fNum equals the first threshold and returning abnormal if it does not;
d) if fNum/num < a third threshold, returning normal; otherwise proceeding to the next step;
e) when fNum<tNum, k = fNum + a fourth threshold; otherwise k = fNum - a fifth threshold;
f) calculating z;
if z > a sixth threshold, returning abnormal; otherwise returning normal;
after the judgments are complete, collating the data and calculating the number of errors for each group of clientInstance calling serverInstance;
looking up in reverse, from the system topology graph, the systems client and server to which clientInstance and serverInstance respectively belong, and counting the cumulative number of errors for each group of client system calling server system;
Step 4: based on the system topology graph, displaying the results on the system topology graph after the alarm data analysis is complete.
7. The method for monitoring system availability based on a statistical method according to claim 6, characterized in that step 1 further comprises:
setting an alarm threshold; after the error-count alarm line of the system is calculated, if alertNum < the alarm threshold, alertNum is set to the alarm threshold.
8. The method for monitoring system availability based on a statistical method according to claim 6, characterized in that the condition in the process of removing abnormal points in step 2 is as follows:
n1=0 or (n-n2)>30 or (n-n2)>n/3.
9. The method for monitoring system availability based on a statistical method according to claim 6, characterized in that step 4 further comprises the following steps:
1. when a system is abnormal, adding a warning mark to the system icon;
2. when a service or instance of a system is abnormal, showing the error information in a pop-up layer when the system icon is clicked;
3. when the number of call errors between systems is not 0, drawing a connecting line and a directional arrow.
10. The method for monitoring system availability based on a statistical method according to claim 9, characterized in that: the width of the line is related to the error count.
CN201810150782.5A 2018-02-13 System and method for monitoring system availability based on statistical method Active CN108599977B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810150782.5A CN108599977B (en) 2018-02-13 System and method for monitoring system availability based on statistical method

Publications (2)

Publication Number Publication Date
CN108599977A true CN108599977A (en) 2018-09-28
CN108599977B CN108599977B (en) 2021-09-28

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102299897A (en) * 2010-06-23 2011-12-28 电子科技大学 Characteristic-association-based peer-to-peer networking characteristic analysis method
CN102932466A (en) * 2012-11-07 2013-02-13 网宿科技股份有限公司 Distributed type source monitoring method and distributed type source monitoring system based on content delivery network
CN103514259A (en) * 2013-08-13 2014-01-15 江苏华大天益电力科技有限公司 Abnormal data detection and modification method based on numerical value relevance model
US20140115400A1 (en) * 2012-10-23 2014-04-24 Electronics And Telecommunications Research Institute Device and method for fault management of smart device
CN106407082A (en) * 2016-09-30 2017-02-15 国家电网公司 Method and device for alarming information system
CN107612756A (en) * 2017-10-31 2018-01-19 广西宜州市联森网络科技有限公司 A kind of operation management system with intelligent trouble analyzing and processing function

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109617740A (en) * 2018-12-28 2019-04-12 广东亿迅科技有限公司 A kind of method and device that application failure quickly positions
CN110086682A (en) * 2019-05-22 2019-08-02 四川新网银行股份有限公司 Service link call relation view and failure root based on TCP are because of localization method
CN111510351A (en) * 2020-04-10 2020-08-07 星辰天合(北京)数据科技有限公司 Anomaly detection method and device based on Promissuris monitoring system
CN111510351B (en) * 2020-04-10 2021-09-14 星辰天合(北京)数据科技有限公司 Anomaly detection method and device based on Promissuris monitoring system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant