CN108599977A - System and method for monitoring system availability based on statistical methods - Google Patents
- Publication number
- CN108599977A (application CN201810150782.5A)
- Authority
- CN
- China
- Prior art keywords
- abnormal
- fnum
- alarm
- error number
- error
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0631—Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0631—Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
- H04L41/065—Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis involving logical or physical relationship, e.g. grouping and hierarchies
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0681—Configuration of triggering conditions
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/069—Management of faults, events, alarms or notifications using logs of notifications; Post-processing of notifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/22—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks comprising specially adapted graphical user interfaces [GUI]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/08—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
- H04L43/0805—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
- H04L43/0817—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking functioning
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/16—Threshold monitoring
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Human Computer Interaction (AREA)
- Environmental & Geological Engineering (AREA)
- Telephonic Communication Services (AREA)
- Debugging And Monitoring (AREA)
Abstract
The present invention proposes a system and method for monitoring system availability based on statistical methods. The system includes an inter-system service call log module, an alarm threshold analysis module, an alert analysis module, and a monitoring alarm display module. Call logs between systems are collected, and historical data is periodically analyzed to learn the typical behavior of each system. Data from the most recent unit time t is then analyzed to determine whether the current error count of each system is abnormal, whether the error rate of calls between systems is abnormal, and whether the availability of each service and each instance of a system is abnormal. Abnormal systems and the call relationships between abnormal systems are marked as alerts on the system topology graph. When displaying alert information, the invention shows the system state, the state of calls between systems, and the state of system services and instances on the system topology graph, so that the faulty system can be located quickly when problems occur across a large number of systems.
Description
Technical field
The invention belongs to the technical field of software system monitoring and relates to a system and method for monitoring system availability based on statistical methods.
Background
Internet companies typically run a large number of application systems. Besides public-facing websites and apps, many internal application systems support the operation and management of the business. The call relationships between internal application systems are often complex; a function that one system exposes for another system to call is referred to as a service. The industry generally monitors application system availability by the following means:
Method one: Use tools such as zabbix to monitor certain indicators of the system's servers, such as the number of processes/threads of the web system, CPU load, free memory, the number of abnormal HTTP status codes, and request response time. An alarm is raised when an indicator exceeds its set threshold.
Method two: Simulate a client that calls the system periodically and check whether indicators such as the server's response content and speed meet the set thresholds. An alarm is raised when an indicator exceeds its set threshold.
However, the existing monitoring approaches have several defects:
1. The thresholds in methods one and two must be set manually. Thresholds differ widely between systems and also between different periods of the same system, so setting and maintaining them is a heavy workload. In practice a trial-and-error approach is common: the threshold is relaxed after a false alarm and tightened after a missed alarm, which leaves both the false alarm rate and the missed alarm rate high.
2. Method one reflects availability only partially and cannot serve as a real availability indicator: a detected anomaly does not necessarily mean that availability has dropped, and an unavailable system does not always show up in these indicators.
3. Method two reflects availability directly, but as a spot-check technique it has a small sample size and narrow coverage; it can generally monitor only read operations and is rarely used for write operations.
4. When there are many systems or the systems are complex, both methods involve too many indicators, produce too many alarms, and generate a lot of alarm noise, which hinders the diagnosis and localization of problems.
5. When a new system or service goes live, or the deployment of systems and services changes, both methods require the monitored items to be maintained manually, so they are unsuitable for systems with automatic failover or dynamically scaled service capacity.
6. For error rate alarms, a fixed threshold often causes false alarms. For example, when the error rate is required not to exceed 1%, a single operation that fails (error rate 100%) triggers an alert, although in most cases no alarm is needed.
7. When several systems of a complex cluster fail simultaneously, it is hard to quickly locate the system that has actually failed; one can only try to deal with everything at once, wasting valuable time.
Summary of the invention
To solve the above problems, the present invention proposes a system and method for monitoring system availability based on statistical methods. Call logs between systems are collected, and historical data is periodically analyzed to learn the typical behavior of each system. Data from the most recent unit time t is analyzed to determine whether the current error count of each system is abnormal, whether the error rate of calls between systems is abnormal, and whether the availability of each service and each instance of a system is abnormal. Abnormal systems and the call relationships between abnormal systems are marked as alerts on the system topology graph.
To achieve the above object, the present invention provides the following technical solutions:
A system for monitoring system availability based on statistical methods, including: an inter-system service call log module, an alarm threshold analysis module, an alert analysis module, and a monitoring alarm display module.
The inter-system service call log module collects and records the log information of all calls between systems: call time, caller IP and port number, callee IP and port number, identifier of the called service, and whether the call succeeded.
The alarm threshold analysis module periodically learns from historical data to find the typical behavior of each system. It obtains n samples for the current system, each sample describing the error count within a unit time t, and removes the outliers from the sample set. The outlier removal process includes:
i. Compute the mean u and the standard deviation std of the current sample set.
ii. Find all sample points in the sample set that exceed u+3*std, count them as n1, remove them from the sample set, and count the size n2 of the new sample set that remains.
iii. If the stopping condition is met, outlier removal is complete and the following steps continue; otherwise, return to step i.
Compute the system's error count alarm line alertNum=u+std*3.
The alert analysis module periodically collects the logs of the most recent period t and analyzes, in order, whether the error count of each system is abnormal, whether the error rate of each service of each system is abnormal, whether the error rate of each instance is abnormal, and the error counts between any two systems. Specifically, it traverses the system list and makes the following judgments for each system:
a) If the system's cumulative error count exceeds the system's alarm threshold, mark the system as abnormal.
b) Traverse each service of the system and use the abnormality judgment method to decide whether its error rate is abnormal.
c) Traverse each instance of the system and use the abnormality judgment method to decide whether its error rate is abnormal.
The abnormality judgment method includes:
a) Denote the number of successful calls as tNum and the number of failed calls as fNum; the total number of calls is num=tNum+fNum.
b) If fNum < first threshold, return normal; otherwise go to the next step.
c) If num >= second threshold, go to the next step; otherwise return normal when fNum = first threshold and abnormal otherwise.
d) If fNum/num < third threshold, return normal; otherwise go to the next step.
e) When fNum < tNum, k = fNum + fourth threshold; otherwise k = fNum - fifth threshold.
f) Compute the test statistic z.
If z > sixth threshold, return abnormal; otherwise return normal.
After the judgments are complete, collate the data and compute the number of errors for each pair in which a clientInstance calls a serverInstance.
Look up, from the system topology graph, the systems client and server to which clientInstance and serverInstance belong, and count the cumulative errors for each pair in which a client system calls a server system.
The alarm display module is based on the system topology graph and displays the results on the system topology graph after the alarm data analysis is complete.
Further, the alarm threshold analysis module is also used to set a minimum alarm threshold: after the system's error count alarm line is computed, if alertNum < the alarm threshold, alertNum is set to the alarm threshold.
Further, the stopping condition in the outlier removal process is:
n1=0 or (n-n2)>30 or (n-n2)>n/3.
Further, the alarm display module is also used to:
1. When a system is abnormal, add a warning mark to the system icon.
2. When a service or instance of a system is abnormal, show the error information in a pop-up layer when the system icon is clicked.
3. When the number of call errors between systems is not 0, draw a connecting line with a directional arrow.
Further, the width of the line is related to the error count.
A method for monitoring system availability based on statistical methods, including the following steps:
Step 1: Collect and record the log information of all calls between systems: call time, caller IP and port number, callee IP and port number, identifier of the called service, and whether the call succeeded.
Step 2: Periodically learn from historical data to find the typical behavior of each system. Obtain n samples for the current system, each sample describing the error count within a unit time t, and remove the outliers from the sample set. The outlier removal process includes:
i. Compute the mean u and the standard deviation std of the current sample set.
ii. Find all sample points in the sample set that exceed u+3*std, count them as n1, remove them from the sample set, and count the size n2 of the new sample set that remains.
iii. If the stopping condition is met, outlier removal is complete and the following steps continue; otherwise, return to step i.
Compute the system's error count alarm line alertNum=u+std*3.
Step 3: Periodically collect the logs of the most recent period t and analyze, in order, whether the error count of each system is abnormal, whether the error rate of each service of each system is abnormal, whether the error rate of each instance is abnormal, and the error counts between any two systems. Specifically, traverse the system list and make the following judgments for each system:
a) If the system's cumulative error count exceeds the system's alarm threshold, mark the system as abnormal.
b) Traverse each service of the system and use the abnormality judgment method to decide whether its error rate is abnormal.
c) Traverse each instance of the system and use the abnormality judgment method to decide whether its error rate is abnormal.
The abnormality judgment method includes:
a) Denote the number of successful calls as tNum and the number of failed calls as fNum; the total number of calls is num=tNum+fNum.
b) If fNum < first threshold, return normal; otherwise go to the next step.
c) If num >= second threshold, go to the next step; otherwise return normal when fNum = first threshold and abnormal otherwise.
d) If fNum/num < third threshold, return normal; otherwise go to the next step.
e) When fNum < tNum, k = fNum + fourth threshold; otherwise k = fNum - fifth threshold.
f) Compute the test statistic z.
If z > sixth threshold, return abnormal; otherwise return normal.
After the judgments are complete, collate the data and compute the number of errors for each pair in which a clientInstance calls a serverInstance.
Look up, from the system topology graph, the systems client and server to which clientInstance and serverInstance belong, and count the cumulative errors for each pair in which a client system calls a server system.
Step 4: Based on the system topology graph, display the results on the system topology graph after the alarm data analysis is complete.
Further, step 1 also includes:
Setting a minimum alarm threshold: after the system's error count alarm line is computed, if alertNum < the alarm threshold, alertNum is set to the alarm threshold.
Further, the stopping condition in the outlier removal process of step 2 is:
n1=0 or (n-n2)>30 or (n-n2)>n/3.
Further, step 4 also includes the following steps:
1. When a system is abnormal, add a warning mark to the system icon.
2. When a service or instance of a system is abnormal, show the error information in a pop-up layer when the system icon is clicked.
3. When the number of call errors between systems is not 0, draw a connecting line with a directional arrow.
Further, the width of the line is related to the error count.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. By analyzing the service call logs between instances, the invention monitors whether systems, their services, and their instances are abnormal, and displays alert information on the system topology graph. When displaying alert information, it shows the system state, the state of calls between systems, and the state of system services and instances on the topology graph, so that the faulty system can be located quickly when problems occur across a large number of systems.
2. The invention derives the alarm threshold from the normal behavior of a system over a past period; during alert analysis an alert is raised only when the error count exceeds this threshold. For count-based alerts, it provides a method for setting the alarm threshold automatically, which reduces manual work, improves alert accuracy, and greatly reduces both false alarms and missed alarms. After a new system has been running online for some time, the invention can set an alarm threshold for it automatically.
3. For ratio-based alerts, the invention checks whether the services and instances of a system are abnormal, improving alert accuracy and reducing false alarms and missed alarms.
4. The monitoring method samples real traffic data, giving more comprehensive coverage than periodic sampling.
Brief description of the drawings
Fig. 1 is a schematic diagram of the normal distribution.
Fig. 2 is an example of the log format.
Fig. 3 is a flow chart of the alarm threshold analysis.
Fig. 4 is a schematic diagram of the sample error counts of a system obtained by calling the Logstash interface.
Fig. 5 shows the call data between instances.
Fig. 6 is the system topology graph with warning marks added.
Fig. 7 is the system topology graph with a pop-up layer showing error information.
Detailed description
The technical solution provided by the invention is described in detail below with reference to specific embodiments. It should be understood that the following embodiments only illustrate the invention and do not limit its scope. In addition, the steps shown in the flow charts of the drawings can be executed in a computer system such as a set of computer-executable instructions, and although a logical order is shown in the flow charts, in some cases the steps shown or described can be performed in a different order.
We consider that the error count of a system within a unit time t is influenced by many independent random factors, each of which has only a small effect under normal circumstances, so the error count can be studied as a normally distributed random variable. With mean u and standard deviation std, the density function of the normal distribution is:
$$f(x) = \frac{1}{\sqrt{2\pi}\,std}\, e^{-\frac{(x-u)^2}{2\,std^2}}$$
By collecting performance data of the system over a past period under normal conditions, we can compute the mean u and standard deviation std of the error count within a unit time t. Denote the error count in the most recent unit time t as failNum. As shown in Fig. 1, the probability P(failNum >= u+3*std) is easily computed to be far below 0.01, i.e. this is an extremely unlikely event. Therefore, when the error count observed in the most recent unit time t exceeds the mean plus three standard deviations, the situation is extreme, needs human attention, and should trigger an alert.
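The three-sigma claim above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name `upper_tail_probability` is chosen here for illustration and does not come from the patent.

```python
import math


def upper_tail_probability(k: float) -> float:
    """P(X >= u + k*std) for a normally distributed X; the value does not depend on u or std."""
    # For a standard normal Z, P(Z >= k) = 0.5 * erfc(k / sqrt(2)).
    return 0.5 * math.erfc(k / math.sqrt(2.0))


three_sigma_tail = upper_tail_probability(3.0)
print(f"P(failNum >= u + 3*std) = {three_sigma_tail:.5f}")
```

The result is on the order of 0.001, which is consistent with the statement that the event is far rarer than 0.01.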
When studying the error rate of systems and services through observed error counts, it is easy to see that even if the highest acceptable error rate p0 is 0.01, observing more than 1 failed call in 100 actual calls does not show that the system has a problem, because this is an event with fairly high probability.
When the number of calls is small (here we take fewer than 40), we compute the conditional probability p1 that the error count failNum observed in n calls exceeds failLevel, given that the system's intrinsic error rate p is not higher than p0 (evaluated at the worst case p = p0):
$$p1 = P(failNum > failLevel \mid p = p0) = 1 - \sum_{i=0}^{failLevel} \binom{n}{i}\, p0^{\,i}\, (1-p0)^{\,n-i}$$
We call an event with probability below 0.05 a small-probability event. In a small, limited number of trials a small-probability event should not occur; that is, when one does occur we can no longer believe that p is not higher than p0 and must conclude that p is higher than p0, so the system's error rate is too high and an alert should be raised. By numerical computation we found, for every n, the critical value of failLevel that makes p1<0.05: when n<=5 the critical value of failLevel is 0, when 5<n<=35 it is 1, and when 35<n<40 it is 2. That is, when n calls are observed, an error count above the corresponding failLevel means a small-probability event has occurred and needs attention; otherwise the system is considered normal. For ease of processing, we uniformly set failLevel to 1 for n<40; in practice the resulting error is within an acceptable range.
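The critical values quoted above can be reproduced with a short numerical search. The sketch below assumes that the error count follows a binomial distribution with the boundary error rate p0 = 0.01 and a significance level of 0.05; the function names are illustrative and not part of the patent.

```python
from math import comb


def tail_probability(n: int, fail_level: int, p0: float = 0.01) -> float:
    """P(failNum > fail_level) over n calls when the true error rate equals p0."""
    at_most = sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(fail_level + 1))
    return 1.0 - at_most


def critical_fail_level(n: int, p0: float = 0.01, alpha: float = 0.05) -> int:
    """Smallest failLevel for which exceeding it is a small-probability event (< alpha)."""
    level = 0
    while tail_probability(n, level, p0) >= alpha:
        level += 1
    return level


for n in (5, 6, 35, 36, 39):
    print(n, critical_fail_level(n))
```

Under these assumptions the search switches from 0 to 1 between n=5 and n=6, and from 1 to 2 between n=35 and n=36, matching the ranges stated in the text.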
When the number of calls is large (here, no fewer than 40), let the observed error rate be p1 and the system's intrinsic error rate be p, which is not higher than p0 under normal conditions. By the central limit theorem, p1 approximately follows a normal distribution with mean p and variance p(1-p)/n, so the statistic $(p1-p)/\sqrt{p(1-p)/n}$ follows the standard normal distribution. When p<=p0, taking the boundary case p=p0, $(p1-p0)/\sqrt{p0(1-p0)/n}$ approximately follows the standard normal distribution. From the standard normal table, the probability that $(p1-p0)/\sqrt{p0(1-p0)/n} > 1.645$ is less than 0.05, which is a small-probability event that deserves attention and an alert. For convenience of application, we transform the statistic into $(n \cdot p1 - n \cdot p0)/\sqrt{n \cdot p0\,(1-p0)}$, where n*p1 is exactly the observed error count.
The corresponding abnormality judgment method includes the following steps:
a) Denote the number of successful calls as tNum and the number of failed calls as fNum; the total number of calls is num=tNum+fNum.
b) If fNum<1, return normal; otherwise go to the next step.
c) If num>=40, go to the next step; otherwise return normal when fNum=1 and abnormal otherwise.
d) If fNum/num<0.01, return normal; otherwise go to the next step.
e) When fNum<tNum, k=fNum+0.5; otherwise k=fNum-0.5 (because the distribution is only approximately normal, this correction brings the statistic closer to the normal distribution).
f) Compute, with p0=0.01, $z = (k - num \cdot p0)/\sqrt{num \cdot p0\,(1-p0)}$.
If z>1.645, return abnormal; otherwise return normal.
Each value in the abnormality judgment method can be adjusted as needed.
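Steps a) to f) can be sketched as a single function. This is an illustrative sketch, not the patent's implementation: the expression for z is reconstructed from the derivation in the preceding paragraph, and the function and parameter names are assumptions introduced here.

```python
import math


def is_abnormal(t_num: int, f_num: int, p0: float = 0.01,
                min_calls: int = 40, z_limit: float = 1.645) -> bool:
    """Return True when the error rate implied by (t_num, f_num) is judged abnormal."""
    num = t_num + f_num                      # step a: total number of calls
    if f_num < 1:                            # step b: no failures at all
        return False
    if num < min_calls:                      # step c: few calls, tolerate exactly one failure
        return f_num > 1
    if f_num / num < p0:                     # step d: observed rate below the acceptable rate
        return False
    # step e: correction of 0.5 as described in the text
    k = f_num + 0.5 if f_num < t_num else f_num - 0.5
    # step f: z statistic (reconstructed from the derivation, an assumption)
    z = (k - num * p0) / math.sqrt(num * p0 * (1 - p0))
    return z > z_limit


print(is_abnormal(99, 1))   # one failure in 100 calls
print(is_abnormal(97, 3))   # three failures in 100 calls
```

With the default values, one failure in 100 calls is judged normal while three failures in 100 calls are judged abnormal, which agrees with the earlier remark that a single failure in 100 calls should not raise an alert.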
The system for monitoring system availability based on statistical methods provided by the invention includes: an inter-system service call log module, an alarm threshold analysis module, an alert analysis module, and a monitoring alarm display module. The service list and instance list of each system can be obtained directly from the system topology graph. The invention patent filed by the same applicant with application No. 2017109039551, entitled "System and method for automatically plotting system deployment and dependencies", describes the service list, the instance list, and the logs related to services and instances in more detail.
The inter-system service call log module collects and records the log information of all calls between systems. Specifically:
We refer to a concrete deployment of a system (Application) on a server as an instance (Instance); an instance is uniquely identified by the IP of its server and the port number the instance occupies. After one instance calls a service of another instance, the caller records a call log containing: call time (startTime), caller IP (consumerIp) and port number (consumerPort), callee IP (serviceIp) and port number (servicePort), identifier of the called service (serviceName), and whether the call succeeded (success). The inter-system service call log module stores these logs using the open-source tool Logstash, which can persist the data within 2 seconds after the call. The stored logs are shown in Fig. 2.
The alarm threshold analysis module periodically learns from historical data to find the typical behavior of each system. The implementation process is shown in Fig. 3 and includes the following steps:
1. Traverse the system list:
a) Obtain the list of all services of the current system.
b) Call the Logstash interface to fetch the cumulative error count of all services of the system over the most recent n*t time range, and split it into n parts of length t. This yields n samples for the current system, each describing the error count within a unit time t, as shown in Fig. 4.
c) Remove the outliers from the sample set:
i. Compute the mean u and the standard deviation std of the current sample set.
ii. Find all sample points in the sample set that exceed u+3*std, count them as n1, remove them from the sample set, and count the size n2 of the new sample set that remains. This step finds and removes historical abnormal situations so that they do not distort the assessment of the system's typical behavior.
iii. If n1=0 or (n-n2)>30 or (n-n2)>n/3, outlier removal is complete; continue with step d). Otherwise, return to step i.
d) Compute the mean u and standard deviation std of the new sample set.
Compute the system's error count alarm line alertNum=u+std*3. If alertNum<100, alertNum is set to 100.
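The threshold learning loop in steps c) and d) can be sketched as follows. This is a minimal illustration that assumes the population standard deviation (the text does not say which estimator is used) and a non-empty sample list; the function and parameter names are hypothetical.

```python
from statistics import fmean, pstdev


def learn_alert_threshold(samples, floor=100.0, max_removed=30):
    """Iteratively drop samples above u + 3*std, then return max(u + 3*std, floor)."""
    n = len(samples)                       # size of the original sample set
    current = list(samples)
    while True:
        u, std = fmean(current), pstdev(current)       # step i (population std is an assumption)
        limit = u + 3 * std
        kept = [x for x in current if x <= limit]      # step ii: remove points above the limit
        n1 = len(current) - len(kept)
        current = kept
        n2 = len(current)
        # step iii: stop when nothing was removed or too much has been removed overall
        if n1 == 0 or (n - n2) > max_removed or (n - n2) > n / 3:
            break
    u, std = fmean(current), pstdev(current)           # step d: statistics of the cleaned set
    return max(u + 3 * std, floor)


history = [10] * 99 + [10000]              # hypothetical samples with one historical incident
print(learn_alert_threshold(history))
print(learn_alert_threshold(history, floor=0.0))
```

In this example the single extreme sample is removed in the first pass, so the learned alarm line is driven by the ordinary samples and then raised to the floor of 100.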
The alert analysis module periodically (for example once per minute; the interval can be adjusted as needed) collects the logs of the most recent period t and analyzes, in order, whether the error count of each system is abnormal, whether the error rate of each service of each system is abnormal, whether the error rate of each instance is abnormal, and the error counts between any two systems. The specific method is as follows:
1. Extract the logs of the most recent time t from Logstash, i.e. enumerate every existing relationship of the form "instance a called service c of instance b, with a success count and a failure count", as shown in Fig. 5; denote the result as data.
2. Collate data: merge the fields consumerIp and consumerPort into the field clientInstance, and merge the fields serviceIp and servicePort into the field serverInstance.
3. Collate data: compute the cumulative success count and cumulative error count for each serverInstance and each serviceName.
4. Traverse the system list:
a) Count the cumulative error count of each system, i.e. the sum of the error counts of all services of the system.
b) If the system's cumulative error count exceeds the system's alarm threshold alertNum, mark the system as abnormal.
c) Traverse each service (serviceName) of the system and apply the abnormality judgment method described above to the cumulative success count and cumulative error count of the serviceName to decide whether it is abnormal.
d) Traverse each instance (serverInstance) of the system and apply the abnormality judgment method described above to the cumulative success count and cumulative error count of the serverInstance to decide whether it is abnormal.
5. Collate data: compute the number of errors for each pair in which a clientInstance calls a serverInstance.
Look up, from the system topology graph, the systems client and server to which clientInstance and serverInstance belong, and count the cumulative errors for each pair in which a client system calls a server system.
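The collation in steps 2, 3, and 5 amounts to grouping call records by a few keys. The sketch below uses the field names from the log description; the `CallRecord` class, the "ip:port" instance format, and the `system_of` mapping (a stand-in for the lookup in the system topology graph) are assumptions made for illustration, and the sample data is hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CallRecord:
    """One call log entry, using the field names listed in the description."""
    startTime: str
    consumerIp: str
    consumerPort: int
    serviceIp: str
    servicePort: int
    serviceName: str
    success: bool


def instance_id(ip: str, port: int) -> str:
    """Merge IP and port into one instance key (the "ip:port" format is an assumption)."""
    return f"{ip}:{port}"


def aggregate(records, system_of):
    """Count calls per service and per server instance, and errors per system pair."""
    service_counts = Counter()    # (serviceName, success) -> number of calls
    instance_counts = Counter()   # (serverInstance, success) -> number of calls
    edge_errors = Counter()       # (client system, server system) -> failed calls
    for r in records:
        client = instance_id(r.consumerIp, r.consumerPort)   # clientInstance
        server = instance_id(r.serviceIp, r.servicePort)     # serverInstance
        service_counts[(r.serviceName, r.success)] += 1
        instance_counts[(server, r.success)] += 1
        if not r.success:
            edge_errors[(system_of[client], system_of[server])] += 1
    return service_counts, instance_counts, edge_errors


# Hypothetical sample data
records = [
    CallRecord("2018-02-13T10:00:00", "10.0.0.1", 8080, "10.0.0.2", 9090, "getUser", True),
    CallRecord("2018-02-13T10:00:01", "10.0.0.1", 8080, "10.0.0.2", 9090, "getUser", False),
    CallRecord("2018-02-13T10:00:02", "10.0.0.1", 8080, "10.0.0.3", 9090, "getUser", False),
]
system_of = {"10.0.0.1:8080": "web", "10.0.0.2:9090": "user", "10.0.0.3:9090": "user"}
service_counts, instance_counts, edge_errors = aggregate(records, system_of)
print(edge_errors)
```

The per-service and per-instance counters supply the success and error counts that the abnormality judgment method takes as input, and the edge counter supplies the error counts drawn between systems.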
The alarm display module is based on the system topology graph and displays the results on the topology graph after alarm data analysis is complete:
1. When a system is abnormal, add a warning mark to the system icon, as shown in Fig. 6.
2. When a service or instance of a system is abnormal, show the error information in a pop-up layer when the system icon is clicked, as shown in Fig. 7.
3. When the number of call errors between systems is not 0, draw a connecting line with a directional arrow; the width of the line is the logarithm of the error count. Other general formulas that take the error count as input can also be used to compute the line width; any scheme in which the line width or color is related to the error count meets the requirements of the invention.
When systems fail, we can easily see from the graph which system has a problem, which systems are affected, and which instances and services of a system have a problem.
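The logarithmic line width in item 3 can be sketched as below. The text only states that the width is the logarithm of the error count; the base-10 logarithm and the added base width (so that a single error still produces a visible line) are assumptions introduced here, as is the function name.

```python
import math


def edge_width(error_count: int, base_width: float = 1.0) -> float:
    """Line width for a call edge; 0.0 means no line is drawn."""
    if error_count <= 0:
        return 0.0
    # Logarithmic scaling keeps edges with very large error counts from dominating the graph.
    return base_width + math.log10(error_count)


for errors in (0, 1, 10, 1000):
    print(errors, edge_width(errors))
```

A logarithmic scale is a reasonable fit here because error counts between systems can differ by several orders of magnitude during an incident.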
The invention also provides a method for monitoring system availability based on statistical methods, including an inter-system service call log step, an alarm threshold analysis step, an alert analysis step, and a monitoring alarm display step. The inter-system service call log step performs what the inter-system service call log module implements, the alarm threshold analysis step performs what the alarm threshold analysis module implements, the alert analysis step performs what the alert analysis module implements, and the monitoring alarm display step performs what the monitoring alarm display module implements.
The technical means disclosed in the solution of the invention are not limited to those disclosed in the above embodiments, but also include technical solutions formed by any combination of the above technical features. It should be pointed out that those of ordinary skill in the art can make various improvements and modifications without departing from the principle of the invention, and these improvements and modifications are also regarded as falling within the protection scope of the invention.
Claims (10)
1. A system for monitoring system availability based on a statistical method, characterized by comprising: an inter-system service call log module, an alarm threshold analysis module, an alert analysis module, and a monitoring alarm display module;
the inter-system service call log module is configured to collect and record log information for all calls between systems: the call time, the caller IP and port number, the callee IP and port number, the identifier of the called service, and whether the call succeeded;
the alarm threshold analysis module is configured to periodically learn from historical data to determine how each system behaves under normal conditions: it obtains n samples for the current system, each sample representing the number of errors within a unit time t, and removes the abnormal points from the sample set; the process of removing abnormal points includes:
i. calculate the mean u and the standard deviation std of the current sample set;
ii. find all sample points in the sample set that are greater than u+3*std, count their number n1, remove them from the sample set, and count the number n2 of samples in the new sample set after removal;
iii. if the condition is met, the removal of abnormal points is complete and the following steps continue; otherwise, return to step i;
calculate the error count alarm line of the system: alertNum=u+std*3;
the alert analysis module is configured to periodically collect the logs of the most recent period t and analyze in turn whether the error count of each system is abnormal, whether the error rate of each service of each system is abnormal, whether the error rate of each instance is abnormal, and the error counts between any two systems; specifically, it traverses the system list and makes the following judgments for each system:
a) if the accumulated error count of the system exceeds the alarm threshold of the system, mark the system as abnormal;
b) traverse each service of the system and use the abnormality judgment method to judge whether its error rate is abnormal;
c) traverse each instance of the system and use the abnormality judgment method to judge whether its error rate is abnormal;
the abnormality judgment method includes:
a) denote the number of correct calls as tNum and the number of erroneous calls as fNum; the total number of calls is num=tNum+fNum;
b) if fNum < the first threshold, return normal; otherwise go to the next step;
c) if num >= the second threshold, go to the next step; otherwise, return normal when fNum = the first threshold, and return abnormal in all other cases;
d) if fNum/num < the third threshold, return normal; otherwise go to the next step;
e) when fNum < tNum, k = fNum + the fourth threshold; otherwise k = fNum - the fifth threshold;
f) calculate the statistic z;
if z > the sixth threshold, return abnormal; otherwise return normal;
after the judgments are complete, the data are collated: the error count of each pair in which a clientInstance calls a serverInstance is calculated; the systems client and server corresponding to clientInstance and serverInstance are looked up in reverse from the system topology graph, and the accumulated error count of each pair in which a client system calls a server system is counted;
the alarm display module is configured to present the results on the system topology graph once the alarm data analysis is complete.
2. The system for monitoring system availability based on a statistical method according to claim 1, characterized in that: the alarm threshold analysis module is further configured to set an alarm threshold; after the error count alarm line of the system is calculated, if alertNum < the alarm threshold, alertNum is set to the alarm threshold.
3. The system for monitoring system availability based on a statistical method according to claim 1, characterized in that the condition in the process of removing abnormal points is as follows:
n1=0 or (n-n2)>30 or (n-n2)>n/3.
4. The system for monitoring system availability based on a statistical method according to claim 1, characterized in that the alarm display module is further configured to:
1, when a system is abnormal, add a warning mark to the system icon;
2, when a service or instance of a system is abnormal, show the error information in a pop-up layer when the system icon is clicked;
3, when the number of call errors between systems is not 0, draw a connecting line with a directional arrow.
5. The system for monitoring system availability based on a statistical method according to claim 4, characterized in that: the width of the connecting line is related to the error count.
6. A method for monitoring system availability based on a statistical method, characterized by comprising the following steps:
Step 1, collect and record log information for all calls between systems: the call time, the caller IP and port number, the callee IP and port number, the identifier of the called service, and whether the call succeeded;
Step 2, periodically learn from historical data to determine how each system behaves under normal conditions: obtain n samples for the current system, each sample representing the number of errors within a unit time t, and remove the abnormal points from the sample set; the process of removing abnormal points includes:
i. calculate the mean u and the standard deviation std of the current sample set;
ii. find all sample points in the sample set that are greater than u+3*std, count their number n1, remove them from the sample set, and count the number n2 of samples in the new sample set after removal;
iii. if the condition is met, the removal of abnormal points is complete and the following steps continue; otherwise, return to step i;
calculate the error count alarm line of the system: alertNum=u+std*3;
Step 3, periodically collect the logs of the most recent period t and analyze in turn whether the error count of each system is abnormal, whether the error rate of each service of each system is abnormal, whether the error rate of each instance is abnormal, and the error counts between any two systems; specifically, traverse the system list and make the following judgments for each system:
a) if the accumulated error count of the system exceeds the alarm threshold of the system, mark the system as abnormal;
b) traverse each service of the system and use the abnormality judgment method to judge whether its error rate is abnormal;
c) traverse each instance of the system and use the abnormality judgment method to judge whether its error rate is abnormal;
the abnormality judgment method includes:
a) denote the number of correct calls as tNum and the number of erroneous calls as fNum; the total number of calls is num=tNum+fNum;
b) if fNum < the first threshold, return normal; otherwise go to the next step;
c) if num >= the second threshold, go to the next step; otherwise, return normal when fNum = the first threshold, and return abnormal in all other cases;
d) if fNum/num < the third threshold, return normal; otherwise go to the next step;
e) when fNum < tNum, k = fNum + the fourth threshold; otherwise k = fNum - the fifth threshold;
f) calculate the statistic z;
if z > the sixth threshold, return abnormal; otherwise return normal;
after the judgments are complete, the data are collated: the error count of each pair in which a clientInstance calls a serverInstance is calculated; the systems client and server corresponding to clientInstance and serverInstance are looked up in reverse from the system topology graph, and the accumulated error count of each pair in which a client system calls a server system is counted;
Step 4, based on the system topology graph, present the results on the system topology graph once the alarm data analysis is complete.
7. The method for monitoring system availability based on a statistical method according to claim 6, characterized in that step 1 further includes:
setting an alarm threshold; after the error count alarm line of the system is calculated, if alertNum < the alarm threshold, alertNum is set to the alarm threshold.
8. The method for monitoring system availability based on a statistical method according to claim 6, characterized in that the condition in the process of removing abnormal points in step 2 is as follows:
n1=0 or (n-n2)>30 or (n-n2)>n/3.
9. The method for monitoring system availability based on a statistical method according to claim 6, characterized in that step 4 further includes the following steps:
1, when a system is abnormal, add a warning mark to the system icon;
2, when a service or instance of a system is abnormal, show the error information in a pop-up layer when the system icon is clicked;
3, when the number of call errors between systems is not 0, draw a connecting line with a directional arrow.
10. The method for monitoring system availability based on a statistical method according to claim 9, characterized in that: the width of the connecting line is related to the error count.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810150782.5A CN108599977B (en) | 2018-02-13 | 2018-02-13 | System and method for monitoring system availability based on statistical method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810150782.5A CN108599977B (en) | 2018-02-13 | 2018-02-13 | System and method for monitoring system availability based on statistical method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108599977A (en) | 2018-09-28 |
CN108599977B (en) | 2021-09-28 |
Family
ID=63608860
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810150782.5A Active CN108599977B (en) | 2018-02-13 | 2018-02-13 | System and method for monitoring system availability based on statistical method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108599977B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102299897A (en) * | 2010-06-23 | 2011-12-28 | 电子科技大学 | Characteristic-association-based peer-to-peer networking characteristic analysis method |
CN102932466A (en) * | 2012-11-07 | 2013-02-13 | 网宿科技股份有限公司 | Distributed type source monitoring method and distributed type source monitoring system based on content delivery network |
CN103514259A (en) * | 2013-08-13 | 2014-01-15 | 江苏华大天益电力科技有限公司 | Abnormal data detection and modification method based on numerical value relevance model |
US20140115400A1 (en) * | 2012-10-23 | 2014-04-24 | Electronics And Telecommunications Research Institute | Device and method for fault management of smart device |
CN106407082A (en) * | 2016-09-30 | 2017-02-15 | 国家电网公司 | Method and device for alarming information system |
CN107612756A (en) * | 2017-10-31 | 2018-01-19 | 广西宜州市联森网络科技有限公司 | A kind of operation management system with intelligent trouble analyzing and processing function |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109617740A (en) * | 2018-12-28 | 2019-04-12 | 广东亿迅科技有限公司 | A kind of method and device that application failure quickly positions |
CN110086682A (en) * | 2019-05-22 | 2019-08-02 | 四川新网银行股份有限公司 | Service link call relation view and failure root based on TCP are because of localization method |
CN110086682B (en) * | 2019-05-22 | 2022-06-24 | 四川新网银行股份有限公司 | Service link calling relation view and fault root cause positioning method based on TCP |
CN111510351A (en) * | 2020-04-10 | 2020-08-07 | 星辰天合(北京)数据科技有限公司 | Anomaly detection method and device based on Promissuris monitoring system |
CN111510351B (en) * | 2020-04-10 | 2021-09-14 | 星辰天合(北京)数据科技有限公司 | Anomaly detection method and device based on Promissuris monitoring system |
CN114002233A (en) * | 2021-04-09 | 2022-02-01 | 住华科技股份有限公司 | Method and system for monitoring automatic optical detection device |
CN113962273A (en) * | 2021-09-22 | 2022-01-21 | 北京必示科技有限公司 | Multi-index-based time series anomaly detection method and system and storage medium |
CN114500326A (en) * | 2022-02-25 | 2022-05-13 | 北京百度网讯科技有限公司 | Abnormality detection method, abnormality detection device, electronic apparatus, and storage medium |
CN114500326B (en) * | 2022-02-25 | 2023-08-11 | 北京百度网讯科技有限公司 | Abnormality detection method, abnormality detection device, electronic device, and storage medium |
CN115037636A (en) * | 2022-06-06 | 2022-09-09 | 阿里云计算有限公司 | Service quality perception method and device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN108599977B (en) | 2021-09-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108599977A (en) | System and method based on statistical method monitoring system availability | |
CN101470426B (en) | Fault detection method and system | |
US11657309B2 (en) | Behavior analysis and visualization for a computer infrastructure | |
CN109412870B (en) | Alarm monitoring method and platform, server and storage medium | |
JP5077835B2 (en) | Plant analysis system | |
JP2015028700A (en) | Failure detection device, failure detection method, failure detection program and recording medium | |
CN110321352A (en) | Production line monitoring method and device, electronic equipment and readable storage medium | |
CN109034423B (en) | Fault early warning judgment method, device, equipment and storage medium | |
US20120116827A1 (en) | Plant analyzing system | |
CN106940678B (en) | System real-time health degree evaluation and analysis method and device | |
CN114550336B (en) | Equipment inspection method and device, computer equipment and storage medium | |
CN109345060B (en) | Product quality characteristic error traceability analysis method based on multi-source perception | |
CN118260120B (en) | Method, device, equipment and storage medium for monitoring memory of electric energy meter | |
EP3187950A1 (en) | A method for managing alarms in a control system | |
CN114531338A (en) | Monitoring alarm and tracing method and system based on call chain data | |
CN117240594B (en) | Multi-dimensional network security operation and maintenance protection management system and method | |
CN111736579B (en) | Industrial control equipment safety detection method based on log inquiry and retention | |
Zheng et al. | Anomaly localization in large-scale clusters | |
CN116841790A (en) | Off-line business monitoring method and system based on risk control | |
CN111314110A (en) | Fault early warning method for distributed system | |
EP1296247A1 (en) | Method and apparatus for monitoring the activity of a system | |
CN113992496B (en) | Abnormal alarm method and device based on quartile algorithm and computing equipment | |
CN113918372A (en) | Early warning system of data development platform based on flink realization | |
US11131985B2 (en) | Noise generation cause estimation device | |
CN117493129B (en) | Operating power monitoring system of computer control equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||