CN108599977A

CN108599977A - System and method for monitoring system availability based on a statistical method

Info

Publication number: CN108599977A
Application number: CN201810150782.5A
Authority: CN
Inventors: 梅存兵
Original assignee: Nanjing Tu Niu Science And Technology Ltd
Current assignee: Nanjing Tu Niu Science And Technology Ltd
Priority date: 2018-02-13
Filing date: 2018-02-13
Publication date: 2018-09-28
Anticipated expiration: 2038-02-13
Also published as: CN108599977B

Abstract

The present invention proposes a system and method for monitoring system availability based on a statistical method. The system includes an inter-system service call log module, an alarm threshold analysis module, an alert analysis module, and a monitoring alarm display module. Call logs between systems are collected, and historical data is periodically analyzed and learned to obtain the typical behavior of each system. Data from the most recent unit time period t is analyzed to determine whether each system's current error count is abnormal, whether the error rate of calls between systems is abnormal, and whether the availability of each service and each instance of each system is abnormal. Abnormal systems and the call relationships between abnormal systems are then marked as alerts on the system topology diagram. When displaying alert information, the invention shows the system status, the status of calls between systems, and the status of system services and instances on the system topology diagram, so that the faulty system can be located quickly when a large number of systems fail at once.

Description

System and method for monitoring system availability based on a statistical method

Technical field

The invention belongs to the technical field of software system monitoring and relates to a system and method for monitoring system availability based on a statistical method.

Background art

Internet companies typically operate a large number of application systems. Besides public-facing websites, apps, and the like, there are many internal application systems that support the company's operations and management. Internal application systems usually have fairly complex call relationships; a function that one system exposes for another system to call is referred to as a service. The industry generally monitors application system availability by the following means:

Method one: Use tools such as zabbix to monitor certain indicators of the system's servers, for example the number of processes/threads of the web system, CPU load, free memory, the number of abnormal HTTP status codes, and request response time. An alarm is raised when an indicator exceeds its set threshold.

Method two: Simulate a client that calls the system periodically and check whether indicators such as the server's response content and speed meet the set thresholds. An alarm is raised when an indicator exceeds its set threshold.

However, existing monitoring approaches have a number of defects:

1. The thresholds in methods one and two must all be set manually. Thresholds vary widely between systems, and the threshold of a single system also differs from one period to another, so setting and maintaining thresholds is a heavy workload. In practice, trial and error is generally used: a threshold is relaxed after a false alarm and tightened after a missed alarm. The resulting false-alarm and missed-alarm rates are both high.

2. The monitoring in method one reflects availability only partially and cannot serve as an actual availability indicator: a detected anomaly does not necessarily mean that system availability has dropped, and an unavailable system is not always reflected in these indicators.

3. The monitoring in method two reflects availability directly, but as a spot-check it has a small sample size and narrow coverage, and it is mostly limited to read operations and rarely used for write operations.

4. When there are many systems and they are complex, both monitoring methods produce too many indicators and too many alarms; the alarm noise hampers the diagnosis and localization of problems.

5. When a new system or a new service goes online, or when the deployment of systems and services changes, both monitoring methods require the monitored items to be maintained manually. They are not suited to systems with automatic failover or dynamically scaled service capacity.

6. For error-rate monitoring, the threshold method often produces false alarms. For example, when the error rate is required not to exceed 1%, a single operation that fails (error rate 100%) triggers an alarm, although in most such cases no alarm is needed.

7. When multiple systems in a complex system cluster fail at the same time, it is hard to quickly locate the system that actually failed; all problems have to be tackled indiscriminately at once, which wastes valuable time.

Summary of the invention

To solve the above problems, the present invention proposes a system and method for monitoring system availability based on a statistical method. Call logs between systems are collected, and historical data is periodically analyzed and learned to obtain the typical behavior of each system. Data from the most recent unit time period t is analyzed to determine whether each system's current error count is abnormal, whether the error rate of calls between systems is abnormal, and whether the availability of each service and each instance of each system is abnormal. Abnormal systems and the call relationships between abnormal systems are marked as alerts on the system topology diagram.

To achieve the above object, the present invention provides the following technical solutions:

A system for monitoring system availability based on a statistical method, including: an inter-system service call log module, an alarm threshold analysis module, an alert analysis module, and a monitoring alarm display module;

The inter-system service call log module collects and records log information for all calls between systems: call time, caller IP and port number, callee IP and port number, identifier of the called service, and whether the call succeeded;

The alarm threshold analysis module periodically learns from historical data to determine the typical behavior of each system. It obtains n samples for the current system, each sample describing the error count within a unit time period t, and removes the outliers from the sample set. The outlier removal process includes:

i. Calculate the mean u and the standard deviation std of the current sample set;

ii. Find all sample points in the sample set that exceed u+3*std, count them as n1, remove them from the sample set, and count the size n2 of the new sample set after removal;

iii. If the stopping condition is met, outlier removal is complete and the following step is executed; otherwise, return to step i;

Calculate the system's error-count alert line alertNum=u+std*3;

The alert analysis module periodically collects the logs of the most recent time period t and analyzes, in turn, whether the error count of each system is abnormal, whether the error rate of each service of each system is abnormal, whether the error rate of each instance is abnormal, and the error counts between any two systems. Specifically, it traverses the system list and makes the following judgments:

a) If the system's cumulative error count exceeds the system's alarm threshold, mark the system as abnormal;

b) Traverse each service of the system and use the abnormality judgment method to judge whether its error rate is abnormal;

c) Traverse each instance of the system and use the abnormality judgment method to judge whether its error rate is abnormal;

The abnormality judgment method includes:

a) Denote the number of correct calls as tNum and the number of erroneous calls as fNum; the total number of calls is num=tNum+fNum;

b) If fNum<first threshold, return normal; otherwise go to the next step;

c) If num>=second threshold, go to the next step; otherwise, return normal when fNum=first threshold and return abnormal in all other cases;

d) If fNum/num<third threshold, return normal; otherwise go to the next step;

e) When fNum<tNum, k=fNum+fourth threshold; otherwise k=fNum-fifth threshold;

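f) Calculate

$$z = \frac{k - num \cdot p_0}{\sqrt{num \cdot p_0 \, (1 - p_0)}}$$

where p0 is the highest acceptable error rate (the third threshold);

If z>sixth threshold, return abnormal; otherwise return normal;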

After the judgments are complete, collate the data and calculate the error count for each pair of clientInstance calling serverInstance;

Look up, from the system topology diagram, the systems client and server to which clientInstance and serverInstance belong, and count the cumulative error count for each pair of client system calling server system;

The alarm display module, based on the system topology diagram, displays the alert data on the system topology diagram once the analysis is complete.

Further, the alarm threshold analysis module is also used to set a preset alarm threshold: after the system's error-count alert line is calculated, if alertNum<alarm threshold, alertNum is set to the alarm threshold.

Further, the condition in the outlier removal process is as follows:

n1=0 or (n-n2)>30 or (n-n2)>n/3.

Further, the alarm display module is also used to:

1. When a system is abnormal, add a warning mark to the system icon;

2. When a service or instance of a system is abnormal, show the error information in a pop-up layer when the system icon is clicked;

3. When the number of call errors between two systems is not 0, draw a connecting line with a directional arrow.

Further, the width of the connecting line is related to the error count.

A method for monitoring system availability based on a statistical method includes the following steps:

Step 1: collect and record log information for all calls between systems: call time, caller IP and port number, callee IP and port number, identifier of the called service, and whether the call succeeded;

Step 2: periodically learn from historical data to determine the typical behavior of each system; obtain n samples for the current system, each sample describing the error count within a unit time period t, and remove the outliers from the sample set. The outlier removal process includes:

i. Calculate the mean u and the standard deviation std of the current sample set;

Calculate the system's error-count alert line alertNum=u+std*3;

Step 3: periodically collect the logs of the most recent time period t and analyze, in turn, whether the error count of each system is abnormal, whether the error rate of each service of each system is abnormal, whether the error rate of each instance is abnormal, and the error counts between any two systems; specifically, traverse the system list and make the following judgments:

The abnormality judgment method includes:

b) If fNum<first threshold, return normal; otherwise go to the next step;

d) If fNum/num<third threshold, return normal; otherwise go to the next step;

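f) Calculate

$$z = \frac{k - num \cdot p_0}{\sqrt{num \cdot p_0 \, (1 - p_0)}}$$

where p0 is the highest acceptable error rate (the third threshold);

If z>sixth threshold, return abnormal; otherwise return normal;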

Step 4: based on the system topology diagram, display the alert data on the system topology diagram once the analysis is complete.

Further, step 1 further includes:

Setting a preset alarm threshold: after the system's error-count alert line is calculated, if alertNum<alarm threshold, alertNum is set to the alarm threshold.

Further, the condition in the outlier removal process of step 2 is as follows:

n1=0 or (n-n2)>30 or (n-n2)>n/3.

Further, step 4 further includes the following steps:

1. When a system is abnormal, add a warning mark to the system icon;

Further, the width of the connecting line is related to the error count.

Compared with the prior art, the present invention has the following advantages and beneficial effects:

1. By analyzing the service call logs between instances, the invention can monitor whether systems, their services, and their instances are abnormal, and shows alert information in combination with the system topology diagram. When displaying alert information, the invention shows the system status, the status of calls between systems, and the status of system services and instances on the system topology diagram, so that the faulty system can be located quickly when a large number of systems fail at once.

2. The invention derives the alarm threshold by analyzing the system's normal behavior over a past period; during alert analysis, an alert is raised only when the error count exceeds that threshold. For count-type alerts, it provides a method of setting the alarm threshold automatically, which reduces manual work, improves alert accuracy, and greatly reduces both false alarms and missed alarms. After a new system has been running online for a period of time, the invention can set an alarm threshold for it automatically.

3. The invention can analyze whether a system's services and instances are abnormal; for ratio-type alerts, it improves alert accuracy and reduces false alarms and missed alarms.

4. The monitoring method samples real call data, which gives fuller coverage than periodic spot checks.

Description of the drawings

Fig. 1 is a schematic diagram of the normal distribution.

Fig. 2 is an example of the log format.

Fig. 3 is a flowchart of the alarm threshold analysis.

Fig. 4 is a schematic diagram of the per-system sample error counts obtained by calling the Logstash interface.

Fig. 5 is a diagram of the call data between instances.

Fig. 6 is the system topology diagram with warning marks added.

Fig. 7 is the system topology diagram with a pop-up layer showing error information.

Detailed description of the embodiments

The technical solution provided by the present invention is described in detail below with reference to specific embodiments. It should be understood that the following specific embodiments only illustrate the invention and do not limit its scope. In addition, the steps shown in the flowcharts of the drawings can be executed in a computer system, for example as a set of computer-executable instructions; and although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in a different order.

The error count of a system within a unit time period t can be regarded as influenced by many independent random factors, each of which has only a small effect under normal conditions, so it can be studied as a random variable that follows a normal distribution. The density function of the normal distribution with mean u and standard deviation std is:

$$f(x) = \frac{1}{\mathrm{std}\,\sqrt{2\pi}} \exp\left(-\frac{(x-u)^2}{2\,\mathrm{std}^2}\right)$$

By collecting a system's performance data over a past period under normal conditions, we can calculate the mean u and standard deviation std of the error count within a unit time period t. Denote the error count in the most recent unit time period t as failNum. As shown in Fig. 1, the probability P(failNum>=u+3*std) is easily calculated to be far less than 0.01, i.e. it is an extremely low-probability event. Therefore, when the error count observed in the system's most recent unit time period t exceeds the mean plus three standard deviations, this is an extreme case that requires human attention, and alert information should be sent.
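As a quick numerical check of the three-sigma argument above, the upper-tail probability of a normal variable beyond three standard deviations can be computed directly. This is a minimal sketch; the function name `upper_tail_probability` is introduced here for illustration and does not come from the patent.

```python
import math


def upper_tail_probability(k: float) -> float:
    """P(X >= u + k*std) for a normally distributed X.

    The result does not depend on u or std: for a standard normal Z,
    P(Z >= k) = 0.5 * erfc(k / sqrt(2)).
    """
    return 0.5 * math.erfc(k / math.sqrt(2))


# Probability of exceeding the mean by three standard deviations.
p_three_sigma = upper_tail_probability(3.0)
print(f"P(failNum >= u + 3*std) = {p_three_sigma:.5f}")
```

The value is on the order of 0.001, which is consistent with the statement that the event is far less likely than 0.01.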

When studying the error rate of systems and services through observed error counts, we readily find that even if the highest acceptable error rate p0 is 0.01, observing more than one failed call in 100 actual calls does not show that the system is faulty, because this is an event with a fairly high probability.

When the number of calls is small (here we take fewer than 40), we calculate the conditional probability p1 that the error count failNum observed in n calls exceeds failLevel, given that the system's intrinsic error rate p is not higher than p0. Evaluated at p=p0, the worst case permitted by the condition, this is the binomial tail:

$$p_1 = P(failNum > failLevel) = \sum_{i=failLevel+1}^{n} \binom{n}{i}\, p_0^{\,i}\, (1-p_0)^{\,n-i}$$

We call an event whose probability is less than 0.05 a small-probability event. In a small, finite number of trials a small-probability event should not occur; that is, when one does occur, we can no longer believe that p is not higher than p0 and should instead conclude that p is higher than p0. The system's error rate is then too high and an alert should be sent. By numerical computation we found the critical points of failLevel for which p1<0.05: when n<=5, the critical point of failLevel is 0; when 5<n<=35, the critical point is 1; when 35<n<40, the critical point is 2. That is, for n observed calls, if the error count exceeds the corresponding failLevel, a small-probability event is considered to have occurred and needs attention; otherwise the system is considered normal. For ease of processing, we uniformly set failLevel to 1 for n<40; in practice the resulting error is within an acceptable range.
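The critical points quoted above can be reproduced numerically. The sketch below assumes that the conditional probability is the binomial tail evaluated at p=p0=0.01 and that a small-probability event is one with probability below 0.05; the function names are illustrative and not taken from the patent.

```python
from math import comb


def tail_probability(n: int, fail_level: int, p0: float = 0.01) -> float:
    """P(failNum > fail_level) in n calls when the true error rate equals p0."""
    at_most = sum(
        comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(fail_level + 1)
    )
    return 1.0 - at_most


def critical_fail_level(n: int, p0: float = 0.01, alpha: float = 0.05) -> int:
    """Smallest fail_level whose exceedance probability is below alpha."""
    level = 0
    while tail_probability(n, level, p0) >= alpha:
        level += 1
    return level


# Print the critical point for every call count below 40.
for n in range(1, 40):
    print(n, critical_fail_level(n))
```

Under these assumptions the critical point changes from 0 to 1 between n=5 and n=6, and from 1 to 2 between n=35 and n=36, which matches the ranges given in the text.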

When the number of calls is large (here, no fewer than 40), let the observed error rate be p1 and the system's intrinsic error rate be p, which under normal conditions is not higher than p0. By the central limit theorem, p1 approximately follows a normal distribution with mean p and variance p(1-p)/n, i.e. the statistic

$$\frac{p_1 - p}{\sqrt{p\,(1-p)/n}}$$

approximately follows the standard normal distribution. When p<=p0, the statistic

$$z = \frac{p_1 - p_0}{\sqrt{p_0\,(1-p_0)/n}}$$

is treated as approximately following the standard normal distribution. From the standard normal table, the probability that z>1.645 is less than 0.05; this is a small-probability event that deserves attention, and an alert should be sent. For ease of application, we rewrite z as:

$$z = \frac{n\,p_1 - n\,p_0}{\sqrt{n\,p_0\,(1-p_0)}}$$

where n*p1 is exactly the error count actually observed.

The corresponding abnormality judgment method includes the following steps:

a) Denote the number of correct calls as tNum and the number of erroneous calls as fNum; the total number of calls is num=tNum+fNum;

b) If fNum<1, return normal; otherwise go to the next step;

c) If num>=40, go to the next step; otherwise, return normal when fNum=1 and return abnormal in all other cases;

d) If fNum/num<0.01, return normal; otherwise go to the next step;

e) When fNum<tNum, k=fNum+0.5; otherwise k=fNum-0.5 (because the normal distribution is only an approximation, this correction brings the statistic closer to the normal distribution);

f) Calculate

$$z = \frac{k - num \cdot p_0}{\sqrt{num \cdot p_0 \, (1 - p_0)}}, \qquad p_0 = 0.01$$

If z>1.645, return abnormal; otherwise return normal.

Each value used in the abnormality judgment method can be adjusted as needed.
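The judgment steps above can be sketched as a single function. This is an illustrative reading of the steps rather than the patented implementation: the function and parameter names are ours, and the statistic in step f) is assumed to be the continuity-corrected form of the normal approximation derived in the preceding paragraph.

```python
import math


def is_abnormal(
    t_num: int,
    f_num: int,
    p0: float = 0.01,      # highest acceptable error rate (step d)
    min_calls: int = 40,   # boundary between "few" and "many" calls (step c)
    z_limit: float = 1.645,  # one-sided 5% point of the standard normal
) -> bool:
    """Return True when the error rate of a service or instance looks abnormal."""
    num = t_num + f_num  # a) total number of calls
    if f_num < 1:  # b) no errors at all
        return False
    if num < min_calls:  # c) few calls: tolerate exactly one error
        return f_num != 1
    if f_num / num < p0:  # d) observed rate below the acceptable rate
        return False
    # e) correction term, as described in the text
    k = f_num + 0.5 if f_num < t_num else f_num - 0.5
    # f) normal approximation (assumed form of the statistic)
    z = (k - num * p0) / math.sqrt(num * p0 * (1 - p0))
    return z > z_limit


print(is_abnormal(t_num=9, f_num=1))     # one failure in ten calls
print(is_abnormal(t_num=970, f_num=30))  # 3% failures in 1000 calls
```

With the default values, a single failure among a handful of calls is not reported, which addresses the false alarm described in defect 6 of the background section.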

The system for monitoring system availability based on a statistical method provided by the present invention includes: an inter-system service call log module, an alarm threshold analysis module, an alert analysis module, and a monitoring alarm display module. The service list and instance list of each system can be obtained directly from the system topology diagram. Service lists, instance lists, and the logs relating to services and instances are described in more detail in the applicant's invention patent application No. 2017109039551, entitled "System and method for automatically drawing system deployment and dependency relationships".

The inter-system service call log module collects and records the log information for all calls between systems. Specifically, we refer to a concrete deployment of a system (Application) on a server as an instance (Instance); an instance is uniquely identified by the IP of the server on which it runs and the port number it occupies. After one instance calls a service of another instance, the caller records a call log (as shown in the figure) that contains: the call time (startTime), the caller IP (consumerIp) and port number (consumerPort), the callee IP (serviceIp) and port number (servicePort), the identifier of the called service (serviceName), and whether the call succeeded (success). The inter-system service call log module uses the open-source tool Logstash to store these logs, and can save the data within 2 seconds after a call takes place. The stored logs are shown in Fig. 2.
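A call log record with the fields listed above might be modelled as follows. The field names come from the text; the field types, the `ip:port` form of the instance key, and the sample values are assumptions made for illustration, since the actual log layout is only shown in Fig. 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallLog:
    startTime: str      # call time (string format assumed)
    consumerIp: str     # caller IP
    consumerPort: int   # caller port number
    serviceIp: str      # callee IP
    servicePort: int    # callee port number
    serviceName: str    # identifier of the called service
    success: bool       # whether the call succeeded

    @property
    def client_instance(self) -> str:
        """An instance is identified by server IP plus port (key format assumed)."""
        return f"{self.consumerIp}:{self.consumerPort}"

    @property
    def server_instance(self) -> str:
        return f"{self.serviceIp}:{self.servicePort}"


# Hypothetical record, for illustration only.
record = CallLog(
    startTime="2018-02-13 10:00:00",
    consumerIp="10.0.0.1",
    consumerPort=8080,
    serviceIp="10.0.0.2",
    servicePort=9090,
    serviceName="orderService",
    success=False,
)
print(record.client_instance, "->", record.server_instance)
```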

The alarm threshold analysis module periodically learns from historical data to determine the typical behavior of each system. The specific implementation process is shown in Fig. 3 and includes the following steps:

1. Traverse the system list:

a) Obtain the list of all services of the current system;

b) Call the Logstash interface to fetch the cumulative error counts of all services of the system over the most recent time range n*t, and divide them into n parts in units of t; that is, we obtain n samples for the current system, each describing the error count within a unit time period t, as shown in Fig. 4;

c) Remove the outliers from the sample set:

i. Calculate the mean u and the standard deviation std of the current sample set;

ii. Find all sample points in the sample set that exceed u+3*std, count them as n1, remove them from the sample set, and count the size n2 of the new sample set after removal. This step finds and rejects historical abnormal conditions so that they do not distort the assessment of the system's typical behavior;

iii. If n1=0 or (n-n2)>30 or (n-n2)>n/3, outlier removal is complete and step d) is executed; otherwise, return to step i;

d) Calculate the mean u and the standard deviation std of the new sample set;

Calculate the system's error-count alert line alertNum=u+std*3; if alertNum<100, alertNum is set to 100.
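Steps c) and d) can be sketched as follows. The text does not say whether std is the population or the sample standard deviation; the population form (`statistics.pstdev`) is assumed here, and the function and parameter names are illustrative. The input must contain at least one sample.

```python
from statistics import fmean, pstdev


def alert_line(samples, floor=100.0, max_removed=30):
    """Compute alertNum = u + 3*std after iteratively removing outliers.

    samples: error counts per unit time period t (n values).
    floor: minimum alert line (100 in the embodiment).
    """
    n = len(samples)
    current = list(samples)
    while True:
        # i. mean and standard deviation of the current sample set
        u, std = fmean(current), pstdev(current)
        # ii. drop every point above u + 3*std
        kept = [x for x in current if x <= u + 3 * std]
        n1 = len(current) - len(kept)  # points removed in this pass
        n2 = len(kept)                 # size of the new sample set
        current = kept
        # iii. stop when nothing was removed or too much has been removed
        if n1 == 0 or (n - n2) > max_removed or (n - n2) > n / 3:
            break
    # d) recompute on the cleaned set and apply the floor
    u, std = fmean(current), pstdev(current)
    return max(u + 3 * std, floor)


# Fifty quiet periods and one historical incident (made-up numbers).
history = [10] * 50 + [1000]
print(alert_line(history, floor=0.0))
print(alert_line(history))
```

In this made-up history the incident period is rejected in the first pass, so it no longer inflates the alert line; the floor then keeps the line from becoming unrealistically tight for a very quiet system.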

The alert analysis module periodically (for example once per minute; the collection interval can be adjusted as needed) collects the logs of the most recent time period t and analyzes, in turn, whether the error count of each system is abnormal, whether the error rate of each service of each system is abnormal, whether the error rate of each instance is abnormal, and the error counts between any two systems. The specific method is as follows:

1. Extract the logs of the most recent time t from Logstash, i.e. enumerate all existing relationships of the form "instance a called service c of instance b, with this many successes and this many failures", as shown in Fig. 5; denote the result as data;

2. Collate data: merge the fields consumerIp and consumerPort into the field clientInstance, and merge the fields serviceIp and servicePort into the field serverInstance;

3. Collate data: calculate the cumulative correct count and the cumulative error count of each serverInstance and of each serviceName;

4. Traverse the system list:

a) Count the cumulative error count of each system, i.e. the sum of the error counts of all services under the system;

b) If the system's cumulative error count exceeds the system's alarm threshold alertNum, mark the system as abnormal;

c) Traverse each service (serviceName) of the system and apply the aforementioned abnormality judgment method to the cumulative correct count and cumulative error count of serviceName to judge whether it is abnormal;

d) Traverse each instance (serverInstance) of the system and apply the aforementioned abnormality judgment method to the cumulative correct count and cumulative error count of serverInstance to judge whether it is abnormal;

5. Collate data: calculate the error count for each pair of clientInstance calling serverInstance;

Look up, from the system topology diagram, the systems client and server to which clientInstance and serverInstance belong, and count the cumulative error count for each pair of client system calling server system.
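The collation in steps 3 and 5 can be sketched with plain dictionaries. The row layout, the function names, and the `instance_to_system` mapping (standing in for the lookup against the system topology diagram) are assumptions made for this example.

```python
from collections import Counter, defaultdict


def aggregate(rows):
    """rows: (clientInstance, serverInstance, serviceName, successes, failures)."""
    per_instance = defaultdict(lambda: [0, 0])  # serverInstance -> [correct, errors]
    per_service = defaultdict(lambda: [0, 0])   # serviceName -> [correct, errors]
    pair_errors = Counter()  # (clientInstance, serverInstance) -> errors
    for client, server, service, ok, failed in rows:
        per_instance[server][0] += ok
        per_instance[server][1] += failed
        per_service[service][0] += ok
        per_service[service][1] += failed
        pair_errors[(client, server)] += failed
    return dict(per_instance), dict(per_service), pair_errors


def system_pair_errors(pair_errors, instance_to_system):
    """Roll instance-level error counts up to (client system, server system)."""
    totals = Counter()
    for (client, server), failed in pair_errors.items():
        key = (instance_to_system[client], instance_to_system[server])
        totals[key] += failed
    return totals


# Made-up data for illustration.
data = [
    ("10.0.0.1:80", "10.0.0.2:90", "getOrder", 95, 5),
    ("10.0.0.3:80", "10.0.0.2:90", "getOrder", 40, 0),
    ("10.0.0.1:80", "10.0.0.4:90", "getUser", 10, 2),
]
topology = {
    "10.0.0.1:80": "web",
    "10.0.0.3:80": "web",
    "10.0.0.2:90": "order",
    "10.0.0.4:90": "user",
}
per_instance, per_service, pair_errors = aggregate(data)
print(system_pair_errors(pair_errors, topology))
```

The per-service and per-instance pairs of counts are the inputs to the abnormality judgment method, while the system-level pair totals are what the topology diagram draws as connecting lines.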

The alarm display module, based on the system topology diagram, displays the alert data on the system topology diagram once the analysis is complete:

1. When a system is abnormal, add a warning mark to the system icon, as shown in Fig. 6;

2. When a service or instance of a system is abnormal, show the error information in a pop-up layer when the system icon is clicked, as shown in Fig. 7;

3. When the number of call errors between two systems is not 0, draw a connecting line with a directional arrow; the width of the line is the logarithm of the error count. Other general formulas that take the error count as input may also be used to calculate the line width; as long as the width or color of the line is correlated with the error count, the requirements of the present invention are met.
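One possible mapping from error count to line width is sketched below. The text only says that the width is the logarithm of the error count; the base-10 logarithm and the minimum visible width of 1 are assumptions chosen for illustration, as is the function name.

```python
import math


def line_width(error_count: int, min_width: float = 1.0) -> float:
    """Width of the connecting line between two systems.

    Returns 0.0 when there are no errors (no line is drawn). Otherwise the
    width grows with log10 of the error count, but never below min_width.
    """
    if error_count <= 0:
        return 0.0
    return max(min_width, math.log10(error_count))


for errors in (0, 1, 50, 1000):
    print(errors, line_width(errors))
```

A logarithmic scale keeps a link with thousands of errors from visually drowning out links with only dozens, while still ordering the links by severity.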

When systems fail, we can easily see from the diagram which system has a problem, which systems are affected, and which instances and services of a system have a problem.

The present invention also provides a method for monitoring system availability based on a statistical method, including an inter-system service call logging step, an alarm threshold analysis step, an alert analysis step, and a monitoring alarm display step. The inter-system service call logging step performs what the inter-system service call log module implements, the alarm threshold analysis step performs what the alarm threshold analysis module implements, the alert analysis step performs what the alert analysis module implements, and the monitoring alarm display step performs what the monitoring alarm display module implements.

The technical means disclosed in the solution of the present invention are not limited to those disclosed in the above embodiments, and also include technical solutions formed by any combination of the above technical features. It should be noted that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications are also regarded as falling within the scope of protection of the present invention.

Claims

1. A system for monitoring system availability based on a statistical method, characterized by including: an inter-system service call log module, an alarm threshold analysis module, an alert analysis module, and a monitoring alarm display module;

The inter-system service call log module collects and records log information for all calls between systems: call time, caller IP and port number, callee IP and port number, identifier of the called service, and whether the call succeeded;

The alarm threshold analysis module periodically learns from historical data to determine the typical behavior of each system; it obtains n samples for the current system, each sample describing the error count within a unit time period t, and removes the outliers from the sample set; the outlier removal process includes:

i. Calculate the mean u and the standard deviation std of the current sample set;

ii. Find all sample points in the sample set that exceed u+3*std, count them as n1, remove them from the sample set, and count the size n2 of the new sample set after removal;

Calculate the system's error-count alert line alertNum=u+std*3;

The alert analysis module periodically collects the logs of the most recent time period t and analyzes, in turn, whether the error count of each system is abnormal, whether the error rate of each service of each system is abnormal, whether the error rate of each instance is abnormal, and the error counts between any two systems; specifically, it traverses the system list and makes the following judgments:

The abnormality judgment method includes:

b) If fNum<first threshold, return normal; otherwise go to the next step;

c) If num>=second threshold, go to the next step; otherwise, return normal when fNum=first threshold and return abnormal in all other cases;

d) If fNum/num<third threshold, return normal; otherwise go to the next step;

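f) Calculate

$$z = \frac{k - num \cdot p_0}{\sqrt{num \cdot p_0 \, (1 - p_0)}}$$

where p0 is the highest acceptable error rate (the third threshold);

If z>sixth threshold, return abnormal; otherwise return normal;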

After the judgments are complete, collate the data and calculate the error count for each pair of clientInstance calling serverInstance;

Look up, from the system topology diagram, the systems client and server to which clientInstance and serverInstance belong, and count the cumulative error count for each pair of client system calling server system;

The alarm display module, based on the system topology diagram, displays the alert data on the system topology diagram once the analysis is complete.

2. The system for monitoring system availability based on a statistical method according to claim 1, characterized in that: the alarm threshold analysis module is also used to set a preset alarm threshold; after the system's error-count alert line is calculated, if alertNum<alarm threshold, alertNum is set to the alarm threshold.

3. The system for monitoring system availability based on a statistical method according to claim 1, characterized in that the condition in the outlier removal process is as follows:

n1=0 or (n-n2)>30 or (n-n2)>n/3.

4. The system for monitoring system availability based on a statistical method according to claim 1, characterized in that the alarm display module is also used to:

1. When a system is abnormal, add a warning mark to the system icon;

5. The system for monitoring system availability based on a statistical method according to claim 4, characterized in that: the width of the connecting line is related to the error count.

6. A method for monitoring system availability based on a statistical method, characterized by including the following steps:

Step 1: collect and record log information for all calls between systems: call time, caller IP and port number, callee IP and port number, identifier of the called service, and whether the call succeeded;

Step 2: periodically learn from historical data to determine the typical behavior of each system; obtain n samples for the current system, each sample describing the error count within a unit time period t, and remove the outliers from the sample set; the outlier removal process includes:

i. Calculate the mean u and the standard deviation std of the current sample set;

Calculate the system's error-count alert line alertNum=u+std*3;

Step 3: periodically collect the logs of the most recent time period t and analyze, in turn, whether the error count of each system is abnormal, whether the error rate of each service of each system is abnormal, whether the error rate of each instance is abnormal, and the error counts between any two systems; specifically, traverse the system list and make the following judgments:

The abnormality judgment method includes:

b) If fNum<first threshold, return normal; otherwise go to the next step;

d) If fNum/num<third threshold, return normal; otherwise go to the next step;

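f) Calculate

$$z = \frac{k - num \cdot p_0}{\sqrt{num \cdot p_0 \, (1 - p_0)}}$$

where p0 is the highest acceptable error rate (the third threshold);

If z>sixth threshold, return abnormal; otherwise return normal;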

7. The method for monitoring system availability based on a statistical method according to claim 6, characterized in that step 1 further includes:

8. The method for monitoring system availability based on a statistical method according to claim 6, characterized in that the condition in the outlier removal process of step 2 is as follows:

n1=0 or (n-n2)>30 or (n-n2)>n/3.

9. The method for monitoring system availability based on a statistical method according to claim 6, characterized in that step 4 further includes the following steps:

1. When a system is abnormal, add a warning mark to the system icon;

10. The method for monitoring system availability based on a statistical method according to claim 9, characterized in that: the width of the connecting line is related to the error count.