CN116886517B

CN116886517B - Alarm system and method based on flow data

Info

Publication number: CN116886517B
Application number: CN202311130213.1A
Authority: CN
Inventors: 李伟山
Original assignee: Jiangsu Dianshi Letou Technology Co ltd
Current assignee: Jiangsu Dianshi Letou Technology Co ltd
Priority date: 2023-09-04
Filing date: 2023-09-04
Publication date: 2023-11-24
Anticipated expiration: 2043-09-04
Also published as: CN116886517A

Abstract

The invention provides an alarm system and method based on flow data, comprising: setting an alarm threshold for each interface; collecting and storing request data and response times using buried points (instrumentation) and a background thread; aggregating and calculating each piece of flow data against the preset alarm threshold and judging whether an alarm condition is reached; when an alarm is triggered, verifying and resetting the alarm threshold to eliminate false alarms caused by improper settings; if the abnormality is confirmed, calling an algorithm that uses an AI model to find the cause; and setting the alarm frequency and the notification mode according to the alarm frequency and the classification of the interface. The invention actively collects flow data and aggregates it by application, interface and time granularity, so that system flow can be monitored in real time and problems are discovered more efficiently. It also allows responsible persons to be designated per application and lets each responsible person choose a preferred alarm notification channel, which improves the delivery rate of alarm messages.

Description

Alarm system and method based on flow data

Technical Field

The invention relates to the technical field of flow data, in particular to an alarm system and method based on flow data.

Background

In the current financial industry, the spread and development of internet technology has brought many innovations and changes, especially for businesses built on a browser/server (B/S) architecture. Such an architecture inevitably involves interface interactions, and interface anomalies, such as unavailability or slow response, can significantly affect the stability and performance of an application.

At present, interface anomalies are usually discovered through active feedback from users or spot checks by inspectors. This makes problem discovery very passive and prevents research and development personnel from discovering and locating a problem as soon as it occurs. In addition, when locating a problem, research and development personnel have to start from the entry point and check step by step along the business logic, so the root cause cannot be located immediately. As a result, the problem stays exposed for longer, the user experience degrades, and the company may suffer heavy losses.

On the other hand, since developers generally do not face customers directly, the feedback link for anomalies is also long. When a customer has a problem, the problem is collected by customer service or a dedicated contact person, fed back across departments to the corresponding development lead, and then assigned by the development lead to specific research and development personnel according to the domain of the faulty interface. This not only increases the communication cost but also reduces the speed and efficiency of problem solving.

Therefore, the problems of current alarm systems, such as poor adaptability to real-time flow changes, inability to locate anomalies accurately, and heavy alarm noise, are particularly serious in the financial industry. It is therefore all the more necessary to develop an alarm system based on flow data that can calculate alarm thresholds elastically, locate anomalies accurately and effectively reduce alarm noise, so as to improve problem-solving efficiency, shorten the time from user feedback to resolution, improve the user experience and reduce company losses.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides an alarm system and method based on flow data, which address the following problems in a production environment: flow anomalies cannot be discovered in time, early warning cannot be given in advance, the feedback link for flow anomalies is too long, and research and development personnel are slow to locate and resolve anomalies.

In order to achieve the above object, the present invention provides an alarm method based on flow data, including:

step S1: setting an explicit alarm threshold for each interface;

step S2: collecting and storing request data and response time using buried points (instrumentation) and a background thread;

step S3: according to the preset alarm threshold value, aggregating and calculating each piece of flow data, and judging whether an alarm condition is reached;

step S4: when the alarm is triggered, verifying and resetting the alarm threshold value, and eliminating false alarm caused by improper setting;

step S5: if the abnormality is confirmed, an algorithm is called, and the cause is found out by using an AI model;

step S6: setting the alarm frequency and the notification mode according to the alarm frequency and the classification of the interfaces.

Further, the step S1 is specifically as follows:

step S11: logging in a system, and selecting an application and an interface to be monitored;

step S12: setting alarm thresholds of real-time rules and delay rules according to the use condition and the requirement of a specific interface;

step S13: selecting the threshold rule so as to ensure the accuracy and practicability of the alarm.

Further, the step S2 specifically includes:

step S21: when a request access is initiated, the request is sent to a server through protocols such as HTTP;

step S22: collecting and storing related data information in a memory of a server by using a buried point mode, wherein the related data information comprises request data and corresponding response time;

step S23: starting a background thread that summarizes all data in the memory every second and pushes the summarized data to the kafka message middleware;

step S24: the alarm system consumes the kafka messages, performs summary calculation on the data generated by the same application and the same interface within the same second, and then stores the result in the Influxdb time-series database for subsequent use and analysis, wherein the data service comprises querying model data by condition, modifying model data by primary key, modifying model data in batches, adding model data in batches, deleting model data by primary key, deleting model data in batches, and an aggregation data service.

Further, the step S4 is specifically as follows:

step S41: after an alarm is sent, first checking whether it may be a false alarm caused by an unreasonable threshold setting;

step S42: checking the historical flow data of the past month by application and interface, and judging whether the current QPS/RT data conforms to the historical pattern;

step S43: if the data conforms to the historical pattern, inputting the current data into the threshold calculation algorithm, calculating a new threshold, resetting the threshold, and ending the alarm processing flow;

step S44: if the data does not conform to the historical pattern, an abnormal situation is indicated, and the next step of the alarm flow continues to be executed.

Further, step S5 is specifically that after confirming that an abnormal situation occurs, an algorithm of an abnormality locating module needs to be called, and the abnormality locating module uses an AI model to analyze based on the request link, the log output content and the hardware resource load situation information, so as to give the reason of the abnormality.

Further, the step S6 specifically includes:

step S61: determining the frequency and notification mode of the alarm according to the importance of the interface and whether the noise reduction function is selected;

step S62: after the number of times of continuously sending out the alarm reaches a threshold value, the system can automatically raise the importance level of the interface and correspondingly raise the frequency of the alarm;

step S63: the alarm information can be sent to the person responsible for the interface through various channels such as enterprise WeChat, WeChat official account, e-mail and short message.

Further, the method also comprises self-learning threshold adjustment, specifically:

step S71: collecting and storing a large amount of historical traffic data and alarm events;

step S72: using machine learning algorithms (e.g., decision trees, neural networks, etc.) to learn from and model these data. The input features may include time, date, type and number of requests, etc., and the output target may be whether an alarm event has occurred;

step S73: automatically adjusting the alarm threshold according to the prediction result of the model. For example, if the model predicts that the amount of access to the system generally increases between 18:00 and 20:00 each day, the threshold may be raised during this period.

Further, the method also comprises the step of predictive warning:

step S81: likewise collecting and storing historical traffic data and alarm events;

step S82: using predictive machine learning models (e.g., ARIMA, LSTM, etc.) to predict future traffic and alarm events. For example, it may be predicted whether the number of requests per minute will exceed the alarm threshold within the next hour;

step S83: if the prediction shows that an alarm event is likely to occur, sending an early warning in advance, so that the person responsible for the service has enough time to prevent the problem from occurring.

An alert system based on traffic data, comprising: the system comprises a threshold configuration module, a flow data collection module, a real-time alarm processing module, a threshold elastic calculation module, an abnormality positioning module and an alarm notification and noise reduction module;

the threshold configuration module is responsible for setting interface alarm thresholds, including but not limited to thresholds on request counts and response times. The user can customize an explicit alarm threshold for each interface according to his or her own requirements;

the traffic data collection module is responsible for collecting request data and response time for each function in real time. The flow information generated by user access is asynchronously summarized into the database of the alarm system by means of buried points;

the real-time alarm processing module is used for judging whether each flow data reaches a preset alarm threshold value in real time. If the threshold is reached, an alarm may need to be triggered;

the threshold elastic calculation module is used for verifying and resetting the alarm threshold when the alarm is triggered, so as to avoid false alarm caused by unreasonable threshold setting;

the abnormality locating module is used for analyzing possible reasons including but not limited to information such as request links, log output content, hardware resource load conditions and the like through an algorithm after abnormality is confirmed;

the alarm notification and noise reduction module is used for selecting a suitable alarm notification mode, such as enterprise WeChat, WeChat official account, mailbox or short message, according to the alarm frequency and the classification of the interface, and can reduce the alarm frequency when an interface triggers alarms frequently without affecting the user experience.

Compared with the prior art, the invention has the beneficial effects that:

1. The invention provides an alarm system and method based on flow data, which actively collect flow data and aggregate it by application, interface and time granularity, so that system flow can be monitored in real time and problems are discovered more efficiently. Responsible persons can be designated per application, and each responsible person can choose the alarm notification channel he or she prefers, which improves the delivery rate of alarm messages.

2. The invention provides an alarm system and method based on flow data, which personalize system alarms by providing multiple alarm rule configurations and so meet the requirements of different applications and interfaces.

3. The invention provides an alarm system and method based on flow data, which grade interfaces by importance and apply different noise reduction logic to each grade, effectively reducing invalid alarms and avoiding a flood of alarm messages.

4. The invention provides a warning system and a warning method based on flow data, which continuously optimize threshold setting through a data algorithm according to historical data, so that the system is more flexible and can adapt to different flow changes.

5. The invention provides a warning system and a warning method based on flow data, which provide preliminary abnormal reason positioning by utilizing the integration of a machine learning algorithm and other data platforms, greatly shorten the problem positioning time and improve the problem processing efficiency.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the following description will briefly explain the drawings needed in the embodiments or the prior art, and it is obvious that the drawings in the following description are some embodiments of the present invention and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of the present invention;

FIGS. 2-3 are management system UI designs;

FIG. 4 is a schematic diagram of flow data acquisition according to the present invention;

FIG. 5 is a schematic diagram illustrating a software system design;

FIG. 6 is a flow chart of red line rule calculation;

FIG. 7 is a flowchart of an alternative rule algorithm;

FIG. 8 is a flowchart of the operation of the software.

Detailed Description

The technical solution of the present invention will be more clearly and completely explained by the description of the preferred embodiments of the present invention with reference to the accompanying drawings.

As shown in fig. 1, the present invention specifically comprises:

an alarm method based on flow data is applicable to an alarm system based on flow data, and comprises the following steps:

step S1: setting an explicit alarm threshold for each interface;

Steps S1, S2 and S4 are carried out as detailed in steps S11 to S13, S21 to S24 and S41 to S44 above, respectively.

Step S5 is specifically that after confirming that an abnormal situation occurs, an algorithm of an abnormal positioning module needs to be called, and the abnormal positioning module uses an AI model to analyze based on the request link, the log output content and the hardware resource load situation information, so as to give the reason of the abnormal situation.

Step S6 is carried out as detailed in steps S61 to S63 above.

Specific embodiments further include the self-learning threshold adjustment of steps S71 to S73 and the predictive alert of steps S81 to S83 above.

FIGS. 2-3 show the configuration of the existing flow rule list. The operation steps are as follows:

(1) Logging in to the alarm system page with an account and password;

(2) Opening the rule list page and viewing the existing alarm rules page by page;

(3) Searching by conditions such as application and interface;

(4) Clicking the add or edit button to open the rule editing popup window;

(5) Filling in the rule type, the application and the interface to which the rule belongs;

(6) Clicking the + sign, selecting one or more rules to add, and filling in the variables of each rule one by one;

(7) Clicking the save button to save the rule configuration.

FIG. 4 is a schematic diagram of flow data collection; the operation flow is as follows:

(1) Using agent technology, an agent runs on each web server; every time a request is received, the start time Stime of the current request is recorded in memory.

(2) The request data is passed on to the web server for processing.

(3) After the web server finishes processing, the agent records the current time Etime and retrieves from memory the start time Stime recorded for the request in step (1).

(4) The actual time taken by the current request, in milliseconds (ms), is calculated as the request end time Etime minus the request start time Stime.

(5) The time taken by the request is saved in memory.

(6) At intervals, the agent sends all data in memory to the Sentinel data collection service through a TCP interface.

(7) The Sentinel data aggregation service classifies the received data by application/interface and then calculates the average time taken and the total number of requests at one-second granularity.

(8) The Sentinel data aggregation service writes the aggregated data to the kafka distributed messaging system.

(9) The alert center consumes the kafka messages, saves the data in the influxdb database, and executes the alarm algorithm.

(10) If an alarm rule is hit, an alarm message is sent to the responsible person through the notification module.
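Steps (1) to (7) above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented agent: the class `TimingAgent`, the function `aggregate`, and the tuple layout of the buffered records are hypothetical names chosen for the example, and the TCP transfer to the Sentinel service is replaced by a plain `flush()` call.

```python
import time
from collections import defaultdict


class TimingAgent:
    """Hypothetical in-process agent that times requests and buffers the results."""

    def __init__(self, clock=time.time):
        self._clock = clock          # injectable clock, in seconds
        self._start_times = {}       # request id -> Stime
        self._buffer = []            # (application, interface, second, elapsed_ms)

    def on_request(self, request_id):
        # Step (1): record Stime when the request arrives.
        self._start_times[request_id] = self._clock()

    def on_response(self, request_id, application, interface):
        # Steps (3)-(5): record Etime, compute Etime - Stime in ms, keep it in memory.
        etime = self._clock()
        stime = self._start_times.pop(request_id)
        elapsed_ms = (etime - stime) * 1000.0
        self._buffer.append((application, interface, int(etime), elapsed_ms))
        return elapsed_ms

    def flush(self):
        # Step (6): hand over everything in memory and clear the buffer.
        records, self._buffer = self._buffer, []
        return records


def aggregate(records):
    # Step (7): group by application/interface/second, then compute the
    # total number of requests and the average time taken per group.
    groups = defaultdict(list)
    for application, interface, second, elapsed_ms in records:
        groups[(application, interface, second)].append(elapsed_ms)
    return {
        key: {"requests": len(values), "avg_ms": sum(values) / len(values)}
        for key, values in groups.items()
    }
```

In a real deployment the flushed records would be sent over the network and the aggregation would run in a separate service; the sketch keeps both in one process only to make the data flow visible.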

FIG. 5 is a schematic diagram illustrating a software system design:

Following a modular approach, the parts common to the different types of alarm services are extracted, and customization entry points for the variable parts are provided in the form of abstract interfaces; the invariant parts are fixed and driven by a unified scheduler and the template method, which minimizes the code maintenance cost while retaining a degree of extensibility.

As a specific embodiment, the flow data collection module: this is the entry point of the system and is responsible for collecting real-time flow data of the applications. The module interacts with the servers of other applications, and the data collected through buried points is sent to the alarm system.

Real-time alarm processing module: the module receives data from the flow data collection module and makes alarm decisions based on the preset thresholds. Its input is flow data and its output is a possible alarm event.

Threshold elastic calculation module: the module is triggered when the real-time alarm processing module generates a possible alarm event. It is responsible for checking whether the current threshold is appropriate and whether an adjustment is required. If the threshold needs to be adjusted, the module updates the threshold used by the real-time alarm processing module.

Abnormality locating module: the module is invoked if, after verification by the threshold elastic calculation module, an abnormality is confirmed. It determines possible reasons for the abnormality by analyzing information such as the request link, the log output content and the hardware resource load.

Alarm notification and noise reduction module: finally, the module is responsible for sending alarm information to the person responsible for the interface. Its input is the abnormality cause from the abnormality locating module and the contact information of the person responsible for the interface. The alarm notification is sent through the channel chosen by the responsible person.

FIG. 6 is a flow chart of red line rule calculation;

(1) After the alarm system receives the kafka message, the data is first written into the influxdb. If the writing fails, the process is directly interrupted.

(2) All existing alarm configurations belonging to the red line rule are obtained from redis according to the belonging application and the belonging interface in the kafka message. If no alarm rule is configured, the process of the method is ended.

(3) And calling an alarm scheduler, and executing rule analysis and alarm calculation.

(4) If the result of the calculation is that an alarm needs to be raised, the noise reduction algorithm of the alarm notification module is called to judge whether to send the message. If the message is to be suppressed by noise reduction, the process ends.

(5) If noise reduction does not apply, the alarm notification module sends a notification message to the responsible person.
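The control flow of steps (1) to (5) can be sketched as a single function with its collaborators passed in. All names here (`process_red_line`, `write_point`, `load_rules`, `evaluate`, `should_suppress`, `notify`, and the returned status strings) are hypothetical placeholders for the influxdb write, the redis lookup, the alarm scheduler and the notification module; the sketch only illustrates the early-exit structure of the flow.

```python
def process_red_line(message, write_point, load_rules, evaluate, should_suppress, notify):
    """Run one kafka message through the red line flow and report where it ended."""
    # (1) Persist the data first; a failed write interrupts the process.
    try:
        write_point(message)
    except Exception:
        return "write_failed"

    # (2) Fetch the red line rules for this application and interface.
    rules = load_rules(message["application"], message["interface"])
    if not rules:
        return "no_rules"

    # (3) Rule analysis and alarm calculation.
    hits = [rule for rule in rules if evaluate(rule, message)]
    if not hits:
        return "no_alarm"

    # (4) Ask the noise reduction logic whether the message should be held back.
    if should_suppress(message):
        return "suppressed"

    # (5) Deliver the notification to the responsible person.
    notify(message, hits)
    return "notified"
```

Passing the collaborators as callables keeps the sketch independent of any particular client library and makes each exit point easy to exercise in isolation.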

FIG. 7 is a flowchart of another rule algorithm;

(2) Based on the application and interface to which the kafka message belongs, the existing configuration of all other rules not belonging to the red line rule is obtained from redis. If no alarm rule is configured, the process of the method is ended.

(3) According to the time configured in the rule, the flow data and rule configuration are put into a time wheel data structure to wait for the time wheel to trigger the alarm calculation. The delay before triggering is set to the time configured in the rule.

(4) The alarm calculation is triggered according to the configured delay.

(5) The rule set is traversed, and the flow data and the rules are passed to the alarm core analysis scheduler, which executes rule analysis and alarm calculation.

(6) If the result of the calculation is that an alarm needs to be raised, the noise reduction algorithm of the alarm notification module is called to judge whether to send the message. If the message is to be suppressed by noise reduction, the process ends.

(7) If noise reduction does not apply, the alarm notification module sends a notification message to the responsible person.
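The time wheel mentioned in step (3) can be illustrated with a minimal single-level timing wheel. The patent does not describe the structure in detail, so the class `TimeWheel`, its slot count and the tick-based interface are assumptions made for this sketch; one tick could stand for one second or one minute.

```python
class TimeWheel:
    """Minimal single-level timing wheel; the cursor advances one slot per tick."""

    def __init__(self, slots=3600):
        self._slots = [[] for _ in range(slots)]
        self._current = 0

    def schedule(self, delay_ticks, task):
        """Register a zero-argument callable to run after delay_ticks ticks."""
        if delay_ticks < 1:
            raise ValueError("delay must be at least one tick")
        size = len(self._slots)
        # Number of full revolutions the task must wait before it is due.
        rounds = (delay_ticks - 1) // size
        index = (self._current + delay_ticks) % size
        self._slots[index].append((rounds, task))

    def tick(self):
        """Advance one slot, run the tasks that are due, and return their results."""
        self._current = (self._current + 1) % len(self._slots)
        due, waiting = [], []
        for rounds, task in self._slots[self._current]:
            if rounds == 0:
                due.append(task)
            else:
                waiting.append((rounds - 1, task))
        self._slots[self._current] = waiting
        return [task() for task in due]
```

A delayed alarm calculation would be scheduled with the delay taken from the rule, for example `wheel.schedule(rule_minutes * 60, run_calculation)` when one tick is one second (both names are placeholders). Scheduling and firing each touch only one slot, which is the usual reason for choosing a wheel over a sorted queue.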

FIG. 8 is a flowchart of the operation of the overall software, which is as follows:

(1) Logging in to the system, selecting an application/interface, and setting various alarm rules as needed;

(2) When a user accesses certain functions, the flow information is asynchronously summarized into the system database by means of buried points;

(3) According to the configured rules, each piece of flow data is aggregated and calculated to judge whether it meets the alarm condition;

(4) Alarm information is sent to the person responsible for the interface through enterprise WeChat, WeChat official account, mailbox, short message and other channels.

As a specific embodiment, the alarm rules are customized for the particular characteristics of the stock finance industry: different thresholds are configured per time-of-day stage, and the threshold configuration is automatically optimized by a machine learning algorithm. The specific steps are as follows:

Observation of the stock finance industry shows that a trading day can be divided into three stages: call auction (collective bidding), in-session trading, and after-close. When designing thresholds, the invention therefore configures flow thresholds for the three stages separately. The threshold types comprise the number of requests per second (hereinafter QPS) and the average response time per second (hereinafter RT) of a given interface. In addition, an automatic algorithm is introduced that analyzes the historical data of the three stages, calculates a more reasonable threshold and sets the latest threshold automatically, reducing the probability of false alarms. The method is as follows:

logging in the system, selecting application/interface, setting various alarm rules according to the need, including:

1. [Real-time rule] During the call auction / in-session / after-close stage, QPS/RT exceeds [variable Y].

2. [Delay rule] During the call auction / in-session / after-close stage, within [variable T] minutes, [variable P]% of QPS/RT values exceed [variable Y].

3. [Delay rule] During the call auction / in-session / after-close stage, within [variable T] minutes, values exceeding [variable Y] occur [variable X] times.

4. According to the trend within [variable T], QPS/RT is predicted to be about to exceed [variable Y].

Meaning of each variable:

Call auction / in-session / after-close represents the stage of the stock market and is a single-choice option; each time an alarm is calculated, the stage the market is currently in determines which threshold rule is applied.

[Variable T] represents time in minutes and is an integer, such as 5 minutes or 30 minutes; at most 1 hour is supported, because data older than 1 hour has no reference value.

[Variable P] represents a percentage, a floating point number with 2 decimal places and a maximum of 100.

[Variable X] represents a number of occurrences and is an integer.

[Variable Y] represents the specific threshold value, whose type depends on the service scenario; currently QPS is an integer and RT is a floating point number.
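The variable constraints above could be captured in a small configuration type, sketched here as a hedged illustration. The names `Stage`, `Metric` and `AlarmRule` and their field names are hypothetical; the lower bounds (T of at least 1 minute, P greater than 0, X not negative) are assumptions, since the text only states the upper limits and the types.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Stage(Enum):
    CALL_AUCTION = "call_auction"
    IN_SESSION = "in_session"
    AFTER_CLOSE = "after_close"


class Metric(Enum):
    QPS = "qps"
    RT = "rt"


@dataclass(frozen=True)
class AlarmRule:
    stage: Stage                              # single-choice market stage
    metric: Metric
    threshold_y: float                        # [variable Y]
    window_minutes_t: Optional[int] = None    # [variable T]; None for real-time rules
    percent_p: Optional[float] = None         # [variable P]
    count_x: Optional[int] = None             # [variable X]

    def __post_init__(self):
        t, p, x = self.window_minutes_t, self.percent_p, self.count_x
        if t is not None and not (isinstance(t, int) and 1 <= t <= 60):
            raise ValueError("T must be an integer number of minutes, at most 60")
        if p is not None and not (0 < p <= 100 and round(p, 2) == p):
            raise ValueError("P must be in (0, 100] with at most 2 decimal places")
        if x is not None and not (isinstance(x, int) and x >= 0):
            raise ValueError("X must be a non-negative integer")
        if self.metric is Metric.QPS and int(self.threshold_y) != self.threshold_y:
            raise ValueError("a QPS threshold must be an integer")
```

Validating at construction time means a malformed rule is rejected when it is saved rather than when an alarm calculation runs.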

Detailed description of rule meanings:

[Real-time rule] During the call auction / in-session / after-close stage, QPS/RT exceeds [variable Y]: when the QPS or response time in the flow data exceeds the set threshold [variable Y], an alarm is sent directly.

[Delay rule] During the call auction / in-session / after-close stage, within [variable T] minutes, [variable P]% of QPS/RT values exceed [variable Y]: after flow data is received, a delayed calculation task is issued according to the time setting [variable T]. When the delayed calculation task is triggered, all historical data within [variable T] is queried from the influxdb and sorted by value; the value located at [variable P]% of the sorted historical data is then calculated according to the configured percentage [variable P], and this percentile value is compared with the value in the flow data. If the percentile value is smaller than the value in the flow data, an alarm is triggered; otherwise, the alarm calculation ends.
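The percentile comparison in this delay rule can be sketched as follows. The text does not say which percentile definition is used, so the nearest-rank method here is an assumption, and the function names `percentile_nearest_rank` and `percentile_rule_triggers` are hypothetical. The comparison direction follows the description: an alarm is triggered when the percentile of the history is smaller than the current value.

```python
import math


def percentile_nearest_rank(values, percent):
    """Return the value at the given percentile using the nearest-rank method."""
    if not values:
        raise ValueError("history must not be empty")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    ordered = sorted(values)
    # Nearest rank: the smallest rank covering at least percent% of the data.
    rank = math.ceil(percent * len(ordered) / 100)
    return ordered[rank - 1]


def percentile_rule_triggers(history, percent_p, current_value):
    """Trigger when the P-th percentile of the window is below the current value."""
    return percentile_nearest_rank(history, percent_p) < current_value
```

For a window of ten values from 10 to 100 in steps of 10 and P = 90, the nearest-rank percentile is 90, so a current value of 95 would trigger and a current value of 85 would not.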

[Delay rule] During the call auction / in-session / after-close stage, within [variable T] minutes, values exceeding [variable Y] occur [variable X] times: after flow data is received, a delayed calculation task is issued according to the time setting [variable T]. When the delayed calculation task is triggered, all historical data within [variable T] is queried from the influxdb, and the historical data points exceeding the configured threshold [variable Y] are counted. If the count is larger than the threshold [variable X], an alarm is triggered; otherwise, the alarm calculation ends.
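The counting rule reduces to a window filter plus a count, as in this minimal sketch. The function name `count_rule_triggers` and the representation of history as `(timestamp_seconds, value)` pairs are assumptions made for illustration; in the described system the window query would be answered by influxdb.

```python
def count_rule_triggers(points, now, window_minutes_t, threshold_y, count_x):
    """Trigger when more than count_x points in the last T minutes exceed threshold_y.

    points is an iterable of (timestamp_seconds, value) pairs.
    """
    window_start = now - window_minutes_t * 60
    exceeding = sum(
        1
        for timestamp, value in points
        if window_start <= timestamp <= now and value > threshold_y
    )
    return exceeding > count_x
```

With five points in a 5-minute window of which three exceed Y = 1000, the rule triggers for X = 2 but not for X = 3, because the description requires the count to be strictly larger than X.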

According to the trend within [variable T], QPS/RT is predicted to be about to exceed [variable Y]: after flow data is received, a delayed calculation task is issued according to the time setting [variable T]. When the delayed calculation task is triggered, all historical data within [variable T] is queried from the influxdb, and a predicted value is calculated from the historical data using a prediction algorithm. If the predicted value is larger than the threshold [variable Y], an alarm is triggered; otherwise, the alarm calculation ends.

A future result is predicted from a segment of historical data. If the result is a fixed category, for example a judgment of whether the value will surge or not, the task is a classification problem, for which common algorithms are k-nearest neighbor, naive Bayes, decision tree and logistic regression. If the predicted result is a continuous value, as in the second case of calculating a trend, the task is a regression problem, for which common algorithms are linear regression and ridge regression. Taking linear regression as an example, calling the API of the algorithm with the input variables yields a result formula such as:

$$y = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b$$

where w represents a weight, x represents a specific value and b represents the bias term. Taking the scenario of the invention as an example, set the weight of flow data from 30 minutes to 1 hour ago to 0.2, the weight of flow data from 10 to 30 minutes ago to 0.3, and the weight of flow data within the last 10 minutes to 0.5, with a bias term of 0; the flow data values (QPS or RT) and the weights are then substituted into the formula.
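The weighted sum described above can be illustrated with the stated weights (0.2, 0.3, 0.5) and a bias of 0. The function names are hypothetical, and the sketch assumes that each time window has already been reduced to a single representative value such as its average, which the text does not specify.

```python
# Weights from the example: 30-60 min ago, 10-30 min ago, last 10 min.
WINDOW_WEIGHTS = (0.2, 0.3, 0.5)


def predict_linear(features, weights, bias=0.0):
    """Evaluate y = w1*x1 + ... + wn*xn + b for one set of inputs."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(w * x for w, x in zip(weights, features)) + bias


def trend_rule_triggers(value_30_to_60, value_10_to_30, value_last_10, threshold_y):
    """Trigger when the predicted QPS or RT is larger than threshold_y."""
    predicted = predict_linear(
        (value_30_to_60, value_10_to_30, value_last_10), WINDOW_WEIGHTS
    )
    return predicted > threshold_y
```

For window values of 100, 200 and 300 the prediction is 0.2·100 + 0.3·200 + 0.5·300 = 230, so a threshold of 200 would trigger the rule and a threshold of 250 would not. The larger weight on the most recent window makes the prediction follow recent changes more closely.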

The intelligent alarm system further comprises a threshold optimization algorithm and an AI intelligent positioning algorithm. When an alarm is triggered, the threshold optimization algorithm is executed first, with the aim of reducing the probability of false notifications. The algorithm first makes a preliminary judgment based on the response time of the current flow: if the response time is smaller than 1 s, the latest value directly replaces the original threshold. If it is larger than 1 s, the P99 value is calculated from the historical data of the past three weeks using the Quantile algorithm; if the P99 value is larger than the current alarm value, the P99 value directly replaces the original threshold, and if the P99 value is smaller than or equal to the current alarm value, the alarm continues to be triggered.

The AI intelligent positioning algorithm uses data from all available measurement systems together with historical abnormality locating results, and trains a model by gradient descent, $w_{i+1} = w_i - d_i \cdot \eta_i$, $i = 0, 1, \dots$, where $w_i$ represents the weight value, the parameter $\eta$ is the learning rate, and $d_i$ represents the gradient of the loss function total_loss with respect to the weight value $w_i$. Whenever an alarm message needs to be sent, the model is used to calculate the possible causes of the abnormality.
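The threshold optimization branch can be sketched as below. The function names are hypothetical, the nearest-rank P99 is an assumed stand-in for the Quantile algorithm named in the text, and a response time of exactly 1 s is treated like the "larger than 1 s" case because the text leaves that boundary open.

```python
import math


def p99(values):
    """Nearest-rank 99th percentile (an assumed quantile definition)."""
    if not values:
        raise ValueError("history must not be empty")
    ordered = sorted(values)
    rank = math.ceil(99 * len(ordered) / 100)
    return ordered[rank - 1]


def optimize_threshold(current_rt_ms, history_rt_ms, old_threshold_ms):
    """Return (threshold_to_use, keep_alarm) for a response time that raised an alarm."""
    if current_rt_ms < 1000:
        # Below 1 s: adopt the latest value as the new threshold, no alarm.
        return current_rt_ms, False
    p99_value = p99(history_rt_ms)
    if p99_value > current_rt_ms:
        # History shows such values are within the normal range: raise the threshold.
        return p99_value, False
    # P99 <= current value: keep the old threshold and continue the alarm.
    return old_threshold_ms, True
```

Returning the decision together with the threshold lets the caller update the stored rule and continue or stop the alarm flow in one step.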

The system also comprises an alarm message notification and noise reduction algorithm. From the inventor's study of and experience with historical anomalies, 99% of anomalies are continuous. For example, an interface whose response time is normally only 30 ms may jump to 3 s when an anomaly occurs, and its response time then keeps fluctuating around 3 s until the anomaly is repaired. If every request triggered an alarm, the result would be an information bombardment that interferes with locating the problem; if only a single alarm were sent, the alarm could be missed, or it would be unclear whether the problem has been resolved. The invention therefore provides a noise reduction method for alarm messages: when setting an alarm rule, the user can choose whether to enable noise reduction (without noise reduction, every anomaly produces an alarm) and can set the importance (priority) of the interface. The importance determines the initial noise reduction time interval. The selectable importance options are: 1. core interface; 2. important interface; 3. normal interface; 4. auxiliary interface.

Core interface: the interface is critical, and the whole system will crash if it has a problem. An interface of this type alarms every 1 minute.

Important interface: the interface is important; if it has a problem, some modules become unavailable, but the operation of other modules is not affected. An interface of this type alarms every 5 minutes, and after 10 consecutive alarms the system automatically upgrades its importance to core interface.

Normal interface: the interface serves a single function, and a problem affects only that function. An interface of this type alarms every 10 minutes, and after 10 consecutive alarms the system automatically upgrades its importance to important interface.

Auxiliary interface: even if the interface has a problem, users will not notice it, be affected by it, or be annoyed by it. After one alarm, an interface of this type does not alarm again within 1 hour, and after 8 consecutive alarms the system automatically upgrades its importance to normal interface.

In addition, the system administrator can freely choose whether alarm messages are delivered by mailbox, enterprise WeChat, or short message, which also helps improve the delivery rate of alarm messages.
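The noise reduction rules for the four importance levels can be sketched as a small state machine. The class name `NoiseReducer`, the `LEVELS` table layout, and the `on_recovery` method are hypothetical; the text does not say when the consecutive-alarm counter is reset, so this sketch assumes it is reset on an upgrade and when the anomaly is repaired.

```python
from dataclasses import dataclass
from typing import Optional

# importance -> (alarm interval in seconds, alarms before upgrade, next level)
LEVELS = {
    "core": (60, None, None),
    "important": (300, 10, "core"),
    "normal": (600, 10, "important"),
    "auxiliary": (3600, 8, "normal"),
}


@dataclass
class NoiseReducer:
    importance: str
    noise_reduction: bool = True
    last_alarm_ts: Optional[float] = None
    consecutive: int = 0

    def on_anomaly(self, ts: float) -> bool:
        """Return True if an alarm should be sent for an anomaly at time ts."""
        if not self.noise_reduction:
            return True  # without noise reduction every anomaly alarms
        interval, limit, next_level = LEVELS[self.importance]
        if self.last_alarm_ts is not None and ts - self.last_alarm_ts < interval:
            return False  # still inside the quiet interval
        self.last_alarm_ts = ts
        self.consecutive += 1
        if limit is not None and self.consecutive >= limit:
            # Automatic upgrade after enough consecutive alarms.
            self.importance = next_level
            self.consecutive = 0
        return True

    def on_recovery(self) -> None:
        """Assumed reset once the anomaly has been repaired."""
        self.last_alarm_ts = None
        self.consecutive = 0
```

For example, an auxiliary interface that stays abnormal is alarmed once per hour, and after the eighth alarm it is treated as a normal interface with a 10-minute interval.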

The method also comprises an alarm calculation flow based on flow data, which comprises the following steps:

When a user accesses a function, the flow information is asynchronously summarized into the database of the alarm system by means of buried points (instrumentation). Specifically: when a C-terminal user accesses a function, a request is sent to the server over a protocol such as HTTP; after the request enters the web server, the request data is stored in the server's memory by the buried point. After the web server has processed the request and obtained a result, the response time of the request is stored in the server's memory before the result is returned over the protocol. Meanwhile, each web server starts a background thread that, every 1 second, summarizes all the data currently in memory and pushes it to the Kafka message middleware. By listening to the Kafka messages, the alarm system aggregates the data of the same application, the same interface, and the same second to obtain the total number of requests, the total response time, and the average response time, and then stores the calculated data in the InfluxDB time-series database for later use.
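The per-second aggregation performed by the alarm system can be sketched as below. The sketch deliberately leaves out the Kafka consumer and the InfluxDB write and works on an in-memory list; the record layout `(application, interface, timestamp, response time in ms)` and the function name `aggregate` are assumptions made for illustration.

```python
from collections import defaultdict


def aggregate(records):
    """Group records by (application, interface, second).

    records: iterable of (app, interface, epoch_seconds, response_ms).
    Returns a dict keyed by (app, interface, second) with the total request
    count, total response time, and average response time of each bucket.
    """
    buckets = defaultdict(lambda: [0, 0.0])  # key -> [count, total_rt_ms]
    for app, interface, ts, rt_ms in records:
        bucket = buckets[(app, interface, int(ts))]
        bucket[0] += 1
        bucket[1] += rt_ms
    return {
        key: {
            "total_requests": count,
            "total_rt_ms": total,
            "avg_rt_ms": total / count,
        }
        for key, (count, total) in buckets.items()
    }
```

Each resulting entry corresponds to one data point that would be written to the time-series database.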

Each piece of flow data is then aggregated and evaluated against the configured rules to judge whether it meets an alarm condition, as follows: first, all configured alarm threshold rules are queried by application and interface; then, data such as the number of requests and the response time are queried at one-second granularity; finally, the rule set and the data are passed to a rule engine to obtain the rule hit result. If no rule is hit, no alarm is needed; if a rule is hit, a threshold has been reached and an alarm may be required.
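A rule engine of the kind described above can be sketched in a few lines. The `Rule` fields, the metric names, and the `hit_rules` function are hypothetical; the text does not specify the rule format, so simple comparison rules are assumed.

```python
import operator
from dataclasses import dataclass

OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


@dataclass(frozen=True)
class Rule:
    app: str
    interface: str
    metric: str       # e.g. "total_requests" or "avg_rt_ms" (assumed names)
    op: str           # one of the keys of OPS
    threshold: float


def hit_rules(rules, app, interface, point):
    """Return the rules of this application/interface that the data point hits.

    point: dict of metric name -> value for one second of traffic.
    """
    return [
        r for r in rules
        if r.app == app
        and r.interface == interface
        and r.metric in point
        and OPS[r.op](point[r.metric], r.threshold)
    ]
```

An empty result means no alarm is needed; a non-empty result means a threshold has been reached and the flow continues with the threshold check.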

If an alarm may be needed, the alarm threshold itself may be unreasonable, so the threshold elastic calculation module is called to check or reset the threshold. The specific steps of the threshold elastic calculation module are as follows:

(1) Query the historical flow data of the past month by application and interface.

(2) Judge whether the current QPS/RT conforms to the historical pattern.

(3) If it conforms, substitute today's data into the threshold calculation algorithm, calculate the latest threshold, reset the threshold, and end the alarm operation.

(4) If it does not conform, an abnormal condition has occurred, and the alarm flow continues to execute.
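The control flow of steps (1) to (4) can be sketched as follows. The text does not define how conformity with the historical pattern is judged or how the new threshold is calculated, so both are passed in as callables; the function name `elastic_check` and the two stand-in lambdas are placeholders for illustration only, not the patented algorithms.

```python
def elastic_check(history, today, threshold, conforms, compute_threshold):
    """Return (threshold, continue_alarm).

    history: values (e.g. QPS) from the past month for one app + interface.
    today: values observed today.
    conforms: callable(history, today) -> bool, step (2).
    compute_threshold: callable(today) -> new threshold, step (3).
    """
    if conforms(history, today):
        # Step (3): reset the threshold and end the alarm operation.
        return compute_threshold(today), False
    # Step (4): an abnormal condition, so the alarm flow continues.
    return threshold, True


# Placeholder policies, assumed purely for the example:
# "conforms" if today's peak does not exceed the historical peak,
# and the new threshold is today's peak.
within_history_peak = lambda history, today: max(today) <= max(history)
peak_of_today = lambda today: max(today)
```

With these placeholders, `elastic_check([1800, 2600, 2400], [2100], 2000, within_history_peak, peak_of_today)` resets the threshold, while a peak far above anything in the history leaves the threshold unchanged and lets the alarm proceed.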

If the abnormal condition is confirmed, the abnormality positioning module algorithm is called, and the possible causes of the abnormality are given by an AI model training algorithm based on information such as the request link, the log output content, and the hardware resource load. The specific steps of the abnormality positioning module are as follows:

(1) By application + interface + time (second), call the open interface of the call chain system (such as SkyWalking) to query all calls within the current second, and then analyze them in turn for obvious timeouts, such as Redis operation timeouts, MySQL operation timeouts, and remote call timeouts.

(2) By application, call the open interface of the log collection system (such as ELK) to query all exception logs; then, according to the type and description of each exception log, call the company knowledge base interface in turn and use a machine learning AI algorithm to work out a detailed explanation, the possible causes, and a solution for each exception.

(3) By application, query the addresses of all hardware resources the application depends on; then, by time (second), call the open interface of the company hardware monitoring system (such as Prometheus) in turn to query the current hardware load of the machines hosting the application, and judge whether the hardware shows anomalies such as excessive CPU load, full memory, or too many connections. Depending on the anomaly, solutions such as adding machines, adding memory, or raising the maximum connection limit are proposed.

(4) Summarize and de-duplicate all the analysis results and solutions obtained in steps (1), (2) and (3) to obtain a list of all possible causes.
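The summarizing and de-duplication of step (4) can be sketched as below. The sketch assumes each of the three analyses returns a list of `(cause, solution)` text pairs and treats two findings as duplicates when they match after trimming and lower-casing; the function name `merge_findings` and this matching rule are assumptions, and the queries to the call chain, log, and hardware monitoring systems are not shown.

```python
def merge_findings(*sources):
    """Merge findings from several analyses, keeping first-seen order.

    sources: lists of (cause, solution) string pairs, one list per analysis
    (call chain, logs, hardware load).
    """
    seen = set()
    merged = []
    for findings in sources:
        for cause, solution in findings:
            key = (cause.strip().lower(), solution.strip().lower())
            if key not in seen:
                seen.add(key)
                merged.append((cause, solution))
    return merged
```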

Alarm information is sent to the person responsible for the interface through enterprise WeChat, a WeChat official account, mailbox, short message, and other channels. The steps are as follows:

(1) The responsible person logs in to the alarm system in advance, selects the applications he or she is responsible for, and selects and sets an alarm mode.

(2) When an alarm is triggered, the system queries the responsible person and the alarm mode by application, composes the message text, and sends the alarm message in the configured mode.

The following is a template example of a message:

Policy master service, /search/stock interface: the intraday QPS is higher than 3000, and 30% of the response times within 1 minute exceed 300 ms. Possible causes:

1. The Redis load is too high; check whether there is a large-key query scenario, or upgrade the hardware configuration of the Redis instance.

2. The remote call to the WeChat interface for obtaining the Access_token responds too slowly; consider increasing caching as appropriate.

3. The QPS value is significantly higher than the historical data, possibly caused by continuous retry requests from the front end; a check is suggested.
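Composing a message of this shape and looking up its recipients can be sketched as follows. The function names `render_alert` and `route`, the argument layout, and the `owners` mapping are hypothetical; actually sending through enterprise WeChat, mailbox, or short message is not shown.

```python
def render_alert(app, interface, symptoms, causes):
    """Build the alarm text: one header line, then numbered possible causes."""
    header = f"{app}, {interface} interface: {'; '.join(symptoms)}. Possible causes:"
    lines = [f"{i}. {cause}" for i, cause in enumerate(causes, start=1)]
    return "\n".join([header, *lines])


def route(app, owners):
    """Return the (person, channel) pairs configured for an application.

    owners: dict mapping application name -> list of (person, channel),
    as set up in advance by each responsible person.
    """
    return owners.get(app, [])
```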

As a specific embodiment, in a large e-commerce website, a company uses an alarm system based on flow data to perform service monitoring and maintenance, which is specifically as follows:

1. Threshold configuration: the service responsible person configures thresholds for the number of requests and the response time according to the importance of the commodity search interface. For example, an alarm is triggered if, within any 60-second sliding window, the number of requests exceeds 2000 or the average response time exceeds 500 ms.

2. Flow data collection: in the commodity search interface, each time a user initiates a search request, the system records information such as the request time and response time, and asynchronously summarizes the flow data into the database of the alarm system.

3. Real-time alarm processing: the alarm system monitors the collected flow data in real time, finds that the number of requests to the search interface exceeds 2000 within a certain 60-second sliding window, and triggers an alarm.

4. Threshold elastic calculation: after the alarm is triggered, the system calls the threshold elastic calculation module to check and reset the threshold. Analysis shows that a promotional activity is at its peak, so the increase in the number of requests is expected; the system automatically adjusts the threshold and avoids a false alarm.

5. Abnormality positioning: when subsequent flow data triggers the alarm again, the system confirms the abnormal state and invokes the abnormality positioning module. By analyzing information such as the request link, the logs, and the hardware load, the system determines that a database query operation is taking too long to respond.

6. Alarm notification and noise reduction: the alarm system sends the alarm information to the service responsible person through enterprise WeChat and records the alarm. If the same alarm is triggered again within a short time, the system reduces the alarm frequency to avoid disturbing the service responsible person.
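The 60-second sliding window rule from step 1 of this embodiment can be sketched as below, fed with the per-second data points collected in step 2. The class name `SlidingWindow` and its parameters are hypothetical, and the sketch assumes data points arrive in time order with one point per second.

```python
from collections import deque


class SlidingWindow:
    """Alarm when requests or average response time exceed the limits
    within the most recent window_s seconds."""

    def __init__(self, window_s=60, max_requests=2000, max_avg_rt_ms=500):
        self.window_s = window_s
        self.max_requests = max_requests
        self.max_avg_rt_ms = max_avg_rt_ms
        self.points = deque()  # (second, requests, total_rt_ms)
        self.requests = 0
        self.total_rt_ms = 0.0

    def add(self, second, requests, total_rt_ms):
        """Add one per-second data point; return True if an alarm triggers."""
        self.points.append((second, requests, total_rt_ms))
        self.requests += requests
        self.total_rt_ms += total_rt_ms
        # Evict points that have fallen out of the window.
        while self.points[0][0] <= second - self.window_s:
            _, old_requests, old_rt = self.points.popleft()
            self.requests -= old_requests
            self.total_rt_ms -= old_rt
        avg = self.total_rt_ms / self.requests if self.requests else 0.0
        return self.requests > self.max_requests or avg > self.max_avg_rt_ms
```

Keeping running totals means each new data point is processed in amortized constant time rather than re-summing the whole window.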

In general, the alarm system based on the flow data accurately captures the abnormal situation and timely informs the service responsible person, so that the service responsible person can rapidly locate the problem and take measures, and the stable operation of the service is ensured.

The above detailed description is merely illustrative of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Various modifications, substitutions and improvements of the technical scheme of the present invention will be apparent to those skilled in the art from the description and drawings provided herein without departing from the spirit and scope of the invention. The scope of the invention is defined by the claims.

Claims

1. An alarm method based on flow data, comprising:

step S1: setting an alarm threshold value for an interface, which is specifically as follows:

step S12: setting alarm thresholds of real-time rules and delay rules according to the use condition and the requirement of an interface;

step S13: the threshold rule is selected to ensure the accuracy and practicability of the alarm;

step S2: request data and response time are collected and stored using the embedded point and background thread, specifically as follows:

step S21: when a request access is initiated, the request is sent to a server through a network protocol;

step S24: by listening to the Kafka messages, the alarm system aggregates and calculates the data generated by the same application and the same interface within the same second, and then stores the data in the InfluxDB time-series database for subsequent use and analysis, wherein the data service comprises querying model data by condition, querying model data by primary key, modifying model data by primary key, modifying model data in batches, adding model data in batches, deleting model data by primary key, deleting model data in batches, and an aggregated data service;

step S4: when the alarm is triggered, verifying and resetting the alarm threshold value, and eliminating false alarms caused by improper setting, specifically as follows:

step S41: after an alarm is sent out, checking whether false alarm is caused by unreasonable threshold setting;

step S42: according to the dimensions of the application and the interface, checking historical flow data in the past month, and judging whether the current data accords with the history rule;

step S44: if the data does not accord with the history rule, the abnormal situation is indicated to occur, and the next alarm flow needs to be continuously executed;

step S5: if the abnormality is confirmed, an algorithm is called, and an AI model is utilized to find out the reason, specifically, after the abnormal condition is confirmed, the algorithm of an abnormality positioning module is required to be called, and the abnormality positioning module uses the AI model to analyze based on the request link, the log output content and the hardware resource load condition information, so as to give out the reason of the abnormality;

step S6: according to the alarm frequency and the classification of interfaces, the alarm frequency and the notification mode are set, specifically:

step S63: the alarm information is sent to the responsible person of the interface through enterprise WeChat, WeChat official account, e-mail, and short message.

2. The method of claim 1, further comprising self-learning threshold adjustment, in particular:

step S71: collecting and storing historical flow data and alarm events;

step S72: learning and modeling the data using a machine learning algorithm;

step S73: and automatically adjusting the threshold value of the alarm according to the prediction result of the model.

3. The traffic data based alert method according to claim 1, further comprising predictive alerting:

step S81: collecting and storing historical flow data and alarm events;

step S82: predicting future flow and alarm events using a predictive machine learning model;

step S83: if the predicted result shows that an alarm event can occur, early warning is sent out in advance, and enough time is given to a service responsible person to prevent the occurrence of the problem.

4. A flow data based alarm system adapted to a flow data based alarm method as claimed in any one of claims 1 to 3, comprising: the system comprises a threshold configuration module, a flow data collection module, a real-time alarm processing module, a threshold elastic calculation module, an abnormality positioning module and an alarm notification and noise reduction module;

the threshold configuration module is responsible for setting an interface alarm threshold;

the flow data collection module is responsible for collecting request data and response time of each function in real time, and asynchronously summarizing flow information accessed by a user into a database of an alarm system in a buried point mode;

the real-time alarm processing module is used for judging whether each flow data reaches a preset alarm threshold value in real time, and if the flow data reaches the threshold value, an alarm needs to be triggered;

the threshold elastic calculation module is used for verifying and resetting the alarm threshold when the alarm is triggered;

the abnormality locating module is used for analyzing reasons through an algorithm after abnormality confirmation;

the alarm notification and noise reduction module is used for selecting an alarm notification mode according to the alarm frequency and the classification of the interface.