CN112416724B - Alarm processing method, system, computer device and storage medium - Google Patents

Publication number
CN112416724B
Authority
CN
China
Prior art keywords
data
alarm
baseline
information
transaction
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011401568.6A
Other languages
Chinese (zh)
Other versions
CN112416724A (en)
Inventor
李小波
李琪
赵子健
刘伯松
高昊阳
王�琦
耿金伶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Bank Corp
Original Assignee
China Construction Bank Corp

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/32 Monitoring with visual or acoustical indication of the functioning of the machine
    • G06F11/324 Display of status information
    • G06F11/327 Alarm or error message display

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The present disclosure provides an alarm processing method, system, computer device and storage medium. The alarm processing method comprises the following steps: periodically loading baseline data for a predetermined time period; reading transaction detail data from message middleware, and obtaining detail information from the transaction detail data; extracting index data based on business rules and the detail information; calculating business index statistical data based on the index data; for the business index statistical data, automatically determining whether to trigger an alarm based on a preset alarm rule and the baseline data, and, when the alarm is triggered, generating alarm information and raising the alarm; and locating and handling the abnormal object based on the alarm information. Because alarms and anomaly analysis are triggered automatically through flexibly configured alarm rules and baseline data, the method reduces the time operation and maintenance personnel spend analyzing and investigating problems, enables rapid anomaly handling, and improves system stability.

Description

Alarm processing method, system, computer device and storage medium
Technical Field
The present invention relates to the field of computer application technologies, and in particular, to an alarm processing method, an alarm processing system, a computer device, and a storage medium.
Background
In big data fields such as banking, insurance and e-commerce, each application system generates massive amounts of transaction data every day. Critical business systems must remain highly reliable, and system anomalies must be located and repaired quickly to avoid the serious losses caused by business system failures.
To meet business demand for real-time computing, the open source community has developed two mainstream stream computing frameworks, Spark Streaming and Flink. Spark Streaming processes streaming data in micro-batches: the incoming data stream is split by time, and the resulting pieces are processed as batches. Flink is a distributed stream processing engine that executes streaming programs in a data-parallel and pipelined manner. These frameworks solve the stream computation problem itself, but when an error occurs during computation or a monitored index becomes abnormal, operation and maintenance personnel must still analyze and investigate it. Because such problems can have many causes, such as physical conditions, human factors, system behavior and traffic factors, manual investigation is inefficient and cannot guarantee system stability.
At present, the stream computing field has no unified scheme for automatic root cause processing. Instead, when an anomaly occurs during stream computation, the scene of the incident is recorded while the computation continues uninterrupted, and root cause analysis is performed offline afterwards. If anomalies occur frequently, recording the scene consumes a large amount of resources, which reduces stream computing throughput, increases system latency, lowers operating efficiency and affects associated business processing.
Therefore, a full-process online system for automatic root cause processing is needed, one that reduces manual involvement, supports automatic triggering of root cause analysis for big data stream processing, and supports multi-dimensional expansion through flexible configuration.
Disclosure of Invention
In order to solve the problems or part of the problems in the prior art, the embodiment of the invention provides an alarm processing method, an alarm processing system, a computer device and a storage medium, which automatically trigger alarm and exception analysis processing through flexibly configured alarm rules and baseline data, can reduce analysis and investigation time of operation and maintenance personnel on the problems, and further realize rapid exception processing so as to improve the stability of the system.
According to a first aspect of the present invention, an embodiment of the present invention provides an alarm processing method, including: periodically loading baseline data for a predetermined time period; reading transaction detail data from message middleware, and obtaining detail information from the transaction detail data; extracting index data based on business rules and the detail information; calculating business index statistical data based on the index data; for the business index statistical data, automatically determining whether to trigger an alarm based on a preset alarm rule and the baseline data, and, when the alarm is triggered, generating alarm information and raising the alarm; and locating and handling the abnormal object based on the alarm information.
According to the embodiment of the invention, whether to trigger the alarm is automatically determined according to the preset alarm rule and the baseline data, and the alarm and the abnormal analysis processing are carried out according to the generated alarm information, so that the alarm and the root cause analysis can be automatically triggered, the analysis and investigation time of operation and maintenance personnel on the problems is reduced, and the work efficiency of departments is improved.
In some embodiments of the present invention, the alarm processing method further includes: storing the business index statistical data into the message middleware; and storing the business index statistical data from the message middleware into a data store.
The embodiment of the invention buffers and persists the business index statistical data through the message middleware and the data store.
In some embodiments of the present invention, the alarm processing method further includes: calculating the baseline data based on the historical business index statistical data held in the data store for the predetermined time period.
According to the embodiment of the invention, the baseline data is calculated according to the historical business index statistical data in a period of time, and the baseline data can be updated according to the historical data, so that more effective warning is realized according to the baseline data.
In some embodiments of the present invention, the alarm processing method further includes: acquiring a transaction data file according to the acquisition rule and the acquisition frequency; the transaction detail data are obtained after format verification, data analysis, data filtering and associated information supplementation are carried out on the transaction data file; and storing the transaction detail data into the message middleware.
In some embodiments of the present invention, the traffic index statistics include: transaction amount, service success rate, system success rate, average response time, average processing time, long transaction amount, long transaction rate, application performance index.
According to the embodiment of the invention, the system can be subjected to multi-dimensional monitoring and alarm processing by acquiring the statistical data of various service indexes.
In some embodiments of the invention, calculating the baseline data based on the historical business index statistical data held in the data store for the predetermined time period comprises: removing abnormal data from the historical business index statistical data for the predetermined time period; and calculating the baseline data by applying an averaging algorithm to the historical business index statistical data that remains after the abnormal data is removed.
According to the embodiment of the invention, the more accurate baseline data can be obtained by eliminating the abnormal data, so that the alarm generated based on the baseline data is more accurate.
In some embodiments of the present invention, the preset alert rule includes: alarm time period, alarm level, whether to send alarm information, alarm suppression times, upper baseline threshold value and lower baseline threshold value.
The embodiment of the invention can realize various alarm requirements by setting various alarm rules.
In some embodiments of the invention, locating the abnormal object based on the alert information includes: acquiring an alarm index and alarm time from the alarm information; locating an abnormal physical machine based on the alarm index; and searching log information according to the alarm time to determine abnormal information.
According to the embodiment of the invention, the abnormal physical machine is automatically positioned according to the alarm information, and the abnormal information is determined by searching the log, so that the timely analysis of the abnormal root cause can be realized, and further, the operation and maintenance personnel can perform abnormal processing, so that the stability and the operation efficiency of the system are ensured.
According to a second aspect of the present invention, an embodiment of the present invention provides an alarm processing system, including: a baseline loading module, configured to periodically load baseline data for a predetermined time period; a stream computing module, configured to read transaction detail data from the message middleware and obtain detail information from the transaction detail data, and further configured to extract index data based on business rules and the detail information and to calculate business index statistical data based on the index data; an alarm triggering module, configured to automatically determine, for the business index statistical data, whether to trigger an alarm based on a preset alarm rule and the baseline data, and, when the alarm is triggered, to generate alarm information and raise the alarm; and a root cause analysis module, configured to locate the abnormal object based on the alarm information and handle it.
According to the embodiment of the invention, whether to trigger the alarm is automatically determined according to the preset alarm rule and the baseline data, and the alarm and the abnormal analysis processing are carried out according to the generated alarm information, so that the alarm and the root cause analysis can be automatically triggered, the analysis and investigation time of operation and maintenance personnel on the problems is reduced, and the work efficiency of departments is improved.
In some embodiments of the invention, the alarm processing system further comprises: the message middleware, configured to store the business index statistical data; and a data store, configured to store the business index statistical data sent by the message middleware.
The embodiment of the invention buffers and persists the business index statistical data through the message middleware and the data store.
In some embodiments of the invention, the alarm processing system further comprises: a baseline calculation module, configured to calculate the baseline data based on the historical business index statistical data held in the data store for the predetermined time period.
According to the embodiment of the invention, the baseline data is calculated according to the historical business index statistical data in a period of time, and the baseline data can be updated according to the historical data, so that more effective warning is realized according to the baseline data.
In some embodiments of the invention, the alert processing system further comprises: the data collector is used for acquiring transaction data files according to the collection rules and the collection frequency and sending the transaction data files to the data forwarder; the data forwarder is used for receiving the transaction data file sent by the data collector, carrying out format verification, data analysis, data filtering and associated information supplementation on the transaction data file to obtain the transaction detail data, and storing the transaction detail data into the message middleware.
In some embodiments of the present invention, the traffic index statistics include: transaction amount, service success rate, system success rate, average response time, average processing time, long transaction amount, long transaction rate, application performance index.
According to the embodiment of the invention, the system can be subjected to multi-dimensional monitoring and alarm processing by acquiring the statistical data of various service indexes.
In some embodiments of the invention, calculating the baseline data based on the historical business index statistical data held in the data store for the predetermined time period comprises: removing abnormal data from the historical business index statistical data for the predetermined time period; and calculating the baseline data by applying an averaging algorithm to the historical business index statistical data that remains after the abnormal data is removed.
According to the embodiment of the invention, the more accurate baseline data can be obtained by eliminating the abnormal data, so that the alarm generated based on the baseline data is more accurate.
In some embodiments of the present invention, the preset alert rule includes: alarm time period, alarm level, whether to send alarm information, alarm suppression times, upper baseline threshold value and lower baseline threshold value.
The embodiment of the invention can realize various alarm requirements by setting various alarm rules.
In some embodiments of the invention, locating the abnormal object based on the alert information includes: acquiring an alarm index and alarm time from the alarm information; locating an abnormal physical machine based on the alarm index; and searching log information according to the alarm time to determine abnormal information.
According to the embodiment of the invention, the abnormal physical machine is automatically positioned according to the alarm information, and the abnormal information is determined by searching the log, so that the timely analysis of the abnormal root cause can be realized, and further, the operation and maintenance personnel can perform abnormal processing, so that the stability and the operation efficiency of the system are ensured.
According to a third aspect of the present invention, embodiments provide a computer storage medium having stored thereon computer readable instructions which, when executed by a processor, cause a computer to perform operations comprising the steps of the alarm processing method described in any of the embodiments above.
According to a fourth aspect of the present invention, embodiments of the present invention provide a computer device comprising a memory and a processor, the memory being configured to store one or more computer instructions, wherein the one or more computer instructions, when executed by the processor, implement the alarm processing method according to any one of the embodiments above.
As can be seen from the foregoing, the alarm processing method, system, storage medium and computer device provided by the embodiments of the present invention automatically determine whether to trigger an alarm according to preset alarm rules and baseline data, and perform alarm and exception analysis processing according to generated alarm information, so that the alarm and root cause analysis can be automatically triggered, the analysis and investigation time of the operation and maintenance personnel on the problem is reduced, the rapid positioning and repair are performed, the significant damage caused by failure of the service system is avoided, and the high reliability of the system is ensured.
Drawings
FIG. 1 is a flow diagram of an alarm processing method according to one embodiment of the invention;
FIG. 2 is a flow chart of overall data flow in an alarm processing method according to an embodiment of the present invention;
FIG. 3 is a flow chart illustrating the streaming computation method of FIG. 2;
FIG. 4 is an architecture diagram of an alarm processing system according to one embodiment of the invention.
Detailed Description
Various aspects of the invention are described in detail below with reference to the drawings and detailed description. Well-known modules, units, and their connections, links, communications, or operations between each other are not shown or described in detail. Also, the described features, architectures, or functions may be combined in any manner in one or more implementations. It will be appreciated by those skilled in the art that the various embodiments described below are for illustration only and are not intended to limit the scope of the invention. It will be further appreciated that the modules or units or processes of the embodiments described herein and illustrated in the drawings may be combined and designed in a wide variety of different configurations.
The following is a brief description of the terminology used herein.
Real-time data: the information carrier captured while an event occurs and develops; the raw, unprocessed material that represents an objective thing.
Stream processing: continuously processing, aggregating and analyzing an unbounded data set. Unbounded data is a continuously growing data set with no defined end, also called streaming data.
Root cause analysis: when a macro-level index becomes abnormal, quickly locating the specific fine-grained index whose abnormality caused it.
Baseline: the result calculated from historical data, used as a benchmark for comparison.
Apdex: Application Performance Index, an index that reflects the overall health of an application.
Robustness: the ability of a control system to maintain certain performance characteristics under perturbation of certain parameters.
Elasticsearch: a search server that provides a distributed, multi-user full-text search engine.
Kibana: an open source analysis and visualization platform used with Elasticsearch.
Fig. 1 is a flow chart illustrating an alarm processing method according to an embodiment of the present invention.
As shown in fig. 1, in one embodiment of the present invention, the method may include: step S11, step S12, step S13, step S14, step S15, and step S16, which are specifically described below.
In step S11, baseline data for a predetermined period of time is loaded periodically.
In step S12, transaction detail data in the message middleware is read, and detail information is acquired according to the transaction detail data.
In step S13, index data is extracted based on the business rule and the detail information.
In step S14, business index statistical data is calculated based on the index data. In alternative embodiments, the business index statistical data may include, but is not limited to, the following: transaction amount, service success rate, system success rate, average response time, average processing time, long transaction amount, long transaction rate, application performance index.
For the above alternative embodiments, the business indexes are defined, by way of example, as follows:
(1) Transaction amount: the number of business transactions within a statistical period of 1 minute (or 10 seconds);
(2) Service success rate: the ratio of the number of successful business transactions to the total transaction amount within a statistical period of 1 minute (or 10 seconds);
(3) System success rate: the ratio of the number of successful system transactions to the total transaction amount within a statistical period of 1 minute (or 10 seconds);
(4) Average response time: the ratio of the total system response time to the total transaction amount within a statistical period of 1 minute (or 10 seconds);
(5) Average processing time: the ratio of the total business processing time to the total transaction amount within a statistical period of 1 minute (or 10 seconds);
(6) Long transaction amount: a long transaction is a transaction whose processing time is greater than a threshold, and the long transaction amount is the number of long transactions within a statistical period of 1 minute (or 10 seconds);
(7) Long transaction rate: the ratio of the long transaction amount to the total transaction amount within a statistical period of 1 minute (or 10 seconds);
(8) Application performance index (Apdex): the average transaction quality value of the transactions within a statistical period of 1 minute (or 10 seconds). The transaction quality is defined as follows:
First, an Apdex threshold is defined for each system; the quality value of a transaction is then:
the processing time is greater than the Apdex threshold: quality value 0;
the processing time is 75% to 100% of the Apdex threshold: quality value 50;
the processing time is 25% to 75% of the Apdex threshold: quality value 75;
the processing time is 0% to 25% of the Apdex threshold: quality value 100.
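The eight indexes above can be illustrated with a minimal Python sketch that aggregates one statistical period. The `Transaction` fields, the function names and the handling of the exact 25%, 75% and 100% boundaries are assumptions made for illustration; the average times are computed here as totals divided by the transaction count, which is one reading of definitions (4) and (5).

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Transaction:
    # All field names are hypothetical; the text does not define a record layout.
    business_ok: bool      # business-level success flag
    system_ok: bool        # system-level success flag
    response_ms: float     # system response time
    processing_ms: float   # business processing time


def quality_value(processing_ms: float, apdex_threshold_ms: float) -> int:
    """Map a processing time to the quality value bands listed above.

    Assumption: a time exactly on a band boundary falls into the faster band.
    """
    ratio = processing_ms / apdex_threshold_ms
    if ratio > 1.0:
        return 0
    if ratio > 0.75:
        return 50
    if ratio > 0.25:
        return 75
    return 100


def window_metrics(txs: List[Transaction], long_threshold_ms: float,
                   apdex_threshold_ms: float) -> Dict[str, float]:
    """Compute the eight indexes for the transactions of one statistical period."""
    total = len(txs)
    if total == 0:
        return {"transaction_amount": 0}
    long_count = sum(1 for t in txs if t.processing_ms > long_threshold_ms)
    return {
        "transaction_amount": total,
        "service_success_rate": sum(t.business_ok for t in txs) / total,
        "system_success_rate": sum(t.system_ok for t in txs) / total,
        "avg_response_ms": sum(t.response_ms for t in txs) / total,
        "avg_processing_ms": sum(t.processing_ms for t in txs) / total,
        "long_transaction_amount": long_count,
        "long_transaction_rate": long_count / total,
        "apdex": sum(quality_value(t.processing_ms, apdex_threshold_ms)
                     for t in txs) / total,
    }
```

In a streaming job the same aggregation would run once per 1-minute (or 10-second) window; the sketch leaves windowing to the surrounding framework.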
In step S15, for the business index statistics data, whether to trigger an alarm is automatically determined based on a preset alarm rule and the baseline data, and when the alarm is triggered, alarm information is generated and the alarm is performed.
Optionally, the preset alarm rules include, but are not limited to: alarm time period, alarm level, whether to send alarm information, alarm suppression times, upper baseline threshold and lower baseline threshold. The upper and lower baseline thresholds are determined from baseline data generated by two algorithms: the algorithm that yields the larger baseline data determines the upper baseline threshold, and the algorithm that yields the smaller baseline data determines the lower baseline threshold.
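As a hedged illustration of how such a rule might be evaluated, the following Python sketch checks one index value against the lower and upper baseline thresholds. The class and field names are hypothetical, and "alarm suppression times" is interpreted here as the number of consecutive violations that are swallowed before an alarm is raised; the text does not fix that meaning.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import Optional


@dataclass
class AlarmRule:
    # Hypothetical field names mirroring the rule attributes listed above.
    metric: str
    period_start: time     # alarm time period start
    period_end: time       # alarm time period end
    level: str             # alarm level
    send_alarm: bool       # whether to send alarm information
    suppress_times: int    # assumed: consecutive violations ignored before alarming


class AlarmEvaluator:
    def __init__(self, rule: AlarmRule) -> None:
        self.rule = rule
        self._consecutive = 0

    def check(self, value: float, lower: float, upper: float,
              now: datetime) -> Optional[dict]:
        """Return alarm information, or None when no alarm is triggered."""
        if not (self.rule.period_start <= now.time() <= self.rule.period_end):
            return None                  # outside the alarm time period
        if lower <= value <= upper:
            self._consecutive = 0        # back within the baseline band
            return None
        self._consecutive += 1
        if self._consecutive <= self.rule.suppress_times:
            return None                  # still suppressed
        return {
            "metric": self.rule.metric,
            "level": self.rule.level,
            "value": value,
            "time": now,
            "send": self.rule.send_alarm,
        }
```

The returned dictionary stands in for the alarm information; it carries the alarm index and alarm time that the later locating step relies on.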
In step S16, an abnormal object is located and processed based on the alarm information. In an alternative embodiment, locating the abnormal object based on the alarm information may specifically include: acquiring an alarm index and alarm time from the alarm information; locating an abnormal physical machine based on the alarm index; and searching log information according to the alarm time to determine abnormal information.
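The locating step can be sketched as follows, under stated assumptions: the alarm index is available broken down per physical machine, the abnormal machine is taken to be the one furthest from the baseline, and logs are an in-memory list of `(timestamp, host, message)` tuples. None of these details come from the text, which only names the three sub-steps; a deployment would query a log store instead.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

LogRecord = Tuple[datetime, str, str]  # (timestamp, host, message), hypothetical layout


def locate_machine(per_machine_value: Dict[str, float], baseline: float) -> str:
    """Pick the machine whose value for the alarm index deviates most from the baseline."""
    return max(per_machine_value,
               key=lambda host: abs(per_machine_value[host] - baseline))


def search_logs(logs: List[LogRecord], machine: str,
                alarm_time: datetime, window: timedelta) -> List[str]:
    """Return messages logged by `machine` within `window` of the alarm time."""
    return [msg for ts, host, msg in logs
            if host == machine and abs(ts - alarm_time) <= window]
```

For example, with a system success rate of 0.99 on one machine and 0.61 on another against a baseline of 0.98, the second machine is selected and its log lines around the alarm time are returned for inspection.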
By adopting the method of the embodiment of the invention, whether to trigger the alarm is automatically determined according to the preset alarm rule and the baseline data, and the alarm and the abnormal analysis processing are carried out according to the generated alarm information, so that the alarm and the root cause analysis can be automatically triggered, the analysis and investigation time of operation and maintenance personnel on the problem is reduced, the operation and maintenance personnel can rapidly locate and repair the problem, the serious damage caused by the failure of a service system is avoided, and the high reliability of the system is ensured.
In an alternative embodiment, the business index statistical data is stored into the message middleware, and the business index statistical data in the message middleware is then stored into a data store. The message middleware and the data store thus buffer and persist the business index statistical data. Optionally, the baseline data is calculated based on the historical business index statistical data held in the data store for the predetermined time period. Because the baseline data is calculated from historical business index statistical data over a period of time, it can be updated as the history changes, which makes alarms based on it more effective.
In other alternative embodiments, calculating the baseline data based on the historical business index statistical data held in the data store for the predetermined time period may include: removing abnormal data from the historical business index statistical data for the predetermined time period; and calculating the baseline data by applying an averaging algorithm to the historical business index statistical data that remains after the abnormal data is removed. Optionally, for the robustness of the system, the baseline data may be calculated not only after removing abnormal data from the historical business index statistical data but also by fusing several algorithms. The calculation method for the baseline data needs to be chosen according to the business scenario; for example, the average transaction amount for the same time period over the past 15 days may be calculated and used as the baseline value of the transaction amount for that time period. Furthermore, because the baseline changes continuously over time and the volume of baseline data is large, the baseline data for the predetermined time period must first be loaded on a timer in step S11 when practicing the present invention.
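A minimal sketch of the "remove abnormal data, then average" calculation is given below. The text does not say how abnormal data is identified; this sketch assumes a median absolute deviation (MAD) filter with a hypothetical cut-off factor `k`, which is only one of many possible choices.

```python
from statistics import mean, median
from typing import List


def baseline_value(history: List[float], k: float = 3.0) -> float:
    """Average the history for one time slot after dropping outliers.

    `history` holds the same index for the same time period on previous days
    (for example the past 15 days). The MAD filter and `k` are assumptions.
    """
    med = median(history)
    mad = median([abs(x - med) for x in history])
    kept = [x for x in history if abs(x - med) <= k * mad]
    return mean(kept)
```

With the history `[100, 102, 98, 101, 99, 1000]`, the value 1000 is discarded and the baseline is the mean of the remaining five values.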
In another alternative embodiment, the transaction data file is obtained according to the collection rules and collection frequency; the transaction detail data are obtained after format verification, data analysis, data filtering and associated information supplementation are carried out on the transaction data file; and storing the transaction detail data into the message middleware.
The present invention provides an example of overall data flow when implementing alarm processing according to the above-mentioned alarm processing method, and fig. 2 is a schematic flow chart of overall data flow in the alarm processing method according to an embodiment of the present invention.
As shown in fig. 2, in the alarm processing method according to the embodiment of the present invention, the data passes through a data collector, a data forwarder, a message queue, stream computation and a data store.
The data collector collects, in real time, the data files that satisfy the configured collection rules and collection frequency, and sends the content of those data files to the data forwarder. Optionally, a data collector is deployed on the client of each business application to collect the transaction data generated by that application.
The data forwarder receives the data file sent by the data collector, performs file format verification, data parsing, data filtering and associated information supplementation, and then sends the processed transaction detail data to the message middleware.
In an alternative embodiment, verifying the file format may include the following: each row of the data file is treated as one transaction detail record and is checked against a key-value pair format, key<value>. For example, the global event tracking number trn is written as trn<1020011011386988024816667>, meaning that the value of trn is 1020011011386988024816667. If a row of the data file does not conform to this format, that transaction detail record is discarded.
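The format check can be sketched in Python as below. The sketch assumes a row consists only of `key<value>` pairs, optionally separated by whitespace, with keys made of word characters and values free of angle brackets; the text gives a single-pair example only, so the multi-pair layout is an assumption.

```python
import re
from typing import Dict, Optional

# One key<value> pair, tolerating whitespace around it (assumed grammar).
_PAIR = re.compile(r"\s*(\w+)\s*<([^<>]*)>\s*")


def parse_record(line: str) -> Optional[Dict[str, str]]:
    """Parse one row into a dict, or return None if the row should be discarded."""
    pos = 0
    fields: Dict[str, str] = {}
    while pos < len(line):
        match = _PAIR.match(line, pos)
        if match is None:
            return None              # row does not conform to key<value>
        fields[match.group(1)] = match.group(2)
        pos = match.end()
    return fields or None            # an empty row is also discarded
```

Returning `None` corresponds to discarding the transaction detail record; the resulting dictionary is one possible form of the program-recognizable object produced by data parsing.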
In another alternative embodiment, data parsing may include: parsing each row of transaction detail data into an object that the program can recognize.
In other alternative embodiments, data filtering may include: checking whether the value of each attribute in the transaction detail data is valid. For example: a date must consist entirely of digits and be a valid date; a telephone number must be numeric; a coordinate must be a floating point number; a mandatory field must not be empty.
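The validity checks listed above might look like the following sketch. The field names (`trn`, `date`, `phone`, `lon`, `lat`), the `YYYYMMDD` date layout and the choice of mandatory fields are all hypothetical; only the four kinds of check come from the text.

```python
from datetime import datetime
from typing import Dict

MANDATORY_FIELDS = ("trn", "date")   # hypothetical choice of mandatory fields


def is_valid(record: Dict[str, str]) -> bool:
    """Return True if every checked attribute of the record is valid."""
    # Mandatory fields must be present and non-empty.
    if any(not record.get(name) for name in MANDATORY_FIELDS):
        return False
    # Date: digits only and a real calendar date (assumed YYYYMMDD layout).
    date = record["date"]
    if not (date.isascii() and date.isdigit() and len(date) == 8):
        return False
    try:
        datetime.strptime(date, "%Y%m%d")
    except ValueError:
        return False
    # Telephone: digits only, when present.
    phone = record.get("phone")
    if phone is not None and not (phone.isascii() and phone.isdigit()):
        return False
    # Coordinates: must parse as floating point numbers, when present.
    for key in ("lon", "lat"):
        if key in record:
            try:
                float(record[key])
            except ValueError:
                return False
    return True
```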
In other alternative embodiments, supplementing the associated information may include adding attributes such as the deployment unit and the physical subsystem. Currently, multiple APs (virtual machines) form a deployment unit, and multiple deployment units form a physical subsystem. This configuration information is stored in a database and is used when statistics are computed along the deployment unit or physical subsystem dimension.
Message middleware: generally a high-throughput distributed publish-subscribe messaging system used to buffer data; mainstream message middleware includes Kafka, RocketMQ and the like.
Stream computation: the core business logic is processed by a stream computing framework (e.g., Flink, Spark Streaming, etc.), and the computation results are pushed to the message middleware.
Data store: stores the stream computation results from the message middleware. A distributed full-text search engine is commonly used as the data store, for example Elasticsearch, with Kibana used for data queries.
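The supplementation step amounts to two lookups, from AP to deployment unit and from deployment unit to physical subsystem. The sketch below uses in-memory dictionaries as stand-ins for the configuration held in the database; the record key `ap` and the output keys are hypothetical.

```python
from typing import Dict, Optional


def enrich(record: Dict[str, str],
           ap_to_unit: Dict[str, str],
           unit_to_subsystem: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Return a copy of the record with deployment unit and physical subsystem added."""
    enriched: Dict[str, Optional[str]] = dict(record)
    unit = ap_to_unit.get(record.get("ap", ""))
    enriched["deployment_unit"] = unit
    # A record whose AP is unknown keeps None for both attributes.
    enriched["physical_subsystem"] = unit_to_subsystem.get(unit) if unit else None
    return enriched
```

With these attributes attached to each transaction detail record, the stream computation can group by deployment unit or physical subsystem without querying the database per record.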
Fig. 3 is a flow chart illustrating a method of streaming computation according to fig. 2.
As shown in fig. 3, in one embodiment of the present invention, the streaming calculation method may include: step S301, step S302, step S303, step S304, step S305, step S306, step S307, step S308, step S309, step S310, step S311, step S312, step S313, step S314, step S315, and step S316, which are specifically described below.
In step S301, the baseline is loaded periodically. In an alternative embodiment, the baseline is calculated from data within a predetermined historical period of time in the data store, so the baseline changes continuously over time; because the amount of baseline data is also large, the baseline for the predetermined period of time needs to be loaded on a timer.
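One way to picture timed loading is a small cache that reloads the baseline once a refresh interval has elapsed. This is only a sketch: the class name `BaselineCache`, the `loader` callback, and the injectable `clock` are illustrative assumptions, and the source does not say how the timer is implemented.

```python
import time
from typing import Callable, Dict, Optional


class BaselineCache:
    """Holds baseline values in memory and reloads them after a fixed interval."""

    def __init__(
        self,
        loader: Callable[[], Dict[str, float]],
        refresh_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader            # hypothetical: reads baselines from the data store
        self._refresh_seconds = refresh_seconds
        self._clock = clock
        self._loaded_at: Optional[float] = None
        self._baselines: Dict[str, float] = {}

    def get(self, key: str) -> Optional[float]:
        """Return the baseline for a key, reloading first if the cache is stale."""
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._refresh_seconds:
            self._baselines = self._loader()
            self._loaded_at = now
        return self._baselines.get(key)
```

Loading only the baseline for the current period keeps the in-memory footprint small even when the full baseline data set is large.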
In step S302, transaction detail data is read. In an alternative embodiment, the transaction detail data is read from a message queue of the message middleware.
In step S303, the transaction detail data is parsed. In an alternative embodiment, the transaction detail data is parsed and associated information is supplemented for statistics of individual dimension indicators.
In step S304, it is determined whether the transaction detail data is valid data; if so, step S305 is executed; otherwise, the transaction detail data is discarded.
In step S305, transaction detail data is stored into the message middleware.
In step S306, the business index data is extracted. In an alternative embodiment, the extraction of the business index data is performed based on business rules and the transaction detail information obtained from the transaction detail data.
In step S307, business index statistics are calculated based on the index data. In alternative embodiments, the business index statistics may include, but are not limited to, the following: transaction amount, service success rate, system success rate, average response time, average processing time, long transaction amount, long transaction rate, application performance index.
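To illustrate how several of these statistics could be derived from one window of transactions, here is a minimal sketch. The record keys (`business_ok`, `system_ok`, `response_ms`) and the 3000 ms cutoff for a "long" transaction are assumptions; the source does not define them.

```python
def index_statistics(transactions, long_threshold_ms=3000.0):
    """Compute a few business index statistics for one window of transactions."""
    total = len(transactions)
    if total == 0:
        return {"transaction_amount": 0}
    business_ok = sum(1 for t in transactions if t["business_ok"])
    system_ok = sum(1 for t in transactions if t["system_ok"])
    # Assumed definition: a long transaction exceeds the response-time cutoff.
    long_count = sum(1 for t in transactions if t["response_ms"] > long_threshold_ms)
    return {
        "transaction_amount": total,
        "service_success_rate": business_ok / total,
        "system_success_rate": system_ok / total,
        "average_response_ms": sum(t["response_ms"] for t in transactions) / total,
        "long_transaction_amount": long_count,
        "long_transaction_rate": long_count / total,
    }


window = [
    {"business_ok": True, "system_ok": True, "response_ms": 100.0},
    {"business_ok": True, "system_ok": True, "response_ms": 200.0},
    {"business_ok": True, "system_ok": True, "response_ms": 300.0},
    {"business_ok": False, "system_ok": True, "response_ms": 4000.0},
]
stats = index_statistics(window)
```

In a streaming framework the same aggregation would typically run per time window and per dimension (system, transaction code, AP).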
In step S308, the business index statistics are stored in the message middleware. In an alternative embodiment, the business index statistics stored in the message middleware are ultimately stored in the data store (a distributed full-text search engine), where they are available for historical queries and baseline calculation.
In step S309, the alarm rules are loaded. In an alternative embodiment, various alarm rules are configured through web pages, and may include, but are not limited to: alarm time period, alarm level, whether to send alarm information, alarm suppression times, upper baseline threshold, lower baseline threshold, etc. Optionally, when different systems are to be configured with the same alarm rules, the rules can be managed uniformly as a package, which greatly reduces the configuration workload.
Optionally, the alarm information is generated based on the alarm index, the alarm subsystem and the alarm time when the alarm is triggered.
In step S310, it is determined whether a corresponding index alarm rule exists for the current business index statistics. When a corresponding index alarm rule exists, step S311 is executed; otherwise, the alarm determination for the business index statistics is exited.
In step S311, it is determined whether a corresponding index baseline exists for the current business index statistics. When a corresponding baseline exists, step S312 is executed; otherwise, the alarm determination for the business index statistics is exited.
In step S312, it is determined whether an alarm is triggered based on the index alarm rule and the baseline. When an alarm is triggered, step S313 is executed; otherwise, the alarm determination for the business index statistics is exited.
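The three checks in steps S310 to S312 can be sketched as a single function. Interpreting the upper and lower baseline thresholds as allowed fractional deviations from the baseline (`upper_ratio`, `lower_ratio`) is an assumption made for this example; the source does not fix their form.

```python
def should_alarm(index_name, value, rules, baselines):
    """Return True only if a rule and a baseline exist and the value leaves the band."""
    rule = rules.get(index_name)
    if rule is None:
        return False  # S310: no alarm rule for this index, exit
    baseline = baselines.get(index_name)
    if baseline is None:
        return False  # S311: no baseline for this index, exit
    # S312: compare the value with a band around the baseline (assumed rule form).
    upper = baseline * (1 + rule["upper_ratio"])
    lower = baseline * (1 - rule["lower_ratio"])
    return value > upper or value < lower


rules = {"transaction_amount": {"upper_ratio": 0.5, "lower_ratio": 0.5}}
baselines = {"transaction_amount": 1000.0}
```

With these sample values the band is 500 to 1500, so a transaction amount of 1600 or 400 would trigger an alarm while 1200 would not.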
In step S313, an alarm notification is pushed.
In step S314, it is determined whether root cause analysis is triggered. When root cause analysis is triggered, step S315 is executed; otherwise, the business index statistics are not processed further.
In step S315, root cause analysis is performed. In an alternative embodiment, the cause of the alarm is analyzed and the problem log is located based on the alarm information.
In another alternative embodiment, root cause analysis may include the following steps: first, parsing the alarm information to obtain the alarm object and the alarm time; second, locating the abnormal physical subsystem and machine information; third, searching the logs stored on the distributed file system for the alarm period to obtain the abnormal log information; and finally, recording the positioning result information. In this way, root cause analysis can be triggered automatically, and problems can be located quickly.
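The four steps above can be sketched as follows, with an in-memory list standing in for the stored logs. The alarm fields, the five-minute search window, and the error keywords are all illustrative assumptions rather than details from the source.

```python
from datetime import datetime, timedelta


def analyze_root_cause(alarm, subsystem_machines, logs,
                       window_minutes=5, keywords=("ERROR", "Exception")):
    """Parse the alarm, locate machines, search logs in the window, record the result."""
    # Step 1: obtain the alarm object and the alarm time from the alarm information.
    alarm_time = datetime.strptime(alarm["time"], "%Y-%m-%d %H:%M:%S")
    start = alarm_time - timedelta(minutes=window_minutes)
    # Step 2: locate the machines of the abnormal physical subsystem.
    machines = subsystem_machines.get(alarm["subsystem"], [])
    # Step 3: search logs from those machines within the alarm period.
    abnormal = [
        entry for entry in logs
        if entry["machine"] in machines
        and start <= entry["time"] <= alarm_time
        and any(k in entry["message"] for k in keywords)
    ]
    # Step 4: record the positioning result.
    return {
        "alarm_object": alarm["subsystem"],
        "alarm_time": alarm_time,
        "machines": machines,
        "abnormal_logs": abnormal,
    }


alarm = {"subsystem": "payments", "time": "2020-12-04 10:30:00"}
machines = {"payments": ["ap-01", "ap-02"]}
logs = [
    {"machine": "ap-01", "time": datetime(2020, 12, 4, 10, 28), "message": "ERROR timeout"},
    {"machine": "ap-01", "time": datetime(2020, 12, 4, 9, 0), "message": "ERROR old failure"},
    {"machine": "ap-03", "time": datetime(2020, 12, 4, 10, 29), "message": "ERROR elsewhere"},
    {"machine": "ap-02", "time": datetime(2020, 12, 4, 10, 29), "message": "INFO ok"},
]
result = analyze_root_cause(alarm, machines, logs)
```

Only the first log entry survives all three filters: it comes from a machine in the alarmed subsystem, falls inside the window, and contains an error keyword.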
The present invention provides an example of locating a problem according to the root cause analysis method described above:
When the overview view shows that the service success rate of a certain system is 89%, which is lower than the normal requirement (for example, 99%), the system automatically drills down to the second-level view (the transaction code view), sorts the entries in reverse order by service success rate, and obtains the transaction codes with a low service success rate. It then drills down to the third-level view (the AP view, that is, the view of the virtual machines), again sorts in reverse order by service success rate, and locates the specific APs with a low service success rate. Finally, it locates the problem by searching the logs on the corresponding APs for error keywords.
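The drill-down in this example can be sketched as two rounds of grouping and sorting. The record keys (`code`, `ap`, `ok`) are hypothetical, and the sketch returns only the single worst transaction code and AP, whereas the text describes obtaining all entries with a low rate.

```python
from collections import defaultdict


def success_rates(records, key):
    """Return (rate, group) pairs sorted so the lowest success rate comes first."""
    totals = defaultdict(int)
    oks = defaultdict(int)
    for r in records:
        totals[r[key]] += 1
        oks[r[key]] += 1 if r["ok"] else 0
    return sorted((oks[k] / totals[k], k) for k in totals)


def drill_down(records, required_rate=0.99):
    """Overview view -> transaction code view -> AP view; expects a non-empty list."""
    overall = sum(1 for r in records if r["ok"]) / len(records)
    if overall >= required_rate:
        return None  # the system meets the requirement, no drill-down needed
    _, worst_code = success_rates(records, "code")[0]
    code_records = [r for r in records if r["code"] == worst_code]
    _, worst_ap = success_rates(code_records, "ap")[0]
    return worst_code, worst_ap


records = [
    {"code": "A1", "ap": "ap-01", "ok": True},
    {"code": "A1", "ap": "ap-02", "ok": True},
    {"code": "B2", "ap": "ap-01", "ok": True},
    {"code": "B2", "ap": "ap-02", "ok": False},
    {"code": "B2", "ap": "ap-02", "ok": False},
]
located = drill_down(records)
```

The returned pair identifies where to search the logs for error keywords.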
In step S316, a root cause analysis result is generated based on the positioning result information.
By adopting the method of the embodiment of the invention, whether to trigger the alarm is automatically determined according to the preset alarm rule and the baseline data, and the alarm and the abnormal analysis processing are carried out according to the generated alarm information, so that the alarm and the root cause analysis can be automatically triggered, the analysis and investigation time of operation and maintenance personnel on the problem is reduced, the operation and maintenance personnel can rapidly locate and repair the problem, the serious damage caused by the failure of a service system is avoided, and the high reliability of the system is ensured.
FIG. 4 is an architecture diagram of an alarm processing system according to one embodiment of the invention.
As shown in fig. 4, the alarm processing system includes:
The data collector 410 is configured to obtain the transaction data file according to the collection rule and the collection frequency, and to send the transaction data file to the data forwarder 420.
The data forwarder 420 is configured to receive the transaction data file sent by the data collector 410, perform format verification, data parsing, data filtering and associated information supplementation on the transaction data file to obtain the transaction detail data, and store the transaction detail data in the message middleware 430.
The message middleware 430 is configured to store the transaction detail data sent by the data forwarder 420 and the business index statistics sent by the stream computing module 440.
The stream computing module 440 is configured to read the transaction detail data in the message middleware 430 and obtain the detail information from the transaction detail data. In addition, the stream computing module 440 is further configured to extract index data based on the business rules and the detail information, calculate business index statistics based on the index data, and store the business index statistics in the message middleware 430.
In alternative embodiments, the business index statistics may include, but are not limited to, the following: transaction amount, service success rate, system success rate, average response time, average processing time, long transaction amount, long transaction rate, application performance index. By acquiring statistics for a variety of business indices, the system can be monitored and alarms can be processed in multiple dimensions.
The data storage 450 is configured to store the business index statistics sent by the message middleware 430.
The baseline calculation module 460 is configured to calculate baseline data based on historical business index statistics within a predetermined period of time in the data storage 450. In an alternative embodiment, this calculation may include: removing abnormal data from the historical business index statistics within the predetermined period of time; and calculating the baseline data by applying an averaging algorithm to the historical business index statistics that remain after the abnormal data is removed. Optionally, for robustness of the system, in addition to removing the abnormal data, a fusion of several algorithms may be used to calculate the baseline data. The calculation method needs to be determined according to the business scenario; for example, the average transaction amount over the same time period on each of the past 15 days is calculated and taken as the baseline value of the transaction amount for that time period. Furthermore, since the baseline changes continuously over time and the amount of baseline data is enormous, baseline data within a predetermined period of time needs to be loaded periodically by the baseline loading module 470 when practicing the present invention.
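A minimal sketch of the remove-then-average calculation is shown below. The source does not say how abnormal data is identified; the median absolute deviation (MAD) filter with cutoff `k` used here is a swapped-in technique chosen for illustration, and the sample values are invented.

```python
from statistics import mean, median


def baseline(values, k=3.0):
    """Drop values far from the median (MAD filter, an assumption), then average the rest."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    kept = [v for v in values if abs(v - med) <= k * mad]
    return mean(kept)


# Hypothetical transaction amounts for the same time period on each of the past 15 days.
history = [100, 102, 98, 101, 99, 100, 103, 97, 100, 101, 99, 100, 102, 98, 1000]
baseline_value = baseline(history)
```

Here the spike of 1000 is excluded and the remaining 14 values average to 100, whereas a plain mean over all 15 values would be pulled up to 160.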
The baseline loading module 470 is configured to load baseline data in a predetermined period of time.
The alarm triggering module 480 is configured to automatically determine, for the business index statistics, whether to trigger an alarm based on preset alarm rules and the baseline data, and to generate alarm information and raise an alarm when the alarm is triggered. The upper baseline threshold and the lower baseline threshold are determined from baseline data generated by two algorithms: the algorithm that yields the larger baseline data determines the upper baseline threshold, and the algorithm that yields the smaller baseline data determines the lower baseline threshold.
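The two-algorithm threshold idea can be sketched as follows. The source does not name the two algorithms; the mean and the median are used here purely as stand-ins, and a real deployment would likely widen the band with additional margins.

```python
from statistics import mean, median


def threshold_band(history):
    """Compute two baselines; the smaller gives the lower threshold, the larger the upper."""
    candidates = (mean(history), median(history))  # assumed pair of algorithms
    return min(candidates), max(candidates)


def breaches(value, history):
    """Return True when the value falls outside the band between the two baselines."""
    lower, upper = threshold_band(history)
    return value < lower or value > upper


history = [90, 100, 110, 140]
band = threshold_band(history)
```

For this sample history the median (105) becomes the lower baseline threshold and the mean (110) becomes the upper baseline threshold.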
In alternative embodiments, the preset alert rules include, but are not limited to: alarm time period, alarm level, whether to send alarm information, alarm suppression times, upper baseline threshold value and lower baseline threshold value. By setting various alarm rules, various alarm requirements can be realized.
The root cause analysis module 490 is configured to locate and process the abnormal object based on the alarm information. In an alternative embodiment, locating the abnormal object based on the alarm information may specifically include: acquiring an alarm index and alarm time from the alarm information; locating an abnormal physical machine based on the alarm index; and searching log information according to the alarm time to determine abnormal information.
By adopting the system of the embodiment of the invention, whether to trigger an alarm is automatically determined according to the preset alarm rule and the baseline data, and alarm and abnormal analysis processing is carried out according to the generated alarm information, so that the alarm and root cause analysis can be automatically triggered, the analysis and investigation time of operation and maintenance personnel on the problem is reduced, and the system is rapidly positioned and repaired, so that the stability and the operation efficiency of the system are ensured.
From the above description of embodiments, it will be apparent to those skilled in the art that the present invention may be implemented in software in combination with a hardware platform. With such understanding, all or part of the technical solution of the present invention contributing to the background art may be embodied in the form of a software product, which may be stored in a storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform the methods described in the various embodiments or parts of the embodiments of the present invention.
Correspondingly, an embodiment of the invention also provides a computer-readable storage medium storing computer-readable instructions or programs which, when executed by a processor, cause a computer to perform operations comprising the steps of the alarm processing method according to any one of the foregoing embodiments, which are not described again here. The storage medium may include, for example, optical disks, hard disks, floppy disks, flash memory, magnetic tape, etc.
In addition, the embodiment of the invention also provides a computer device comprising a memory and a processor, wherein the memory is used for storing one or more computer instructions or programs, and the alarm processing method according to any one of the embodiments can be realized when the one or more computer instructions or programs are executed by the processor. The computer device may be, for example, a server, a desktop computer, a notebook computer, or the like.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting thereof; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention. The scope of the invention should therefore be pointed out in the appended claims.

Claims (16)

1. An alarm processing method, characterized in that the alarm processing method comprises:
Loading baseline data periodically for a predetermined period of time;
acquiring a transaction data file according to the acquisition rule and the acquisition frequency;
Carrying out format verification, data analysis, data filtering and associated information supplementation on the transaction data file to obtain transaction detail data;
Storing the transaction detail data in a message middleware;
Reading transaction detail data in the message middleware, and acquiring detail information according to the transaction detail data;
Extracting index data based on business rules and the detail information;
Calculating business index statistics based on the index data;
for the business index statistical data, automatically determining whether to trigger an alarm based on a preset alarm rule and the baseline data, and generating alarm information and giving an alarm when the alarm is triggered;
and positioning and processing the abnormal object based on the alarm information.
2. The alert processing method as recited in claim 1, wherein the alert processing method further comprises:
storing the business index statistical data into the message middleware;
And storing the business index statistical data in the message middleware into a data memory.
3. The alert processing method as recited in claim 2, wherein the alert processing method further comprises:
The baseline data is calculated based on historical traffic indicator statistics in the data store over the predetermined period of time.
4. The alert processing method of claim 1, wherein the traffic indicator statistics include: transaction amount, service success rate, system success rate, average response time, average processing time, long transaction amount, long transaction rate, application performance index.
5. The alert processing method as recited in claim 3 wherein calculating the baseline data based on historical traffic indicator statistics in the data store for the predetermined period of time comprises:
removing abnormal data in the historical business index statistical data in the preset time period;
and calculating the baseline data by adopting an average algorithm for the historical business index statistical data after the abnormal data is removed.
6. The alert processing method as claimed in claim 1, wherein the preset alert rule includes: alarm time period, alarm level, whether to send alarm information, alarm suppression times, upper baseline threshold value and lower baseline threshold value.
7. The alert processing method of claim 1, wherein locating an abnormal object based on the alert information comprises:
Acquiring an alarm index and alarm time from the alarm information;
locating an abnormal physical machine based on the alarm index;
And searching log information according to the alarm time to determine abnormal information.
8. An alarm processing system, the alarm processing system comprising:
The baseline loading module is used for loading baseline data in a preset time period in a timing way;
The data collector is used for acquiring transaction data files according to the collection rules and the collection frequency and sending the transaction data files to the data forwarder;
The data forwarder is used for receiving the transaction data file sent by the data collector, carrying out format verification, data analysis, data filtering and associated information supplementation on the transaction data file to obtain transaction detail data, and storing the transaction detail data into the message middleware;
The stream computing module is used for reading the transaction detail data in the message middleware and acquiring detail information according to the transaction detail data;
The flow computing module is also used for extracting index data based on the business rules and the detail information and computing business index statistical data based on the index data;
the alarm triggering module is used for automatically determining whether to trigger an alarm or not based on a preset alarm rule and the baseline data according to the business index statistical data, and generating alarm information and giving an alarm when the alarm is triggered;
and the root cause analysis module is used for positioning the abnormal object based on the alarm information and processing the abnormal object.
9. The alert processing system of claim 8, wherein the alert processing system further comprises:
the message middleware is used for storing the business index statistical data;
and the data storage is used for storing the business index statistical data sent by the message middleware.
10. The alert processing system of claim 9, wherein the alert processing system further comprises:
And the baseline calculation module is used for calculating the baseline data based on historical business index statistical data in the data storage within the preset time period.
11. The alert processing system of claim 8, wherein the business metric statistics comprise: transaction amount, service success rate, system success rate, average response time, average processing time, long transaction amount, long transaction rate, application performance index.
12. The alert processing system according to claim 10, wherein calculating the baseline data based on historical business metric statistics over the predetermined period of time in the data store comprises:
removing abnormal data in the historical business index statistical data in the preset time period;
and calculating the baseline data by adopting an average algorithm for the historical business index statistical data after the abnormal data is removed.
13. The alert processing system of claim 8, wherein the preset alert rule comprises: alarm time period, alarm level, whether to send alarm information, alarm suppression times, upper baseline threshold value and lower baseline threshold value.
14. The alert processing system of claim 8, wherein locating an abnormal object based on the alert information comprises:
Acquiring an alarm index and alarm time from the alarm information;
locating an abnormal physical machine based on the alarm index;
And searching log information according to the alarm time to determine abnormal information.
15. A computer storage medium storing computer software instructions for execution by a processor to implement the alarm processing method of any one of claims 1-7.
16. A computer device comprising a memory and a processor;
the processor is configured to execute one or more computer instructions to implement the alarm processing method of any of claims 1-7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011401568.6A CN112416724B (en) 2020-12-04 2020-12-04 Alarm processing method, system, computer device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011401568.6A CN112416724B (en) 2020-12-04 2020-12-04 Alarm processing method, system, computer device and storage medium

Publications (2)

Publication Number Publication Date
CN112416724A (en) 2021-02-26
CN112416724B (en) 2024-05-07

Family

ID=74830030

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011401568.6A Active CN112416724B (en) 2020-12-04 2020-12-04 Alarm processing method, system, computer device and storage medium

Country Status (1)

Country Link
CN (1) CN112416724B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105610647A (en) * 2015-12-30 2016-05-25 华为技术有限公司 Service abnormity detection method and server
CN109688188A (en) * 2018-09-07 2019-04-26 平安科技(深圳)有限公司 Monitoring alarm method, apparatus, equipment and computer readable storage medium
CN110275815A (en) * 2019-06-30 2019-09-24 深圳前海微众银行股份有限公司 A kind of system exception alert processing method and device
US10445738B1 (en) * 2018-11-13 2019-10-15 Capital One Services, Llc Detecting a transaction volume anomaly
CN110389989A (en) * 2019-07-15 2019-10-29 阿里巴巴集团控股有限公司 A kind of data processing method, device and equipment
CN111192130A (en) * 2019-12-11 2020-05-22 中国建设银行股份有限公司 Method, system, device and storage medium for determining fault source in transaction monitoring
CN111506478A (en) * 2020-04-17 2020-08-07 上海浩方信息技术有限公司 Method for realizing alarm management control based on artificial intelligence
CN111639011A (en) * 2020-06-11 2020-09-08 支付宝(杭州)信息技术有限公司 Data monitoring method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant