CN115809179A - Alarm method, system, equipment and storage medium based on application performance data - Google Patents

Publication number
CN115809179A
Authority
CN
China
Prior art keywords
alarm
application
application service
fault
performance data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211593180.XA
Other languages
Chinese (zh)
Inventor
张鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hebei Happy Consumption Finance Co ltd
Original Assignee
Hebei Happy Consumption Finance Co ltd
Abstract

The invention provides an alarm method, system, equipment and storage medium based on application performance data. The alarm method comprises: acquiring an alarm monitoring rule to be updated for an application service, an application instance and an application interface, and determining the nodes corresponding to the alarm monitoring rule to be updated; making the alarm monitoring rule to be updated take effect when the number of nodes to be synchronized equals the number of corresponding nodes; and sending alarm information when the number of times that the collected application performance data exceeds the alarm threshold is greater than or equal to the alarm times. By acquiring new alarm monitoring rules in multiple dimensions and making the new rules take effect only after all corresponding nodes have pulled them, the alarm method achieves unified management of multi-dimensional, fine-grained monitoring. Because alarm information is sent promptly when the collected application performance data triggers an alarm rule, automatic, near-real-time active early warning can be achieved.

Description

Alarm method, system, equipment and storage medium based on application performance data
Technical Field
The present invention relates to the field of computer application technologies, and in particular, to an alarm method, system, device, and storage medium based on application performance data.
Background
As companies develop and deploy more and more systems, and as each system usually consists of many different applications, the calling relationships between applications and between systems become complicated, because applications and systems must communicate with one another to meet business requirements. Traditional performance monitoring technology mainly targets a single application; hosts, networks, databases and individual applications are monitored and managed in isolation, so the performance of the system as a whole cannot be assessed and the root cause of a system-wide problem cannot be found. Meanwhile, lacking a complete monitoring system and the ability to locate faults quickly, conventional performance monitoring technology cannot detect abnormal instances and services as soon as they occur, which makes daily troubleshooting difficult.
In addition, the current monitoring technology has the problems of single index, coarse monitoring granularity, poor timeliness of alarm monitoring, incapability of meeting the real-time service monitoring requirement and the like.
Disclosure of Invention
To solve some or all of the above problems in the prior art, embodiments of the present invention provide an alarm method, system, device and storage medium based on application performance data, which can achieve unified management of multi-dimensional, fine-grained monitoring and which send alarm information promptly when the collected application performance data triggers an alarm rule, thereby achieving automatic, near-real-time active early warning and avoiding serious business impact.
According to a first aspect of the present invention, an embodiment of the present invention provides an alarm method based on application performance data, which includes: acquiring an alarm monitoring rule to be updated for an application service, an application instance and an application interface, and determining the nodes corresponding to the alarm monitoring rule to be updated, wherein the alarm monitoring rule includes an alarm threshold and alarm times; updating the number of nodes to be synchronized when at least one of the corresponding nodes pulls the alarm monitoring rule to be updated; making the alarm monitoring rule to be updated take effect when the number of nodes to be synchronized equals the number of corresponding nodes; and sending alarm information when the number of times that the collected application performance data exceeds the alarm threshold is greater than or equal to the alarm times.
According to the embodiment of the invention, a new alarm monitoring rule covering the three dimensions of application service, application instance and application interface is acquired, and the new rule pulled by the nodes takes effect only after all nodes corresponding to the new alarm monitoring rule have obtained it, so that unified management of multi-dimensional, fine-grained monitoring can be achieved. Moreover, alarm information is sent promptly when the collected application performance data triggers an alarm rule, so that automatic, near-real-time active early warning can be achieved and serious business impact avoided.
In some embodiments of the present invention, the alarm method further includes: determining a faulty application service according to the alarm information, and reducing the priority of the faulty application service, adding normal nodes for the faulty application service, or increasing the server resources of the faulty application service.
According to the above embodiment of the present invention, the faulty application service is determined quickly from the alarm information, and the fault influence range can be minimized by reducing the priority of the faulty application service, adding normal nodes for it, or increasing its server resources.
In some embodiments of the present invention, the alarm method further includes: deploying and loading a probe for the application service based on a bytecode enhancement technology; and collecting, by the probe, the application performance data of the application service.
According to the above embodiment of the invention, the probe is loaded through the bytecode enhancement technology, so probe deployment can be completed without code intrusion, and the loaded probe can efficiently collect, in real time, the application performance data of the numerous and complex application services in the system.
In some embodiments of the present invention, the alarm method further includes: when the application service is the initial application service of a call link, generating, by the probe of the initial application service, a tracking parameter of the initial application service, and transmitting the tracking parameter to each application service on the call link in the order in which the application services are called; and generating link tracking information for the call link according to the tracking parameters.
In some embodiments of the present invention, the alarm method further includes: determining the dependency relationships between faulty application services and the fault influence range according to the link tracking information and the alarm information.
According to the above embodiment of the present invention, the probe of the initial application service of the call link generates the tracking parameter and transmits it to each application service on the call link to generate the link tracking information of the call link. Based on the link tracking information, the fault influence range and the dependency relationships between faulty applications can be determined quickly and simply, so that a fault recovery scheme can be formulated quickly and the fault influence range reduced.
According to a second aspect of the present invention, an embodiment of the present invention provides an alarm system based on application performance data, which includes: a rule updating module, configured to acquire an alarm monitoring rule to be updated for an application service, an application instance and an application interface, and to determine the nodes corresponding to the alarm monitoring rule to be updated, wherein the alarm monitoring rule includes an alarm threshold and alarm times; a rule synchronization module, configured to update the number of nodes to be synchronized when at least one of the corresponding nodes pulls the alarm monitoring rule to be updated; a rule validation module, configured to make the alarm monitoring rule to be updated take effect when the number of nodes to be synchronized equals the number of corresponding nodes; and an alarm module, configured to send alarm information when the number of times that the collected application performance data exceeds the alarm threshold is greater than or equal to the alarm times.
According to the embodiment of the invention, a new alarm monitoring rule covering the three dimensions of application service, application instance and application interface is acquired, and the new rule pulled by the nodes takes effect only after all nodes corresponding to the new alarm monitoring rule have obtained it, so that unified management of multi-dimensional, fine-grained monitoring can be achieved. Moreover, alarm information is sent promptly when the collected application performance data triggers an alarm rule, so that automatic, near-real-time active early warning can be achieved and serious business impact avoided.
In some embodiments of the invention, the alarm system further includes: a fault processing module, configured to determine a faulty application service according to the alarm information, and to reduce the priority of the faulty application service, add normal nodes for the faulty application service, or increase the server resources of the faulty application service.
According to the above embodiment of the present invention, the faulty application service is determined quickly from the alarm information, and the fault influence range can be minimized by reducing the priority of the faulty application service, adding normal nodes for it, or increasing its server resources.
In some embodiments of the present invention, the alarm system further comprises a data acquisition module configured to perform the following operations: deploying and loading probes for the application service based on a bytecode augmentation technology; collecting, by the probe, the application performance data for the application service.
According to the embodiment of the invention, the probe is loaded through the byte code enhancement technology, the probe deployment can be completed under the condition of no code intrusion, and the application performance data corresponding to numerous and complex application services in the system can be efficiently collected in real time through the loading probe.
In some embodiments of the invention, the alert system further comprises an information generation module for performing the following operations: when the application service is used as an initial application service of a calling link, a probe of the initial application service generates a tracking parameter of the initial application service, and transmits the tracking parameter to each application service on the calling link according to the calling sequence of the application services in the calling link; and generating link tracking information based on the calling link according to the tracking parameters.
In some embodiments of the present invention, the alarm system further includes a fault tracing module, configured to determine a dependency relationship and a fault influence range between fault application services according to the link tracing information and the alarm information.
According to the above embodiment of the present invention, the probe of the initial application service of the call link generates the tracking parameter and transmits the tracking parameter to each application service on the call link to generate the link tracking information of the call link, and the influence range of the fault and the dependency relationship between the fault applications can be quickly and simply determined based on the link tracking information, so as to quickly establish a fault recovery scheme and reduce the fault influence range.
According to a third aspect of the present invention, the present invention provides a computer-readable storage medium storing computer-readable instructions which, when executed by a processor, cause a computer to perform the steps of the alarm method according to any one of the above embodiments.
According to a fourth aspect of the present invention, the present invention provides a computer device including a memory and a processor, wherein the memory stores computer readable instructions, and the computer readable instructions, when executed by the processor, can implement the alarm method according to any one of the above embodiments.
Therefore, by implementing the alarm method, the system, the equipment and the storage medium based on the application performance data provided by the invention, the new alarm monitoring rules of multiple dimensions are obtained, and the new rules pulled by all the nodes are enabled to take effect after all the nodes corresponding to the new alarm monitoring rules obtain the rules, so that the multi-dimension and fine-grained monitoring unified management can be realized. And moreover, when the collected application performance data triggers an alarm rule, alarm information is sent in time, so that automatic and near-real-time active early warning can be realized, and serious business influence is avoided.
Drawings
FIG. 1 is a flow diagram of an alerting method based on application performance data according to one embodiment of the present invention;
FIG. 2 is a flow diagram illustrating an alerting method based on application performance data in accordance with a further embodiment of the present invention;
FIG. 3 is an architecture diagram of an alert system based on application performance data according to one embodiment of the present invention.
Detailed Description
Various aspects of the invention are described in detail below with reference to the figures and the detailed description. Well-known modules, units and their interconnections, links, communications or operations with each other are not shown or described in detail. Furthermore, the described features, architectures, or functions can be combined in any manner in one or more implementations. It will be understood by those skilled in the art that the various embodiments described below are illustrative only and are not intended to limit the scope of the present invention. It will also be readily understood that the modules or units or processes of the embodiments described herein and illustrated in the figures can be combined and designed in a wide variety of different configurations.
The terms used herein are briefly described below.
ElasticSearch: a distributed, highly scalable, near-real-time search and data analysis engine.
AOP: Aspect-Oriented Programming.
MPSC: Multi-Producer Single-Consumer; a multi-producer, single-consumer queue.
OOM: Out Of Memory; memory overflow.
webhook: a user-defined callback through which a user changes a behavior of an application.
UUID: Universally Unique Identifier.
Fig. 1 is a flowchart illustrating an alarm method based on application performance data according to an embodiment of the present invention.
As shown in fig. 1, in an embodiment of the present invention, the alarm method may include step S11, step S12, step S13 and step S14, which are described in detail below.
In step S11, an alarm monitoring rule to be updated for an application service, an application instance and an application interface is acquired, and the nodes corresponding to the alarm monitoring rule to be updated are determined, where the alarm monitoring rule includes an alarm threshold and alarm times. In some embodiments, the alarm monitoring rule includes, but is not limited to, one or more of the following: alarm index, alarm threshold, operator, alarm period, alarm times, silent time, response time, average response time, interface request success rate, number of calls per minute, and request success rate of an application service instance.
In a further embodiment, the alarm monitoring rule is directed at three dimensions of an application service, an application instance and an application interface, and can make a single personalized rule for different objects with different dimensions, and in addition, different alarm rules can be made for different nodes. Therefore, different alarm monitoring rules can be set for each service, the types and the number of the alarm monitoring rules which can be set for each accessed service are unlimited, the services and the alarm monitoring rules are in a one-to-many relationship, and the alarm rules between the services are not influenced by each other. Meanwhile, differential alarm monitoring rule configuration is carried out on each interface, so that the requirements of different services on monitoring alarm can be met; the alarm monitoring rules of all nodes are formulated aiming at the current service multi-node deployment environment, and the method can adapt to the service quality difference brought by the resource difference of different clusters or different machine rooms.
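The rule model described above can be illustrated with a minimal Python sketch. It is not taken from the patent: the class names `AlarmRule` and `RuleBook`, the field names, the operator symbols and the example IDs are all assumptions chosen for illustration. It only shows one way to express a one-to-many relationship between a monitored object (service, instance or interface) and its rules.

```python
from collections import defaultdict
from dataclasses import dataclass
import operator

# Comparison operators a rule may use; this symbol set is an assumption.
OPERATORS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


@dataclass(frozen=True)
class AlarmRule:
    dimension: str        # "service", "instance" or "interface"
    target_id: str        # ID of the monitored object in that dimension
    metric: str           # alarm index, e.g. "avg_response_time_ms" (hypothetical name)
    op: str               # operator symbol, a key of OPERATORS
    threshold: float      # alarm threshold
    period_minutes: int   # alarm period
    times: int            # alarm times
    silence_minutes: int  # silent time

    def breached(self, value: float) -> bool:
        """Return True when the value violates this rule's threshold."""
        return OPERATORS[self.op](value, self.threshold)


class RuleBook:
    """Keeps any number of rules per (dimension, target) pair."""

    def __init__(self):
        self._rules = defaultdict(list)

    def add(self, rule: AlarmRule) -> None:
        self._rules[(rule.dimension, rule.target_id)].append(rule)

    def rules_for(self, dimension: str, target_id: str) -> list:
        # .get avoids creating empty entries for unknown targets.
        return list(self._rules.get((dimension, target_id), []))
```

Keying the rule book by dimension and target ID keeps the rules of different services independent of one another, which matches the statement that alarm rules of different services do not affect each other.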
In some embodiments, a new alarm monitoring rule configuration to be updated is added on the alarm monitoring rule configuration platform for the application services, application instances and application interfaces to be monitored. Specifically, the application service ID, the application instance ID and the application interface ID are bound to the alarm monitoring rule, the binding is stored in ElasticSearch, and the synchronization state corresponding to the alarm monitoring rule to be updated is set to "updating"; the observable platform deployed on multiple nodes pulls the alarm monitoring rule configuration to be updated from ElasticSearch into memory at a preset period, for example once every 5 minutes.
In step S12, when at least one of the corresponding nodes pulls the alarm monitoring rule to be updated, the number of nodes to be synchronized is updated. In some embodiments, each time one of the nodes corresponding to the alarm monitoring rule to be updated pulls the rule, the number of nodes to be synchronized is incremented by 1.
In step S13, when the number of the nodes to be synchronized is the same as the number of the corresponding nodes, the alarm monitoring rule to be updated takes effect. After all the nodes pull the alarm monitoring rule to be updated to the memory, the alarm monitoring rule to be updated takes effect, so that the problem of inconsistent alarm rules in the multi-node memory can be avoided.
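Steps S12 and S13 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the class name `PendingRule`, the state strings and the node IDs are hypothetical, and the sketch tracks the set of nodes that have pulled the rule rather than a bare counter, so that a node pulling the same rule twice is not counted twice.

```python
class PendingRule:
    """Tracks which nodes have pulled a rule that is waiting to take effect."""

    def __init__(self, rule_id, node_ids):
        self.rule_id = rule_id
        self.expected = frozenset(node_ids)  # nodes corresponding to the rule
        self.pulled = set()                  # nodes that have pulled it so far
        self.state = "updating"              # hypothetical state name

    @property
    def synced_count(self):
        return len(self.pulled)

    def record_pull(self, node_id):
        """Register a pull by one node and return the resulting state."""
        if node_id not in self.expected:
            raise ValueError(f"{node_id} is not a node of rule {self.rule_id}")
        self.pulled.add(node_id)  # a repeated pull by the same node is a no-op
        if self.synced_count == len(self.expected):
            self.state = "effective"  # every node now holds the same rule
        return self.state
```

For a rule bound to three nodes, the state stays "updating" after the first two distinct pulls and becomes "effective" on the third, which is the point at which all in-memory copies agree.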
In step S14, alarm information is sent when the number of times that the collected application performance data exceeds the alarm threshold is greater than or equal to the alarm times. The collected application performance data includes, but is not limited to, one or more of: the response time of an interface call, the average response time of the interface over a certain period, whether the current interface call succeeded, the success rate of the interface over a certain period, the number of calls to the interface over a certain period, and the like. In some embodiments, polling is performed each time a period elapses; if the collected application performance data meets the criteria for sending alarm information, the webhook interface is called asynchronously through a thread pool to send the alarm information. Because the webhook interface is user-defined, the user can implement various alarm channels, such as DingTalk alarms, email alarms and WeCom (enterprise WeChat) alarms, in the designated webhook interface. In a further embodiment, when alarm information needs to be sent, it is first checked whether the rule is in its silent period, and the alarm information is sent only if it is not.
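The asynchronous dispatch and silent-period check can be sketched in Python as below. The sketch makes several assumptions that are not in the patent: the class name `AlarmDispatcher`, the injected `send` callable standing in for the user-defined webhook, the per-rule silent period measured in seconds, and the injectable clock. It also assumes `dispatch` is called from a single polling thread.

```python
from concurrent.futures import ThreadPoolExecutor
import time


class AlarmDispatcher:
    """Sends alarms through a thread pool, skipping rules in their silent period."""

    def __init__(self, send, silence_seconds, max_workers=4, clock=time.monotonic):
        self._send = send                # user-defined webhook callable (hypothetical)
        self._silence = silence_seconds  # length of the silent period
        self._clock = clock              # injectable so the logic can be tested
        self._last_sent = {}             # rule_id -> time the last alarm was sent
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def dispatch(self, rule_id, message):
        """Submit the alarm asynchronously; return None if the rule is silent."""
        now = self._clock()
        last = self._last_sent.get(rule_id)
        if last is not None and now - last < self._silence:
            return None
        self._last_sent[rule_id] = now
        # The webhook runs on a pool thread, so polling is never blocked by it.
        return self._pool.submit(self._send, rule_id, message)

    def close(self):
        self._pool.shutdown(wait=True)
```

Returning the `Future` lets a caller inspect delivery failures if it wishes, while a caller that ignores it gets fire-and-forget behavior.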
In some embodiments, alarms based on the alarm monitoring rule are evaluated using a sliding time window. For example, the time window (1 minute long) slides once every minute, and the probe reports the application performance data of a given interface; a time window is then initialized for the alarm monitoring rule, and whenever the application performance data within the time window exceeds the threshold of the alarm monitoring rule, a hit on the alarm rule is recorded, that is, the number of times the application performance data has exceeded the alarm threshold is incremented by 1. Alarm information is sent when that number is greater than or equal to the alarm times and the rule is not in its silent period.
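A minimal Python sketch of this sliding-window evaluation follows. It rests on an interpretation that the text does not spell out: hits are counted over the most recent windows that make up one alarm period (here `period_windows` one-minute windows). The class name `WindowedRule` and its parameters are hypothetical, the comparison is fixed to "greater than" for brevity, and silent-period handling is omitted.

```python
from collections import deque


class WindowedRule:
    """Counts threshold breaches over the most recent one-minute windows."""

    def __init__(self, threshold, alarm_times, period_windows):
        self.threshold = threshold
        self.alarm_times = alarm_times
        # One boolean per window; the oldest window falls out automatically.
        self._hits = deque(maxlen=period_windows)

    def observe(self, value):
        """Record one window's value; return True when an alarm should fire."""
        self._hits.append(value > self.threshold)
        return sum(self._hits) >= self.alarm_times
```

With a threshold of 500, an alarm-times value of 3 and a period of five windows, an alarm condition is first met on the window in which the third breach is recorded, and it clears once enough breaching windows have slid out of the period.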
With the above alarm method, new alarm monitoring rules for application services, application instances and application interfaces are acquired, and the new rules pulled by the nodes take effect only after all nodes corresponding to the new alarm monitoring rules have obtained them, so that unified management of multi-dimensional, fine-grained monitoring can be achieved. Moreover, alarm information is sent promptly when the collected application performance data triggers an alarm rule, so that automatic, near-real-time active early warning can be achieved and serious business impact avoided.
In a further embodiment, a faulty application service is determined according to the alarm information, and the priority of the faulty application service is reduced, normal nodes are added for the faulty application service, or the server resources of the faulty application service are increased. The fault influence range can thereby be minimized.
In some embodiments, the application performance data is collected by: deploying and loading a probe for the application service based on a bytecode enhancement technology; and collecting, by the probe, the application performance data of the application service. Specifically, based on the Java Agent probe technology, call interception and data collection are implemented by bytecode injection, so that the application code is genuinely not intruded upon. To deploy the probe, it is only necessary, when starting the server, to add the address of the alarm monitoring system (observable platform) and the application name under which the application accesses the alarm monitoring system, which identify the receiver and the producer of the data collected by the probe, respectively. The probe then collects, analyzes and summarizes the application performance data to be monitored, and the summarized data is periodically stored for display on a Web interface so that the user can observe it intuitively. In a further embodiment, application performance data is collected from the services and the cloud-native infrastructure, and the collected data is analyzed, aggregated and visualized, so that the topology between services and endpoints and the performance indexes of each application service, application instance and application interface can be seen.
Because the probe is deployed and loaded non-invasively, a fault in the monitored system does not affect the probe's collection of application performance data, and the performance overhead that the probe imposes on the monitored application is very low (below 5%). Probe deployment can be completed without code intrusion, and the loaded probe can efficiently collect, in real time, the application performance data of the numerous and complex application services in the system.
In some embodiments, the AOP mechanism used in Java Agent plug-in development is implemented based on the template method pattern, which provides effective risk control: even if the implementation logic of a plug-in throws an exception, execution of the accessed application's own logic is not affected. In an optional implementation, the plug-in's logic for collecting application performance data is decoupled from the logic for reporting it by a lightweight lock-free ring queue, so that reporting data does not interfere with collecting data, thereby protecting the application.
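The decoupling idea can be illustrated with a short Python sketch. It deliberately swaps in a different structure: Python's standard `queue.Queue` is a lock-based bounded queue, not the lock-free ring queue named in the text, so only the decoupling behavior is shown. The class name `MetricBuffer` and the drop-on-full policy are assumptions; the policy is chosen so that the collecting side never blocks and memory stays bounded.

```python
import queue


class MetricBuffer:
    """Bounded buffer between the collecting side and the reporting side."""

    def __init__(self, capacity):
        self._q = queue.Queue(maxsize=capacity)
        self.dropped = 0  # approximate count when several producers race

    def offer(self, sample):
        """Called on the application thread; never blocks the application."""
        try:
            self._q.put_nowait(sample)
            return True
        except queue.Full:
            self.dropped += 1  # discard rather than stall the caller
            return False

    def drain(self, max_items):
        """Called on the reporter thread; returns up to max_items samples."""
        batch = []
        while len(batch) < max_items:
            try:
                batch.append(self._q.get_nowait())
            except queue.Empty:
                break
        return batch
```

A reporter thread would call `drain` periodically and ship each batch, so a slow or failing report affects only the reporter and never the thread that runs the application's own logic.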
In some embodiments, the probe's functionality may be extended by loading and managing plug-in classes through a custom class loader, which avoids conflicts with, and pollution of, the accessed system. For example, if the accessed application uses a MySQL database, the probe may load the MySQL plug-in and thereby obtain MySQL usage metrics such as the SQL statements executed and their execution time. Besides database plug-ins, there are also plug-ins for middleware such as web servers, HttpClient, MQ and Redis. Thanks to the custom class loader, the plug-ins loaded by the probe and the middleware used by the application do not affect each other; for example, the MySQL database used by the application is not affected by the MySQL plug-in loaded by the probe.
In a further embodiment, when the application service is used as an initial application service of a call link, a probe of the initial application service generates a tracking parameter of the initial application service and transmits the tracking parameter to each application service on the call link according to the call sequence of the application services in the call link; and generating link tracking information based on the calling link according to the tracking parameters. Furthermore, the dependency relationship and the fault influence range between fault application services are determined according to the link tracking information and the alarm information. Wherein the tracking parameters include, but are not limited to: traceId (tracking ID), application service ID, and link sequence number.
The probe of the initial application service of the calling link generates the tracking parameter and transmits the tracking parameter to each application service on the calling link so as to generate the link tracking information of the calling link, and the influence range of the fault and the dependency relationship between the fault applications can be judged quickly and simply based on the link tracking information, so that the fault recovery scheme can be established quickly, and the fault influence range can be reduced.
A further embodiment of the alarm method based on application performance data provided by the invention generates link tracking information for a call link from tracking parameters, so that the fault influence range and the dependency relationships between faulty applications can be determined quickly and simply from the link tracking information, a fault recovery scheme can be formulated quickly, and the fault influence range can be reduced. As shown in fig. 2, the alarm method based on application performance data of this further embodiment includes the following steps:
Step 1, probes are deployed and loaded for each accessed application (service);
Step 2a, application A calls an interface of application B through an HTTP request;
Step 2b, the probe of application A generates a globally unique traceId, the idA of application A and link sequence number 01, and these link tracking parameters are transmitted to application B when application A calls the interface of application B through the HTTP request. When an application initiates an HTTP request call, the probe generates a globally unique traceId for that call in a manner similar to the snowflake algorithm (32-character UUID + current thread ID + current timestamp + 4-digit random number);
Step 3a, application B calls an interface of application C through an HTTP request;
Step 3b, the probe of application B takes the traceId transmitted by the probe of application A, the idB of application B and link sequence number 02, and transmits these link tracking parameters to application C when application B calls the interface of application C through the HTTP request;
Step 4a, application C calls an interface of application D through an HTTP request;
Step 4b, the probe of application C takes the traceId transmitted by the probe of application B, the idC of application C and link sequence number 03, and transmits these link tracking parameters to application D when application C calls the interface of application D through the HTTP request;
Step 5, application D finishes its code logic and finally returns the data of the HTTP request;
Step 6, the probe of application D uploads the link trace of the whole request's call path (A → B → C → D), namely the traceId, idA and link sequence number 01, idB and link sequence number 02, idC and link sequence number 03, and idD and link sequence number 04, to the storage medium of the monitoring system for storage.
In a further embodiment, when a fault occurs, all applications that the whole HTTP request passed through, and the direction of each call, can be retrieved quickly by the traceId stored in the storage medium, so as to determine the fault influence range. For example, for an identified slow request, the source of the slow request can be found quickly from the link tracking information, the performance problems of all services on the call link can be analyzed, and the influence range of the slow request can be determined.
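The propagation and lookup described above can be sketched in Python as follows. This is an illustration under assumptions rather than the patented probe: the function names, the dot separator in the traceId, the dictionary used as the storage medium, and the choice to carry the accumulated path in the forwarded context are all hypothetical. The traceId combines the four ingredients listed in step 2b, with the UUID rendered as its 32-character hexadecimal form.

```python
import random
import threading
import time
import uuid


def new_trace_id():
    """UUID hex + current thread ID + millisecond timestamp + 4-digit random number."""
    return "{}.{}.{}.{:04d}".format(
        uuid.uuid4().hex, threading.get_ident(),
        int(time.time() * 1000), random.randint(0, 9999))


def probe_handle(app_id, incoming):
    """Probe logic for one application: start a trace or continue the incoming one."""
    if incoming is None:  # initial application service of the call link
        trace_id, path = new_trace_id(), []
    else:
        trace_id, path = incoming["traceId"], list(incoming["path"])
    # Link sequence numbers are 01, 02, ... in call order.
    path.append((app_id, "{:02d}".format(len(path) + 1)))
    return {"traceId": trace_id, "path": path}


def trace_request(app_ids, store):
    """Simulate the chain of HTTP calls; the last probe uploads the whole link."""
    ctx = None
    for app_id in app_ids:
        ctx = probe_handle(app_id, ctx)
    store[ctx["traceId"]] = ctx["path"]
    return ctx["traceId"]


def callers_of(store, trace_id, failed_app):
    """Applications upstream of failed_app on this link, which its fault can affect."""
    ids = [app for app, _ in store[trace_id]]
    return ids[:ids.index(failed_app)]
```

Calling `trace_request(["idA", "idB", "idC", "idD"], store)` stores the path with sequence numbers 01 to 04 under one traceId, and `callers_of(store, trace_id, "idC")` then returns `["idA", "idB"]`, the applications whose requests depend on application C on that link.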
In some embodiments, the alarm method of an embodiment of the present invention further includes the following steps for deleting an alarm monitoring rule: (1) deleting the alarm monitoring rule on the alarm monitoring rule configuration platform according to the actual conditions of the business and the service, and setting the synchronization state corresponding to the alarm monitoring rule to be deleted to "deleted"; (2) each time one of the nodes corresponding to the alarm monitoring rule to be deleted in the observable platform pulls the alarm monitoring rule in the deleted state, incrementing the number of nodes to be synchronized by 1; (3) the observable platform checks the number of nodes to be synchronized, and when that number is the same as the number of nodes corresponding to the alarm monitoring rule to be deleted, the alarm monitoring rule is deleted.
FIG. 3 is an architectural diagram of an alerting system based on application performance data according to one embodiment of the present invention.
As shown in FIG. 3, the alarm system includes:
A rule updating module 310, configured to obtain an alarm monitoring rule to be updated for an application service, an application instance, and an application interface, and to determine the nodes corresponding to the alarm monitoring rule to be updated, where the alarm monitoring rule includes an alarm threshold and alarm times. In some embodiments, the alarm monitoring rules include, but are not limited to, one or more of the following: alarm index, alarm threshold, operator, alarm period, alarm times, silence time, response time, average response time, interface request success rate, number of calls per minute, and request success rate of an application service instance.
In a further embodiment, the alarm monitoring rules cover three dimensions, namely application service, application instance and application interface. A separate, personalized rule can be formulated for each object in each dimension, and different alarm rules can also be formulated for different nodes. Different alarm monitoring rules can therefore be formulated for each service: the types and number of alarm monitoring rules that can be formulated for each accessed service are unlimited, services and alarm monitoring rules are in a one-to-many relationship, and the alarm rules of different services do not affect one another. Meanwhile, configuring differentiated alarm monitoring rules for each interface meets the monitoring and alarm requirements of different businesses, and formulating alarm monitoring rules per node for the current multi-node service deployment environment accommodates the differences in service quality caused by resource differences between clusters or machine rooms.
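One way to picture such a rule is as a record scoped by service, instance, interface and node. The following is a minimal sketch under stated assumptions: the class `AlarmRule`, its field names, the helper `rules_for`, and the convention that an unset scope field means "applies to all" are illustrative choices, not details given in this description.

```python
import operator
from dataclasses import dataclass
from typing import List, Optional

# Comparison operators a rule may use (assumed set).
OPERATORS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


@dataclass(frozen=True)
class AlarmRule:
    service: str                      # application service the rule belongs to
    metric: str                       # alarm index, e.g. "avg_response_ms"
    op: str                           # operator, one of OPERATORS
    threshold: float                  # alarm threshold
    period_s: int                     # alarm period in seconds
    times: int                        # alarm times
    silence_s: int                    # silence time in seconds
    instance: Optional[str] = None    # None means every instance of the service
    interface: Optional[str] = None   # None means every interface
    node: Optional[str] = None        # None means every node

    def applies_to(self, service: str, instance: str, interface: str, node: str) -> bool:
        """True if this rule's scope covers the given target."""
        return (self.service == service
                and self.instance in (None, instance)
                and self.interface in (None, interface)
                and self.node in (None, node))

    def breached(self, value: float) -> bool:
        """True if a single measurement violates the threshold."""
        return OPERATORS[self.op](value, self.threshold)


def rules_for(rules: List[AlarmRule], service: str, instance: str,
              interface: str, node: str) -> List[AlarmRule]:
    """One service can carry many rules; return all that cover the target."""
    return [rule for rule in rules if rule.applies_to(service, instance, interface, node)]
```

Because every rule names its own service, rules of different services never match the same target, which mirrors the statement that alarm rules of different services do not affect one another.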
A rule synchronization module 320, configured to update the number of nodes to be synchronized when at least one node in the corresponding nodes pulls the alarm monitoring rule to be updated. In some embodiments, each time one of the nodes corresponding to the alarm monitoring rule to be updated pulls the rule, the number of nodes to be synchronized is incremented by 1.
A rule validation module 330, configured to validate the alarm monitoring rule to be updated when the number of the nodes to be synchronized is the same as the number of the corresponding nodes.
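The interplay of the synchronization and validation modules can be sketched as a counter that gates activation. This is an illustrative sketch only: `PendingRule` and `on_pull` are hypothetical names, and tracking the set of nodes that have pulled (rather than a bare counter) is an added assumption so that a repeated pull by the same node is not counted twice.

```python
from dataclasses import dataclass, field


@dataclass
class PendingRule:
    """An updated rule waiting for all of its target nodes to pull it."""
    rule_id: str
    target_nodes: frozenset                           # nodes the rule corresponds to
    pulled_nodes: set = field(default_factory=set)    # nodes that have pulled so far
    active: bool = False

    @property
    def synced_count(self) -> int:
        """The 'number of nodes to be synchronized' counter from the text."""
        return len(self.pulled_nodes)

    def on_pull(self, node: str) -> bool:
        """Record a pull by one node; the rule takes effect once every node has pulled."""
        if node not in self.target_nodes:
            raise ValueError(f"{node} is not a target of rule {self.rule_id}")
        self.pulled_nodes.add(node)
        if self.synced_count == len(self.target_nodes):
            self.active = True
        return self.active


rule = PendingRule("rule-1", frozenset({"node-1", "node-2"}))
rule.on_pull("node-1")   # one of two nodes has pulled, rule not yet in effect
rule.on_pull("node-2")   # counts now match, rule takes effect
```

Gating on the full count means no node evaluates the new rule while another is still using the old one, at the cost of delaying activation until the slowest node has pulled.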
A data acquisition module 340, configured to deploy and load a probe for the application service based on a bytecode enhancement technology, and to acquire application performance data of the application service through the probe. Specifically, based on Java Agent probe technology, call interception and data collection are implemented by bytecode injection, so that the application code is genuinely not intruded upon. Deployment of the probe can be completed simply by adding, when the server is started, the address of the alarm monitoring system (the observable platform) and the application name under which the application accesses the alarm monitoring system. The probe collects, analyzes and summarizes the application performance data to be monitored, and the summarized data are finally stored at regular intervals for the user to observe visually on a Web interface. Because the probe is deployed and loaded without intrusion, the acquisition of application performance data by the probe is not affected when the monitored system fails, and the performance loss that the probe imposes on the running application is very low, below 5%. Probe deployment can thus be completed without any code intrusion, and the loaded probe can efficiently acquire, in real time, the application performance data of the numerous and complex application services in the system.
An alarm module 350, configured to send alarm information when the number of times that the acquired application performance data exceeds the alarm threshold is greater than or equal to the alarm times. The application performance data, which are collected in full, include, but are not limited to, one or more of: the response time of an interface call, the average response time of the interface over a certain period of time, whether the current interface call succeeded, the success rate of the interface over a certain period of time, the number of calls to the interface over a certain period of time, and the like. In some embodiments, after each polling period, if the collected application performance data meet the criterion for sending alarm information, a webhook interface is called asynchronously through a thread pool to send the alarm information. Specifically, the webhook interface is defined by the user, so the user can implement various alarm modes, such as DingTalk alarms, mail alarms and WeCom (enterprise WeChat) alarms, in the designated webhook interface. In a further implementation, when alarm information needs to be sent, whether the alarm is currently in its silent period is checked, and the alarm information is sent only if it is not in the silent period.
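The evaluation just described (count the threshold breaches in a period, compare with the alarm times, respect the silent period, then dispatch asynchronously) can be sketched as below. The class `AlarmEvaluator`, its parameters, and the message text are hypothetical; "exceeds" is modeled as a plain greater-than comparison, the current time is passed in explicitly to keep the example deterministic, and `send` stands in for the user-defined webhook call.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Optional, Sequence


class AlarmEvaluator:
    """Evaluates one rule once per alarm period (illustrative helper)."""

    def __init__(self, threshold: float, times: int, silence_s: float,
                 send: Callable[[str], None], pool: ThreadPoolExecutor):
        self.threshold = threshold    # alarm threshold
        self.times = times            # alarm times
        self.silence_s = silence_s    # silence time in seconds
        self.send = send              # stand-in for the user-defined webhook
        self.pool = pool              # thread pool used for asynchronous sending
        self._last_sent: Optional[float] = None

    def evaluate(self, samples: Sequence[float], now: float) -> Optional[Future]:
        """Return a Future for the dispatched alarm, or None if nothing was sent."""
        breaches = sum(1 for value in samples if value > self.threshold)
        if breaches < self.times:
            return None               # not enough breaches in this period
        in_silence = (self._last_sent is not None
                      and now - self._last_sent < self.silence_s)
        if in_silence:
            return None               # an alarm was sent recently, stay quiet
        self._last_sent = now
        message = f"{breaches} samples exceeded {self.threshold}"
        return self.pool.submit(self.send, message)


sent_messages = []
with ThreadPoolExecutor(max_workers=2) as demo_pool:
    evaluator = AlarmEvaluator(500, 3, 300, sent_messages.append, demo_pool)
    evaluator.evaluate([620, 710, 480, 900], now=0.0)
```

Submitting the webhook call to a pool keeps a slow or failing notification endpoint from blocking the next polling cycle, which appears to be the motivation for the asynchronous call in the text.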
A fault processing module 360, configured to determine a fault application service according to the alarm information, reduce the priority of the fault application service, and add a normal node corresponding to the fault application service or increase the amount of server resources of the fault application service. The influence range of the fault can thereby be minimized.
An information generating module 370, configured to, when the application service is an initial application service of a call link, generate a tracking parameter of the initial application service by a probe of the initial application service, and transfer the tracking parameter to each application service on the call link according to an application service call sequence in the call link; and generating link tracking information based on the calling link according to the tracking parameters. In some embodiments, the tracking parameters include, but are not limited to: traceId (tracking ID), application service ID, and link sequence number.
A fault tracking module 380, configured to determine the dependency relationship between the fault application services and the fault influence range according to the link tracking information and the alarm information. The probe of the initial application service of the calling link generates the tracking parameters and passes them to each application service on the calling link, so as to generate the link tracking information of the calling link. Based on the link tracking information, the influence range of a fault and the dependency relationship between the fault applications can be judged quickly and simply, so that a fault recovery scheme can be established quickly and the fault influence range can be reduced.
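How stored link tracking information could support this judgment is sketched below. The description does not define the influence range precisely, so this example assumes it to be the set of upstream applications whose requests pass through the faulty service; the function names and the trace layout (TraceId mapped to the ordered list of applications on the link) are likewise hypothetical.

```python
from typing import Dict, List, Set, Tuple


def dependency_edges(traces: Dict[str, List[str]]) -> Set[Tuple[str, str]]:
    """Derive caller -> callee pairs from the ordered application list of each trace."""
    edges: Set[Tuple[str, str]] = set()
    for path in traces.values():
        edges.update(zip(path, path[1:]))
    return edges


def impact_range(traces: Dict[str, List[str]], faulty: str) -> Set[str]:
    """Applications upstream of the faulty service on any recorded call link."""
    impacted: Set[str] = set()
    for path in traces.values():
        if faulty in path:
            impacted.update(path[:path.index(faulty)])
    return impacted


# Stored link records: TraceId -> applications in link sequence order (made-up data).
stored_traces = {
    "trace-1": ["A", "B", "C", "D"],
    "trace-2": ["E", "C"],
    "trace-3": ["A", "F"],
}
print(impact_range(stored_traces, "C"))
print(dependency_edges(stored_traces))
```

With the faulty service taken from the alarm information, the first call lists the callers that would see degraded responses, and the second gives the call graph used to reason about which fault is the root cause.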
With the alarm system of the embodiment of the invention, new alarm monitoring rules in the three dimensions of application service, application instance and application interface are obtained, and the new rule pulled by the nodes takes effect only after all nodes corresponding to the new alarm monitoring rule have obtained it, so that unified, multi-dimensional and fine-grained monitoring management can be realized. Moreover, alarm information is sent promptly when the collected application performance data trigger an alarm rule, so that automatic, near-real-time active early warning can be realized and serious business impact avoided.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present invention can be implemented by combining software and a hardware platform. With this understanding in mind, all or part of the technical solutions of the present invention that contribute over the background art can be embodied in the form of a software product, which can be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes instructions for causing a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods according to the embodiments or some parts of the embodiments.
Correspondingly, the embodiment of the invention also provides a computer readable storage medium, on which computer readable instructions or a program are stored. When the computer readable instructions or the program are executed by a processor, the computer is caused to perform operations comprising the steps of the alarm method according to any of the above embodiments, which are not repeated here. The storage medium may include, for example, optical disks, hard disks, floppy disks, flash memory, magnetic tape, etc.
In addition, the present invention also provides a computer device including a memory and a processor, where the memory is used for storing one or more computer readable instructions or programs, and when the processor executes the one or more computer readable instructions or programs, the alarm method according to any one of the above embodiments can be implemented. The computer device may be, for example, a server, a desktop computer, a notebook computer, a tablet computer, or the like.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention. Therefore, the protection scope of the present invention should be subject to the claims.

Claims (12)

1. An alarm method based on application performance data, characterized in that the alarm method comprises:
acquiring an alarm monitoring rule to be updated aiming at an application service, an application instance and an application interface and determining a node corresponding to the alarm monitoring rule to be updated, wherein the alarm monitoring rule comprises an alarm threshold value and alarm times;
when at least one node in the corresponding nodes pulls the alarm monitoring rule to be updated, updating the number of nodes to be synchronized;
when the number of the nodes to be synchronized is the same as that of the corresponding nodes, the alarm monitoring rule to be updated takes effect;
and when the number of times that the acquired application performance data exceeds the alarm threshold value is greater than or equal to the alarm number of times, sending alarm information.
2. The alerting method of claim 1 wherein the alerting method further comprises:
and determining the fault application service according to the alarm information, and reducing the priority of the fault application service, increasing a normal node corresponding to the fault application service or increasing the server resource amount of the fault application service.
3. The alerting method of claim 1 wherein the alerting method further comprises:
deploying and loading probes for the application service based on a bytecode enhancement technology;
collecting, by the probe, the application performance data for the application service.
4. The alerting method of claim 3 wherein the alerting method further comprises:
when the application service is used as an initial application service of a calling link, a probe of the initial application service generates a tracking parameter of the initial application service, and transmits the tracking parameter to each application service on the calling link according to the calling sequence of the application services in the calling link; and
and generating link tracking information based on the calling link according to the tracking parameters.
5. The alerting method of claim 4 wherein the alerting method further comprises:
and determining the dependency relationship and the fault influence range between the fault application services according to the link tracking information and the alarm information.
6. An alert system based on application performance data, the alert system comprising:
the rule updating module is used for acquiring an alarm monitoring rule to be updated aiming at an application service, an application instance and an application interface and determining a node corresponding to the alarm monitoring rule to be updated, wherein the alarm monitoring rule comprises an alarm threshold value and alarm times;
the rule synchronization module is used for updating the number of the nodes to be synchronized when at least one node in the corresponding nodes pulls the alarm monitoring rule to be updated;
the rule validation module is used for validating the alarm monitoring rule to be updated when the number of the nodes to be synchronized is the same as that of the corresponding nodes;
and the alarm module is used for sending alarm information when the number of times that the acquired application performance data exceeds the alarm threshold value is greater than or equal to the alarm number of times.
7. The alert system according to claim 6, wherein the alert system further comprises:
and the fault processing module is used for determining a fault application service according to the alarm information, reducing the priority of the fault application service, and increasing a normal node corresponding to the fault application service or increasing the server resource amount of the fault application service.
8. The alert system of claim 6, wherein the alert system further comprises a data acquisition module to perform the following operations:
deploying and loading probes for the application service based on a byte code enhancement technology;
collecting, by the probe, the application performance data for the application service.
9. The alert system of claim 8, wherein the alert system further comprises an information generation module to perform the operations of:
when the application service is used as an initial application service of a calling link, a probe of the initial application service generates a tracking parameter of the initial application service, and transmits the tracking parameter to each application service on the calling link according to the calling sequence of the application services in the calling link; and
and generating link tracking information based on the calling link according to the tracking parameters.
10. The alert system according to claim 9, wherein the alert system further comprises:
and the fault tracking module is used for determining the dependency relationship and the fault influence range between the fault application services according to the link tracking information and the alarm information.
11. A computer readable storage medium storing computer readable instructions, wherein the computer readable instructions are executed by a processor to implement the alerting method of any one of claims 1-5.
12. A computer device comprising a memory and a processor, the memory having stored thereon computer readable instructions, wherein the processor executes the computer readable instructions to implement the alerting method of any one of claims 1-5.
CN202211593180.XA 2022-12-13 2022-12-13 Alarm method, system, equipment and storage medium based on application performance data Pending CN115809179A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211593180.XA CN115809179A (en) 2022-12-13 2022-12-13 Alarm method, system, equipment and storage medium based on application performance data

Publications (1)

Publication Number Publication Date
CN115809179A true CN115809179A (en) 2023-03-17

Family

ID=85485644

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211593180.XA Pending CN115809179A (en) 2022-12-13 2022-12-13 Alarm method, system, equipment and storage medium based on application performance data

Country Status (1)

Country Link
CN (1) CN115809179A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117395132A (en) * 2023-12-13 2024-01-12 江西云眼视界科技股份有限公司 Distributed alarm monitoring method, system, storage medium and electronic equipment
CN117395132B (en) * 2023-12-13 2024-02-20 江西云眼视界科技股份有限公司 Distributed alarm monitoring method, system, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination