US20230259436A1 - Systems and methods for monitoring application health in a distributed architecture - Google Patents
- Publication number
- US20230259436A1 (U.S. application Ser. No. 18/139,101)
- Authority
- US
- United States
- Prior art keywords
- error
- system components
- health
- computing device
- components
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/3006—Monitoring arrangements specially adapted to the computing system or computing system component being monitored, where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
- G06F11/0751—Error or fault detection not based on redundancy
- G06F11/0772—Means for error signaling, e.g. using interrupts, exception flags, dedicated error registers
- G06F11/079—Root cause analysis, i.e. error or fault diagnosis
- G06F11/3055—Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available
- G06F11/323—Visualisation of programs or trace data
- G06F11/327—Alarm or error message display
- G06F11/3476—Data logging
- G06F2201/875—Monitoring of systems including the internet
Definitions
- FIG. 4 is a schematic block diagram showing an example communication between the computing device comprising a plurality of interconnected system components A, B, and C and the diagnostic server components comprising the automatic analyzer and the data store in FIG. 1 in accordance with one or more aspects of the present disclosure.
- FIG. 5 is a diagram illustrating example health logs received from different system components and actions taken at the automatic analyzer in accordance with one or more aspects of the present disclosure.
- FIG. 6 is a diagram illustrating an example diagnostic results alert in accordance with one or more aspects of the present disclosure.
- FIG. 7 is a diagram illustrating a typical flow of communication for the health monitoring performed by the automatic analyzer for providing deep diagnostic analytics in accordance with one or more aspects of the present disclosure.
- FIG. 1 is a diagram illustrating an example computer network 100 in which a diagnostics server 108 is configured for providing unified deep diagnostics of distributed system components and, particularly, error characterization analysis for the distributed components of one or more computing device(s) 102 communicating across a communication network 106.
- the diagnostics server 108 is configured to receive an aggregate health log including communication health logs 107 (individually 107 A, 107 B . . . 107 N) from each of the system components, collectively shown as system components 104 (individually shown as system components 104 A- 104 N) such as API components in a standard format.
- the communication health logs 107 may be linked for example via a common key tracing identifier that may show that a particular transaction involved components A, B, and C and the types of events or messages communicated for the transaction, by way of example.
- the common identifier comprises key metadata that interconnects related messages (e.g. via an entity or function role).
- the common tracing identifier may link parties effecting a particular financial transaction.
- the common tracing identifier (e.g. traceability ID 506 in FIG. 5 ) may further be modified each time it is processed or otherwise passes through one of the components 104 to also facilitate identifying a path taken by a message when communicated between the components 104 during performing a particular function (e.g. effecting a transaction).
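- As an illustration of how such a path-accumulating identifier could work, the following sketch (hypothetical Python, not part of the disclosure) builds a transaction-level trace ID and appends a component marker at each hop so the route can later be read back; the ID layout and function names are assumptions.

```python
# Illustrative sketch only: the patent does not specify how the traceability ID
# is composed. Here, a transaction-level ID is assumed to accumulate a short
# component marker at each hop, so the path taken can be read back from the ID.

import uuid


def new_trace_id() -> str:
    """Create a transaction-level traceability ID at the first component."""
    return uuid.uuid4().hex[:12]


def stamp_trace_id(trace_id: str, component: str) -> str:
    """Append a component marker as the message passes through that component."""
    return f"{trace_id}/{component}"


def path_from_trace_id(trace_id: str) -> list[str]:
    """Recover the component path from an accumulated traceability ID."""
    return trace_id.split("/")[1:]


trace = new_trace_id()                 # e.g. 'a1b2c3d4e5f6'
trace = stamp_trace_id(trace, "104A")  # message handled by component 104A
trace = stamp_trace_id(trace, "104B")  # then forwarded to component 104B
assert path_from_trace_id(trace) == ["104A", "104B"]
```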
- the computing device(s) 102 each comprise at least a processor 103 , a memory 105 (e.g. a storage device, etc.) and one or more distributed system components 104 .
- the memory 105 stores instructions which, when executed, configure the computing device(s) 102 to perform the operations described herein.
- the distributed system components 104 may be configured (e.g. via the instructions stored in the memory 105 ) to provide the distributed architecture system described herein for collaborating together to provide a common goal such as access to resources on the computing devices 102 ; or access to communication services provided by the computing device 102 ; or performing one or more tasks in a distributed manner such that the computing nodes work together to provide the desired task functionality.
- the distributed system components 104 may comprise distributed applications such as application programming interfaces (APIs), user interfaces, etc.
- such a distributed architecture system provided by the computing device(s) 102 includes the components 104 being provided on different platforms (e.g. correspondingly different machines such that there are at least two computing devices 102 each containing some of the components 104 ) so that a plurality of the components (e.g. 104 A . . . 104 N) can cooperate with one another over the communication network 106 in order to achieve a specific objective or goal (e.g. completing a transaction or performing a financial trade).
- the computing device(s) 102 may be one or more distributed servers for various functionalities such as provided in a trading platform.
- Another example of the distributed system provided by computing device(s) 102 may be a client/server model. In this aspect no single computer in the system carries the entire load on system resources but rather the collaborating computers (e.g. at least two computing devices 102 ) execute jobs in one or more remote locations.
- the distributed architecture system provided by the computing device(s) 102 may be, more generally, a collection of autonomous computing elements (e.g. hardware devices and/or software processes such as system components 104) that appear to users as a single coherent system.
- the computing elements (e.g. independent machines or independent software processes) communicate and coordinate with one another over a common communication network (e.g. network 106).
- a single computing device 102 is shown in FIG. 1 with distributed computing elements provided by system components 104 A- 104 N which reside on the single computing device 102 ; alternatively, a plurality of computing devices 102 connected across the communication network 106 in the network 100 may be provided with the components 104 spread across the computing devices 102 to collaborate and perform the distributed functionality via multiple computing devices 102 .
- the communications network 106 is thus coupled for communication with a plurality of computing devices. It is understood that communication network 106 is simplified for illustrative purposes. Communication network 106 may comprise additional networks coupled to a wide area network (WAN), such as a wireless network and/or local area network (LAN), between the WAN and the computing devices 102 and/or diagnostics server 108.
- the diagnostics server 108 further retrieves network infrastructure information 111 for the system components 104 (e.g. may be stored on the data store 116 , or directly provided from the computing devices 102 hosting the system components 104 ).
- the network infrastructure information 111 may characterize various types of relationships between the system components and/or communication connectivity information for the system components. For example, this may include dependency relationships, such as operational dependencies or communication dependencies between the system components 104 A- 104 N for determining the health of the system and tracing an error in the system to its source.
- the operational dependencies may include for example, whether a system component 104 requires another component to call upon or otherwise involve in order to perform system functionalities (e.g. performing a financial transaction may require component A to call upon functionalities of components B and N).
- the communication dependencies may include information about which components 104 are able to communicate with and/or receive information from one another (e.g. have wired or wireless communication links connecting them).
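- One way to picture the network infrastructure information 111 is as a dependency graph. The sketch below (illustrative Python; the dict-based format and component labels are assumptions, not the claimed data model) records which components call which others and derives the set of components transitively affected when a given component originates an error.

```python
# Hedged sketch: one possible representation of the network infrastructure
# information 111 as an operational dependency graph, plus a traversal that
# finds every component impacted when a given component originates an error.

from collections import deque

# "A depends on B" means A calls upon B to perform its operations,
# so an error originating in B can impact A.
operational_dependencies = {
    "104A": ["104B", "104N"],   # component A calls B and N
    "104B": ["104C"],
    "104C": [],
    "104N": [],
}


def affected_dependents(origin: str, deps: dict[str, list[str]]) -> set[str]:
    """Return every component that directly or transitively depends on `origin`."""
    # Invert the graph: for each component, record who depends on it.
    dependents: dict[str, set[str]] = {c: set() for c in deps}
    for component, callees in deps.items():
        for callee in callees:
            dependents.setdefault(callee, set()).add(component)

    affected, queue = set(), deque([origin])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, set()):
            if parent not in affected:
                affected.add(parent)
                queue.append(parent)
    return affected


# If 104C originates the error, both 104B and (transitively) 104A are impacted.
assert affected_dependents("104C", operational_dependencies) == {"104B", "104A"}
```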
- the diagnostic server 108 comprises an automatic analyzer module 214 communicating with the data store 116 as will be further described with respect to FIG. 2 .
- the automatic analyzer module 214 receives aggregate health logs 107 for each of the components 104A . . . 104N associated with a particular task or job (e.g. accessing a resource provided by components 104) as well as network infrastructure information 111, and is then configured to determine a root cause of the error by identifying the particular system component (e.g. 104A) which originated the error in the system.
- the automatic analyzer module 214 may be triggered to detect the source of an error upon monitoring system behaviors and determining that an error has occurred in the network 100 .
- Such a determination may be made by applying a set of monitoring rules 109 via the automatic analyzer module 214 which are based on historical error patterns for the system components 104 and associated traffic patterns thereby allowing deeper understanding of the error (e.g. API connection error) and the expected operational resolution.
- the monitoring rules 109 may be used by the automatic analyzer module 214 to map a historical error pattern (e.g. communications between components 104 following a specific traffic pattern as may be predicted by a machine learning module in FIG. 2 ) to a specific error type.
- the health monitoring rules 109 may indicate data integrity metadata indicating a format and/or content of messages communicated between components 104 . In this way, when the messages differ from the data integrity metadata, then the automatic analyzer module 214 may indicate (e.g. via a display on the diagnostics server 108 or computing device 102 ) that the error relates to data integrity deviations.
- the automatic analyzer module 214 may use the network infrastructure information 111 and the monitoring rules 109 (mapping error patterns to additional metadata characterizing the error) to identify the error, its root cause (e.g. via the relationship information in the network infrastructure information 111 ) and the dependency impact including other system components 104 affected by the error and having a relationship to the error originating system component.
- the network 100 utilizes a holistic approach to health monitoring by providing an automatic analyzer 214 coupled to all of the system components 104 (e.g. APIs) via the network 106 for analyzing the health of the system components 104 as a whole and individually.
- when an error occurs (e.g. an API fails to perform an expected function or a timeout occurs), the error may be tracked and its origin located.
- the health logs 107 are converted to and/or provided in a standardized format (e.g. JSON format) from each of the system components 104 .
- the standardized format may further include a smart log pattern which can reveal functional dependencies between the system components 104, and key metadata which interconnects messages for a particular task or job (e.g. customer identification).
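- A hypothetical health log entry in such a standardized JSON format might look as follows; the exact field names are assumptions, though the kinds of metadata (timestamp, traceability identifier, customer identifier, event information) mirror those named elsewhere in the disclosure.

```python
# Hypothetical example of a health log entry in the standardized JSON format.
# Field names (timestamp, traceability_id, customer_id, component, event, status)
# are illustrative assumptions; the disclosure only requires a common log pattern
# with key metadata such as a customer or traceability identifier.

import json

health_log_entry = {
    "timestamp": "2020-07-10T14:32:07.123Z",
    "traceability_id": "a1b2c3d4e5f6",   # common key linking the transaction
    "customer_id": "CUST-0042",          # key metadata interconnecting messages
    "component": "104B",                 # API component emitting the log
    "event": "call_downstream",
    "target": "104C",
    "status": "error",
    "detail": "timeout waiting for response",
}

print(json.dumps(health_log_entry, indent=2))
```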
- the diagnostics server 108 is thus configured to receive the health logs 107 in a standardized format as well as receiving information about the network infrastructure (e.g. relationships and dependencies between the system components) from a data store to determine whether a detected system error follows a specific system error pattern and therefore the dependency impact of the error on related system components 104 .
- FIG. 2 is a diagram illustrating in schematic form an example computing device (e.g. diagnostics server 108 of FIG. 1 ), in accordance with one or more aspects of the present disclosure.
- the diagnostics server 108 facilitates providing a system to perform health monitoring of distributed architecture components (e.g. APIs) as a whole using health logs (e.g. API logs) and network architecture information defining relationships for the distributed architecture components.
- the system may further capture key metadata (e.g. key identifiers such as digital identification number of a transaction across an institution among various distributed components) to track messages communicated between the components and facilitate determining the route taken by the message when an error was generated.
- the diagnostics server 108 is configured to utilize at least the health logs and the network architecture information to determine a root cause of an error generated in the overall system.
- Diagnostics server 108 comprises one or more processors 202, one or more input devices 204, one or more communication units 206 and one or more output devices 208. Diagnostics server 108 also includes one or more storage devices 210 storing one or more modules such as automatic analyzer module 214; data integrity validation module 216; infrastructure validation module 218; machine learning module 220; alert module 222; and a data store 116 for storing data comprising health logs 107, monitoring rules 109, and network infrastructure information 111.
- Communication channels 224 may couple each of the components 116 , 202 , 204 , 206 , 208 , 210 , 214 , 216 and 218 for inter-component communications, whether communicatively, physically and/or operatively.
- communication channels 224 may include a system bus, a network connection, an inter-process communication data structure, or any other method for communicating data.
- processors 202 may implement functionality and/or execute instructions within diagnostics server 108 .
- processors 202 may be configured to receive instructions and/or data from storage devices 210 to execute the functionality of the modules shown in FIG. 2 , among others (e.g. operating system, applications, etc.)
- Diagnostics server 108 may store data/information to storage devices 210 such as health logs 107 ; monitoring rules 109 and network infrastructure info 111 .
- One or more communication units 206 may communicate with external devices via one or more networks (e.g. communication network 106 ) by transmitting and/or receiving network signals on the one or more networks.
- the communication units may include various antennae and/or network interface cards, etc. for wireless and/or wired communications.
- Input and output devices may include any of one or more buttons, switches, pointing devices, cameras, a keyboard, a microphone, one or more sensors (e.g. biometric, etc.), a speaker, a bell, one or more lights, etc.
- One or more of same may be coupled via a universal serial bus (USB) or other communication channel (e.g. 224 ).
- the one or more storage devices 210 may store instructions and/or data for processing during operation of diagnostics server 108 .
- the one or more storage devices may take different forms and/or configurations, for example, as short-term memory or long-term memory.
- Storage devices 210 may be configured for short-term storage of information as volatile memory, which does not retain stored contents when power is removed.
- Volatile memory examples include random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), etc.
- Storage devices 210 in some examples, also include one or more computer-readable storage media, for example, to store larger amounts of information than volatile memory and/or to store such information for long term, retaining information when power is removed.
- Non-volatile memory examples include magnetic hard discs, optical discs, floppy discs, flash memories, or forms of electrically programmable memory (EPROM) or electrically erasable and programmable (EEPROM) memory.
- automatic analyzer module 214 may comprise an application which monitors communications between system components 104 and monitors for an alert indicating an error in the communications.
- the automatic analyzer module 214 receives an input indicating a health log (e.g. 107 A . . . 107 N) from each of the system components 104 together defining an aggregate health log 107 and the network infrastructure information 111 defining relationships including interdependencies for connectivity and/or operation and/or communication between the system components. Based on this, the automatic analyzer module 214 automatically determines a particular component of the system components originating the error and associated dependent components from the system components affected.
- this may include the automatic analyzer module 214 using the standardized format of messages in the health logs 107 to capture key identifiers (e.g. connection identifiers, message identifiers, etc.) linking a particular task to the messages and depicting a route travelled by the messages and applying the network infrastructure information 111 to the health logs to reveal a source of the error and the dependency impact.
- the automatic analyzer module 214 further accesses a set of monitoring rules 109 which may associate specific types of messages or traffic flows indicated in the health logs with specific system error patterns and typical dependency impacts (e.g. for a particular type of error X, system components A, B, and C would be affected).
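- A minimal sketch of such rules, assuming they can be expressed as data that pairs an observed traffic flow with an error type and its typical dependency impact (the rule format and values are illustrative, not prescribed by the disclosure):

```python
# Sketch of monitoring rules 109 expressed as data: each rule maps an observed
# traffic flow (an ordered sequence of component hops) to an error type and the
# components typically impacted. The concrete rule format is an assumption.

monitoring_rules = [
    {
        "flow": ("104A", "104B", "104C"),
        "error_type": "downstream timeout",
        "typical_impact": ["104A", "104B"],
        "suggested_resolution": "restart connection pool on 104C",
    },
    {
        "flow": ("104A", "104B"),
        "error_type": "schema mismatch",
        "typical_impact": ["104A"],
        "suggested_resolution": "verify message format against data integrity rules",
    },
]


def match_rule(observed_flow: tuple[str, ...]) -> dict | None:
    """Return the first rule whose flow matches the observed traffic pattern."""
    for rule in monitoring_rules:
        if rule["flow"] == observed_flow:
            return rule
    return None


rule = match_rule(("104A", "104B", "104C"))
if rule:
    print(f"Error pattern recognized: {rule['error_type']}, impact {rule['typical_impact']}")
```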
- the machine learning module 220 may be configured to track communication flows between components 104 and usage/error patterns of the components 104 from a past time period to the current time period, and to help predict the presence of an error and its characteristics.
- the machine learning module 220 may generate a mapping table between specific error patterns in the messages communicated between the components 104 and corresponding information characterizing the error including error type, possible dependencies and expected operational resolution.
- the machine learning module 220 may utilize machine learning models such as regression techniques or convolutional neural networks, etc. to proactively predict additional error patterns and associated details based on historical usage data.
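- As one possible (assumed) realization of such a model, the sketch below trains a scikit-learn decision tree on featurized historical traffic flows to predict an error type for a newly observed flow; the features, labels and library choice are illustrative only.

```python
# One possible realization (an assumption, not the claimed design): train a
# decision tree on featurized historical traffic flows to predict the error type
# for a newly observed flow. Requires scikit-learn.

from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

# Historical flows featurized as hop counts per component (illustrative data).
historical_flows = [
    {"104A": 1, "104B": 1, "104C": 1},
    {"104A": 1, "104B": 1},
    {"104A": 1, "104B": 2, "104C": 1},
    {"104A": 1, "104B": 1},
]
historical_error_types = ["downstream timeout", "schema mismatch",
                          "downstream timeout", "schema mismatch"]

vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform(historical_flows)
model = DecisionTreeClassifier().fit(X, historical_error_types)

# Predict the likely error type for a newly observed traffic pattern.
new_flow = vectorizer.transform([{"104A": 1, "104B": 2, "104C": 1}])
print(model.predict(new_flow)[0])   # expected: 'downstream timeout'
```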
- the machine learning module 220 cooperates with the automatic analyzer module 214 for proactively determining that an error exists and characterizing the error.
- Data integrity validation module 216 may be configured to retrieve a set of predefined data integrity rules provided in the monitoring rules 109 to determine whether the data in the health logs 107 satisfies the data integrity rules (e.g. format and/or content of messages in the health logs 107 ).
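- A minimal sketch of such a data-integrity check, assuming the rules specify required fields and types per message event (the rule structure and field names are illustrative assumptions):

```python
# Minimal sketch of the data-integrity check: verify that each health log entry
# carries the fields (and field types) that the pre-defined rules expect for a
# given communication. The rule structure below is an illustrative assumption.

data_integrity_rules = {
    # message event -> required fields and their expected Python types
    "call_downstream": {"traceability_id": str, "component": str, "target": str},
    "response": {"traceability_id": str, "component": str, "status": str},
}


def violates_integrity(entry: dict) -> list[str]:
    """Return a list of integrity violations for one health log entry."""
    rules = data_integrity_rules.get(entry.get("event", ""), {})
    problems = []
    for field, expected_type in rules.items():
        if field not in entry:
            problems.append(f"missing field '{field}'")
        elif not isinstance(entry[field], expected_type):
            problems.append(f"field '{field}' has unexpected type")
    return problems


entry = {"event": "call_downstream", "traceability_id": "a1b2c3", "component": "104A"}
print(violates_integrity(entry))   # -> ["missing field 'target'"]
```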
- Infrastructure validation module 218 may be configured to retrieve a set of predefined network infrastructure rules (e.g. for a particular task) based on information determined from the health logs 107 and determine whether the data in the network infrastructure info 111 satisfies the predefined rules 109 .
- Alert module 222 may comprise a user interface located on the server 108, or may control an external user interface (e.g. via the communication units 206), to display the error detected by the server 108 and characterizing information (e.g. the source of the error, dependency impacts, and possible operational solutions) to assist with the resolution of the error.
- An example of such an alert is shown in FIG. 6.
- FIG. 3 is a flow chart of operations 300 which are performed by a computing device such as the diagnostics server 108 shown in FIGS. 1 and 2.
- the computing device may comprise a processor and a communications unit configured to communicate with distributed system application components such as API components to monitor the application health of the system components and to determine the source of an error for subsequent resolution.
- the computing device (e.g. the diagnostics server 108) is configured to utilize instructions (stored in a non-transient storage device), which when executed by the processor configure the computing device to perform operations such as operations 300.
- At 302, operations of the computing device track communication between the system components (e.g. components 104) in a distributed system and monitor for an alert indicating an error in the communication in the distributed computer system.
- monitoring for the alert includes applying monitoring rules to the communication to proactively detect errors in the distributed system by monitoring for the communication between the components matching a specific error pattern.
- the computing device may further be configured to obtain the monitoring rules which include data integrity information for each of the types of communications between the system components. The monitoring rules may be used to verify whether the health logs comply with the data integrity information (e.g. to determine whether the data being communicated or otherwise transacted is consistent and accurate over the lifecycle of a particular task).
- the health monitoring rules may further be defined based on historical error patterns for the communications in the distributed computer system. That is, a pattern set of pre-defined communication traffic flows for messages between the system components which may occur in each of the health logs may be mapped to particular error types. Thus, when a defined communication traffic flow is detected, it may be mapped to a particular error pattern thereby allowing further characterization of the error by error type including possible resolution.
- Operations 304 - 308 of the computing device are triggered in response to detecting the presence of the error.
- At 304, operations of the computing device receive a health log from each of the system components (e.g. 104A-104N, collectively 104), together defining an aggregate health log.
- the health logs may be in a standardized format (e.g. JSON format) and utilize common key identifiers (e.g. connection identifier, digital identifier of a transaction, etc.). This allows consistency in the information communicated and tracking of the messages such that it can be used to determine a context of the messages and mapped to capture the key identifiers across the distributed components.
- the common key identifiers are used by the computing device for tracing a route of the messages communicated between the distributed system components and particularly, for a transaction having the error.
- the health logs may follow a particular log pattern with one or more metadata (e.g. customer identification number, traceability identification number, timestamp, event information, etc.) which allows tracking and identification of messages communicated with the distributed system components.
- At 306, operations of the computing device receive, from a data store of the computing device, network infrastructure information defining one or more relationships for connectivity and communication flow between the system components.
- the relationships characterize dependency information between the system components.
- the network infrastructure information may indicate for example, how the components are connected to one another and for a set of defined operations, how they are dependent upon and utilize resources of another component in order to perform the defined operation.
- At 308, operations of the computing device automatically determine, based at least on the aggregate health log and the network infrastructure information, a particular component of the system components originating the error and the associated dependent components affected.
- automatically determining the origination of the error in a distributed component system includes comparing each of the health logs to the other health logs in response to the relationships in the network infrastructure information and may include mapping the information to predefined patterns for the logs to determine where the deviations from the expected communications may have occurred.
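- For illustration, a hedged sketch of this determination: given the aggregate health log and the traced call order derived from the network infrastructure information, pick the most downstream component that itself reported an error. This heuristic is one plausible reading, not the claimed algorithm, and all names below are assumptions.

```python
# Hedged sketch of operation 308: combine the aggregate health log with the
# dependency relationships to pick the error-originating component. The
# heuristic used here (the most-downstream component that itself reported an
# error) is one plausible interpretation, not the claimed algorithm.

def find_origin(aggregate_log: list[dict], call_order: list[str]) -> str | None:
    """Return the deepest component in the traced call path that logged an error."""
    erroring = {e["component"] for e in aggregate_log if e.get("status") == "error"}
    for component in reversed(call_order):       # walk from most downstream back
        if component in erroring:
            return component
    return None


aggregate_log = [
    {"component": "104A", "status": "error"},    # surfaced the error to the UI
    {"component": "104B", "status": "error"},
    {"component": "104C", "status": "ok"},
]
call_order = ["104A", "104B", "104C"]            # from network infrastructure info

print(find_origin(aggregate_log, call_order))    # -> '104B', the likely origin
```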
- Referring to FIG. 4, shown is an example scenario for the flow of messages between distributed system components located both internal to an organization (e.g. on a private network) and remote to the organization (e.g. outside the private network).
- FIG. 4 further illustrates monitoring of health of the distributed components including error source origination detection for an error occurring in the message.
- flow of messages may occur between internal system components 104A-104C located on a first computing device (e.g. computing device 102 of FIG. 1) and component 104D of an external computing device (e.g. a second computing device 102′) located outside the institution served by system components 104A-104C.
- Other variations of distributions of the system components on computing devices may be envisaged.
- each system component 104 A- 104 D may reside on distinct computing devices altogether.
- the path of a message is shown as travelling across link 401 A to 401 B to 401 C.
- the automatic analyzer module 214 initially receives a set of API logs (e.g. aggregate health logs 107 A- 107 C characterizing message activity for system components 104 A- 104 D, events and/or errors communicated across links 401 A- 401 C) in a standardized format.
- the standardized format may be JSON and may include one or more key identifiers that link together the API logs as being related to a task or operation.
- FIG. 5 illustrates example API logs 501 - 503 (a type of health logs 107 A- 107 C) which may be communicated between system components 104 such as system components 104 A- 104 D of FIG. 4 .
- each API log from an API system component 104 would include API event information such as interactions with the API, including calls or requests and their content.
- the API logs further include a timestamp 504 indicating a time of message and a traceability ID 506 which allows tracking a message path from one API to another (e.g. as shown in API logs 501 - 503 ).
- a message sent from a first API to a second API would have the same traceability ID (or at least a common portion in the traceability ID 506 ) with different timestamps 504 .
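- For example, grouping API log entries by traceability ID and ordering them by timestamp recovers the path a message took between APIs; the sketch below assumes the hypothetical JSON field names used earlier, which are not prescribed by the patent.

```python
# Illustrative sketch: group API log entries by traceability ID and order them by
# timestamp to recover the path a message took between APIs. Field names follow
# the hypothetical JSON log example above and are assumptions.

from collections import defaultdict
from datetime import datetime


def trace_paths(api_logs: list[dict]) -> dict[str, list[str]]:
    """Map each traceability ID to the ordered list of components it passed through."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for entry in api_logs:
        grouped[entry["traceability_id"]].append(entry)

    paths = {}
    for trace_id, entries in grouped.items():
        entries.sort(key=lambda e: datetime.fromisoformat(e["timestamp"]))
        paths[trace_id] = [e["component"] for e in entries]
    return paths


logs = [
    {"traceability_id": "t-1", "timestamp": "2020-07-10T14:32:07", "component": "104A"},
    {"traceability_id": "t-1", "timestamp": "2020-07-10T14:32:09", "component": "104C"},
    {"traceability_id": "t-1", "timestamp": "2020-07-10T14:32:08", "component": "104B"},
]
print(trace_paths(logs))   # -> {'t-1': ['104A', '104B', '104C']}
```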
- the API logs 501 - 503 for all of the system components are reviewed at the automatic analyzer module 214 .
- the automatic analyzer module 214 receives network infrastructure info 111 metadata which defines relationships between the various API components 104 in the system including which component systems are dependent on others for each pre-defined type of action (e.g. message communication, performing a particular task, accessing a resource, etc.).
- the automatic analyzer module 214 may retrieve from a data store 116 , a set of health monitoring rules 109 which can define historical error patterns (e.g. an error of type X typically follows a path from API 1 to API 2 ) to recognize and diagnose errors.
- a set of health monitoring rules 109 may map a traffic pattern between the API logs (e.g. API logs 501 - 503 ) to a particular type of error.
- the automatic analyzer module 214 utilizes the aggregate API logs 107 A- 107 C (e.g. received from each of the system components having the same traceability ID), the network infrastructure information 111 and the monitoring rules 109 to determine which of the system components originated the error, characterizations of the error (e.g. based on historical error patterns) and associated dependent components directly affected by the error 507 .
- the disclosed method and system allows diagnosis of health of application data communicated between APIs and locating the errors for subsequent analysis, in one or more aspects.
- the system may provide the diagnostic results as an alert to a user interface.
- the user interface may be associated with the automatic analyzer module 214 so that a user (e.g. system support) can see which API(s) are having issues and determine corrective measures.
- the user interface may display the results either on the diagnostics server 108 or any of the computing devices 102 for further action.
- the alert may be an email, a text message, a video message or any other type of visual display as envisaged by a person skilled in the art.
- the alert may be displayed on a particular device based on the particular component originating the error as determined from the received health logs.
- the alert is displayed on the user interface along with metadata characterizing the error including associated dependent components to the particular component originating the error.
- an example of the automatic analyzer module 214 generating and sending such an alert 600 to a computing device (e.g. 102 , 102 ′ or 106 , etc.) responsible for error resolution in the system component 104 which generated the error is shown in FIG. 6 .
- the automatic analyzer module 214 is configured to generate an email to the operations or support team (e.g. provided via a unified messaging platform and accessible via the computing devices in FIG. 1 ) detailing the error and reasoning for the error for subsequent resolution thereof.
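- A sketch of how such an alert email might be assembled is shown below; the addresses, subject format and use of Python's standard email library are assumptions, and delivery (e.g. via SMTP) is environment-specific and omitted.

```python
# Sketch of the diagnostic alert assembled by the alert module: plain text that
# names the originating component, the error characterization, and the dependent
# components affected. Addresses and formatting are placeholder assumptions.

from email.message import EmailMessage


def build_alert_email(origin: str, error_type: str, dependents: list[str]) -> EmailMessage:
    msg = EmailMessage()
    msg["Subject"] = f"[Diagnostics] Error originating in {origin}: {error_type}"
    msg["From"] = "diagnostics@example.com"          # placeholder address
    msg["To"] = "support-team@example.com"           # placeholder address
    msg.set_content(
        f"Originating component: {origin}\n"
        f"Error type: {error_type}\n"
        f"Dependent components affected: {', '.join(dependents) or 'none'}\n"
    )
    return msg


alert = build_alert_email("104B", "downstream timeout", ["104A"])
print(alert["Subject"])
# Sending (e.g. via smtplib.SMTP) is environment-specific and omitted here.
```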
- Referring to FIG. 7, shown is an example flow of messages 700, provided in at least one aspect, shown as Message(1)-Message(3) communicated in the network 100 of FIG. 1 between distributed system components 104A-104C (e.g. web tier(s) and API components) associated with distinct computing devices 102A, 102B, and 102C, collectively referred to as 102.
- the health of the distributed applications is monitored via health logs 107 A- 107 C (e.g. asyncMessage(1)-asyncMessage(3)) and subsequently analyzed by the diagnostics server 108 via the automatic analyzer module 214 (e.g. also referred to as UDD—unified deep diagnostic analytics).
- the health logs 107 may utilize a standardized JSON format defining a unified smart log pattern (USLP).
- the unified smart log pattern of the health logs 107 may enable a better understanding of the flow of messages; provide an indication of functional dependencies between the system components; and utilize a linking key metadata that connects messages via a common identifier (e.g. customer ID).
- the automatic analyzer module 214 monitors the health logs and may apply a set of monitoring rules (e.g. monitoring rules 109 in FIG. 1 ) to detect errors including the origination source via pre-defined error patterns shown at step 702 and the expected operational resolution.
- the monitoring rules 109 applied by the automatic analyzer module 214 may include a decision tree or other machine learning trained model which utilizes prior error patterns to predict the error pattern in the current flow of messages 700.
- the results of the error analysis may be provided to a user interface at step 704 , e.g. via another computing device 706 for further resolution.
- An example of the notification provided at step 704 to the other computer 706 responsible for providing system support and error resolution for the system component which originated the error is shown in FIG. 6 .
- the notification provided at step 704 may be provided via e-mail, short message service (SMS), a graphical user interface (GUI), a dashboard (e.g. a type of GUI providing high level view of performance indicators), etc.
- the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over, as one or more instructions or code, a computer-readable medium and executed by a hardware-based processing unit.
- Computer-readable media may include computer-readable storage media, which corresponds to a tangible medium such as data storage media, or communication media including any medium that facilitates transfer of a computer program from one place to another, e.g., according to a communication protocol.
- computer-readable media generally may correspond to (1) tangible computer-readable storage media, which is non-transitory or (2) a communication medium such as a signal or carrier wave.
- Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for implementation of the techniques described in this disclosure.
- a computer program product may include a computer-readable medium.
- such computer-readable storage media can comprise RAM, ROM, EEPROM, optical disk storage, magnetic disk storage, or other magnetic storage devices, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer.
- any connection is properly termed a computer-readable medium.
- computer-readable storage media and data storage media do not include connections, carrier waves, signals, or other transient media, but are instead directed to non-transient, tangible storage media.
- the instructions may be executed by one or more processors, such as one or more general purpose microprocessors, application specific integrated circuits (ASICs), field programmable logic arrays (FPGAs), digital signal processors (DSPs), or other similar integrated or discrete logic circuitry.
- accordingly, "processors" may refer to any of the foregoing examples or any other suitable structure to implement the described techniques.
- the functionality described may be provided within dedicated software modules and/or hardware.
- the techniques could be fully implemented in one or more circuits or logic elements.
- the techniques of this disclosure may be implemented in a wide variety of devices or apparatuses, including an integrated circuit (IC) or a set of ICs (e.g., a chip set).
Abstract
Description
- This application is a continuation of U.S. patent application Ser. No. 16/925,862, filed Jul. 10, 2020, and entitled “SYSTEMS AND METHODS FOR MONITORING APPLICATION HEALTH IN A DISTRIBUTED ARCHITECTURE”, the contents of which are incorporated herein by reference.
- The present disclosure generally relates to monitoring application health of interconnected application system components within a distributed architecture system. More particularly, the disclosure relates to a holistic system for automatically identifying a root source of one or more errors in the distributed architecture system for subsequent analysis.
- In a highly distributed architecture, current error monitoring systems utilize monitoring rules to track only an individual component in the architecture (typically the output component interfacing with external components) and raise an alert based on the individual component being tracked indicating an error. Thus, the monitoring rules can trigger the alert based on the individual component's error log, but do not take into account the whole distributed architecture system; instead they rely on developers to manually troubleshoot and determine where the error may have actually occurred in a fragmented and error-prone manner. That is, current monitoring techniques involve error analysis which is performed haphazardly using trial and error and is heavily human centric. The result is an unpredictable and fragmented analysis that consumes extensive manual time and cost, and the root cause it yields may not be accurate.
- Thus, when there is an error at one of the system components, the analysis requires the support team to manually determine whether the error originated in the component which alerted the error or elsewhere in the system, which may lead to uncertainty and may be unfeasible due to the complexities of the distributed architecture.
- In prior monitoring systems of distributed architectures, when an error occurs within the system, a system component (e.g. an API) directly associated with the user interface reporting the error may first be investigated, and then a manual and resource-intensive approach is performed to examine each and every system component to determine where the error may have originated.
- Accordingly, there is a need to provide a method and system to facilitate automated and dynamic application health monitoring in distributed architecture systems with a view to the entire system, such as to obviate or mitigate at least some or all of the above presented disadvantages.
- It is an object of the disclosure to provide a computing device for improved holistic health monitoring of system components (e.g. API software components) in a multi-component system of a distributed architecture to determine a root cause of errors (e.g. operational issues or software defects). In some aspects, this includes proactively spotting error patterns in the distributed architecture and notifying parties. The proposed disclosure provides, in at least some aspects, a standardized mechanism of automatically determining one or more system components (e.g. an API) originating the error in the distributed architecture.
- There is provided a computing device for monitoring and analyzing health of a distributed computer system having a plurality of interconnected system components, the computing device having a processor coupled to a memory, the memory storing instructions which when executed by the processor configure the computing device to: track communication between the system components and monitor for an alert indicating an error in the communication in the distributed computer system, upon detecting the error: receive a health log from each of the system components together defining an aggregate health log, each health log being in a standardized format indicating messages communicated between the system components; receive, from a data store, network infrastructure information defining one or more relationships for connectivity and communication flow between the system components, the relationships characterizing dependency information between the system components; and, automatically determine, based on the aggregate health log and the network infrastructure information, a particular component of the system components originating the error and associated dependent components from the system components affected. The standardized format may comprise a JSON format.
- Each health log may further comprise a common identifier for tracing a route of the messages communicated for a transaction having the error.
- The computing device may further be configured to obtain health monitoring rules comprising data integrity information for pre-defined communications between the system components from the data store, the health monitoring rules for verifying whether each of the health logs complies with the data integrity information.
- The health monitoring rules may further be defined based on historical error patterns for the distributed computer system, associating a set of traffic flows for the messages between the system components and potentially occurring in each of the health logs to a corresponding error type.
- The computing device may further be configured to: determine from the dependency information indicating which of the system components are dependent on one another for operations performed in the distributed computer system, an impact of the error originated by the particular component on the associated dependent components.
- The computing device may further be configured to, upon detecting the alert: display the alert on a user interface of a client application for the computing device, the alert based on the particular component originating the error determined from the aggregate health log.
- The computing device may further be configured to display on the user interface, along with the alert, the associated dependent components to the particular component.
- The system components may be APIs (application programming interfaces) on one or more connected computing devices and the health log may be an API log for logging activity for the respective API in communication with other APIs and related to the error.
- The processor may further configure the computing device to automatically determine origination of the error by: comparing each of the health logs in the aggregate health log to the other health logs in response to the relationships in the network infrastructure information.
- There is provided a computer implemented method for monitoring and analyzing health of a distributed computer system having a plurality of interconnected system components, the method comprising: tracking communication between the system components and monitoring for an alert indicating an error in the communication in the distributed computer system, upon detecting the error: receiving a health log from each of the system components together defining an aggregate health log, each health log being in a standardized format indicating messages communicated between the system components; receiving, from a data store, network infrastructure information defining one or more relationships for connectivity and communication flow between the system components, the relationships characterizing dependency information between the system components; and, automatically determining, based on the aggregate health log and the network infrastructure information, a particular component of the system components originating the error and associated dependent components from the system components affected.
- There is provided a computer readable medium comprising a non-transitory device storing instructions and data, which when executed by a processor of a computing device, the processor coupled to a memory, configure the computing device to: track communication between system components of a distributed computer system having a plurality of interconnected system components and monitor for an alert indicating an error in the communication in the distributed computer system, upon detecting the error: receive a health log from each of the system components together defining an aggregate health log, each health log being in a standardized format indicating messages communicated between the system components; receive, from a data store, network infrastructure information defining one or more relationships for connectivity and communication flow between the system components, the relationships characterizing dependency information between the system components; and, automatically determine, based on the aggregate health log and the network infrastructure information, a particular component of the system components originating the error and associated dependent components from the system components affected.
- There is provided a computer program product comprising a non-transient storage device storing instructions that when executed by at least one processor of a computing device, configure the computing device to perform in accordance with the methods herein.
- These and other features of the disclosure will become more apparent from the following description in which reference is made to the appended drawings wherein:
-
FIG. 1 is a schematic block diagram of a computing system environment for providing automated application health monitoring and error origination analysis in accordance with one or more aspects of the present disclosure. -
FIG. 2 is a schematic block diagram illustrating example components of a diagnostics server in FIG. 1, in accordance with one or more aspects of the present disclosure. -
FIG. 3 is a flowchart illustrating example operations of the diagnostics server of FIG. 1, in accordance with one or more aspects of the present disclosure. -
FIG. 4 is a schematic block diagram showing an example communication between the computing device comprising a plurality of interconnected system components A, B, and C and the diagnostics server components comprising the automatic analyzer and the data store in FIG. 1, in accordance with one or more aspects of the present disclosure. -
FIG. 5 is a diagram illustrating example health logs received from different system components and actions taken at the automatic analyzer in accordance with one or more aspects of the present disclosure. -
FIG. 6 is a diagram illustrating an example diagnostic results alert in accordance with one or more aspects of the present disclosure. -
FIG. 7 is a diagram illustrating a typical flow of communication for the health monitoring performed by the automatic analyzer for providing deep diagnostic analytics in accordance with one or more aspects of the present disclosure. -
FIG. 1 is a diagram illustrating an example computer network 100 in which a diagnostics server 108 is configured for providing unified deep diagnostics of distributed system components and particularly, error characterization analysis for the distributed components of one or more computing device(s) 102 communicating across a communication network 106. The diagnostics server 108 is configured to receive an aggregate health log including communication health logs 107 (individually 107A, 107B . . . 107N) from each of the system components, collectively shown as system components 104 (individually shown as system components 104A-104N) such as API components in a standard format. The communication health logs 107 may be linked for example via a common key tracing identifier that may show that a particular transaction involved components A, B, and C and the types of events or messages communicated for the transaction, by way of example. In one example, the common identifier comprises key metadata that interconnects via an entity function role. In one case, if the messages communicated between components 104A-104N are financial transactions then the common tracing identifier may link parties affecting a particular financial transaction. The common tracing identifier (e.g. traceability ID 506 in FIG. 5) may further be modified each time it is processed or otherwise passes through one of the components 104 to also facilitate identifying a path taken by a message when communicated between the components 104 during performing a particular function (e.g. effecting a transaction). - The computing device(s) 102 each comprise at least a
processor 103, a memory 105 (e.g. a storage device, etc.) and one or moredistributed system components 104. Thememory 105 storing instructions which when executed by the computing device(s) 102 configure the computing device(s) 102 to perform operations described herein. Thedistributed system components 104 may be configured (e.g. via the instructions stored in the memory 105) to provide the distributed architecture system described herein for collaborating together to provide a common goal such as access to resources on thecomputing devices 102; or access to communication services provided by thecomputing device 102; or performing one or more tasks in a distributed manner such that the computing nodes work together to provide the desired task functionality. Thedistributed system components 104 may comprise distributed applications such as application programming interfaces (APIs), user interfaces, etc. - In some aspects, such a distributed architecture system provided by the computing device(s) 102 includes the
components 104 being provided on different platforms (e.g. correspondingly different machines such that there are at least twocomputing devices 102 each containing some of the components 104) so that a plurality of the components (e.g. 104A . . . 104N) can cooperate with one another over thecommunication network 106 in order to achieve a specific objective or goal (e.g. completing a transaction or performing a financial trade). For example the computing device(s) 102 may be one or more distributed servers for various functionalities such as provided in a trading platform. Another example of the distributed system provided by computing device(s) 102 may be a client/server model. In this aspect no single computer in the system carries the entire load on system resources but rather the collaborating computers (e.g. at least two computing devices 102) execute jobs in one or more remote locations. - In yet another aspect, the distributed architecture system provided by the computing device(s) 102 may be more generally, a collection of autonomous computing elements (e.g. which may be either hardware devices and/or a software processes such as system components 104) that appear to users as a single coherent system. Typically, the computing elements (e.g. either independent machines or independent software processes) collaborate together in such a way via a common communication network (e.g. network 106) to perform related tasks. Thus, the existence of multiple computing elements is transparent to the user in a distributed system.
- Furthermore, as described herein, although a
single computing device 102 is shown inFIG. 1 with distributed computing elements provided bysystem components 104A-104N which reside on thesingle computing device 102; alternatively, a plurality ofcomputing devices 102 connected across thecommunication network 106 in thenetwork 100 may be provided with thecomponents 104 spread across thecomputing devices 102 to collaborate and perform the distributed functionality viamultiple computing devices 102. - The
communications network 106 is thus coupled for communication with a plurality of computing devices. It is understood that communication network 106 is simplified for illustrative purposes. Communication network 106 may comprise additional networks coupled to the WAN such as a wireless network and/or local area network (LAN) between the WAN and the computing devices 102 and/or diagnostics server 108. - The diagnostics server 108 further retrieves
network infrastructure information 111 for the system components 104 (e.g. may be stored on thedata store 116, or directly provided from thecomputing devices 102 hosting the system components 104). Thenetwork infrastructure information 111 may characterize various types of relationships between the system components and/or communication connectivity information for the system components. For example, this may include dependency relationships, such as operational dependencies or communication dependencies between thesystem components 104A-104N for determining the health of the system and tracing an error in the system to its source. - The operational dependencies may include for example, whether a
system component 104 requires another component to call upon or otherwise involve in order to perform system functionalities (e.g. performing a financial transaction may require component A to call upon functionalities of components B and N). The communication dependencies may include information about whichcomponents 104 are able to communicate with and/or receive information from one another (e.g. have wired or wireless communication links connecting them). - Additionally, the
diagnostic server 108 comprises anautomatic analyzer module 214 communicating with thedata store 116 as will be further described with respect toFIG. 2 . Theautomatic analyzer module 214 receivesaggregate health logs 107 for each of thecomponents 104A . . . 104N associated with a particular task or job (e.g. accessing a resource provided by components 104) as well asnetwork infrastructure information 111 and is then configured to determine a root cause of the error characterizing a particular system component (e.g. 104A) which originated an error in the system. Theautomatic analyzer module 214 may be triggered to detect the source of an error upon monitoring system behaviors and determining that an error has occurred in thenetwork 100. Such a determination may be made by applying a set ofmonitoring rules 109 via theautomatic analyzer module 214 which are based on historical error patterns for thesystem components 104 and associated traffic patterns thereby allowing deeper understanding of the error (e.g. API connection error) and the expected operational resolution. In one aspect, the monitoring rules 109 may be used by theautomatic analyzer module 214 to map a historical error pattern (e.g. communications betweencomponents 104 following a specific traffic pattern as may be predicted by a machine learning module inFIG. 2 ) to a specific error type. Additionally, in at least one aspect, thehealth monitoring rules 109 may indicate data integrity metadata indicating a format and/or content of messages communicated betweencomponents 104. In this way, when the messages differ from the data integrity metadata, then theautomatic analyzer module 214 may indicate (e.g. via a display on thediagnostics server 108 or computing device 102) that the error relates to data integrity deviations. - Additionally, in at least one aspect, the
automatic analyzer module 214 may use thenetwork infrastructure information 111 and the monitoring rules 109 (mapping error patterns to additional metadata characterizing the error) to identify the error, its root cause (e.g. via the relationship information in the network infrastructure information 111) and the dependency impact includingother system components 104 affected by the error and having a relationship to the error originating system component. - Thus, in one or more aspects, the
network 100 utilizes a holistic approach to health monitoring by providing anautomatic analyzer 214 coupled to all of the system components 104 (e.g. APIs) via thenetwork 106 for analyzing the health of thesystem components 104 as a whole and individually. Notably, when an error occurs in the system (e.g. an API fails to perform an expected function or timeout occurs), the error may be tracked and its origin located. - In at least one aspect, the
health logs 107 are converted to and/or provided in a standardized format (e.g. JSON format) from each of the system components 104. The standardized format may further include a smart log pattern which can reveal functional dependencies between the system components 104, and key metadata which interconnects messages for a particular task or job (e.g. customer identification). The diagnostics server 108 is thus configured to receive the health logs 107 in a standardized format as well as receiving information about the network infrastructure (e.g. relationships and dependencies between the system components) from a data store to determine whether a detected system error follows a specific system error pattern and therefore the dependency impact of the error on related system components 104.
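Purely by way of illustration, the short sketch below shows what standardized JSON health logs carrying a common traceability identifier might look like and how a message route could be reconstructed from them. The field names (traceability_id, component, event, timestamp) and the hop-suffix convention are assumptions made for this example and are not a required schema of the present disclosure.

```python
import json
from datetime import datetime

# Hypothetical health log entries emitted by three system components for one
# transaction; the field names and values are illustrative assumptions only.
raw_logs = [
    '{"traceability_id": "txn-42/A", "component": "A", "event": "request_sent", "timestamp": "2020-07-10T12:00:00"}',
    '{"traceability_id": "txn-42/A/B", "component": "B", "event": "request_forwarded", "timestamp": "2020-07-10T12:00:01"}',
    '{"traceability_id": "txn-42/A/B/C", "component": "C", "event": "timeout_error", "timestamp": "2020-07-10T12:00:05"}',
]

entries = [json.loads(line) for line in raw_logs]

# Group entries by the common portion of the traceability identifier
# (the part before the per-component hop suffixes).
transaction_key = entries[0]["traceability_id"].split("/")[0]
related = [e for e in entries if e["traceability_id"].startswith(transaction_key)]

# Sort by timestamp to recover the order in which the message traversed the components.
related.sort(key=lambda e: datetime.fromisoformat(e["timestamp"]))
route = " -> ".join(e["component"] for e in related)

print(f"Transaction {transaction_key} route: {route}")
for e in related:
    print(f'  {e["timestamp"]}  {e["component"]:>2}  {e["event"]}')
```

In this sketch the identifier grows by one suffix per hop, which is one possible reading of the identifier being modified as it passes through each component; other encodings of the path would serve the same purpose.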
FIG. 2 is a diagram illustrating in schematic form an example computing device (e.g. diagnostics server 108 ofFIG. 1 ), in accordance with one or more aspects of the present disclosure. Thediagnostics server 108 facilitates providing a system to perform health monitoring of distributed architecture components (e.g. APIs) as a whole using health logs (e.g. API logs) and network architecture information defining relationships for the distributed architecture components. The system may further capture key metadata (e.g. key identifiers such as digital identification number of a transaction across an institution among various distributed components) to track messages communicated between the components and facilitate determining the route taken by the message when an error was generated. Preferably, as described herein, thediagnostics server 108 is configured to utilize at least the health logs and the network architecture information to determine a root cause of an error generated in the overall system. -
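Before turning to the individual components of FIG. 2, a rough sketch of the kind of determination the diagnostics server 108 is described as making may be helpful. The sketch below combines per-component error indications taken from an aggregate health log with dependency relationships taken from network infrastructure information to select an originating component and the dependent components it affects. The data shapes, helper names and the selection rule (an erroring component none of whose own dependencies is erroring) are assumptions for illustration, not the specific algorithm of the disclosure.

```python
from collections import deque

# Assumed shape of network infrastructure information: which components each
# component depends on (calls upon) to complete its operations.
depends_on = {
    "A": ["B"],  # A calls B
    "B": ["C"],  # B calls C
    "C": [],     # C has no downstream dependency
}

# Assumed summary of the aggregate health log: components whose logs report the error.
components_with_error = {"A", "B", "C"}

def find_originating_component(depends_on, erroring):
    """Pick an erroring component whose own dependencies are all healthy, i.e. the
    error cannot be explained by anything further downstream."""
    for component in erroring:
        if not any(dep in erroring for dep in depends_on.get(component, [])):
            return component
    return None

def affected_dependents(depends_on, origin):
    """Components that directly or transitively depend on the originating component."""
    reverse = {c: [] for c in depends_on}
    for component, deps in depends_on.items():
        for dep in deps:
            reverse.setdefault(dep, []).append(component)
    seen, queue = set(), deque([origin])
    while queue:
        for parent in reverse.get(queue.popleft(), []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

origin = find_originating_component(depends_on, components_with_error)
print("originating component:", origin)                                            # C
print("affected dependent components:", affected_dependents(depends_on, origin))   # e.g. {'A', 'B'}
```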
Diagnostics server 108 comprises one or more processors 202, one or more input devices 204, one or more communication units 206 and one or more output devices 208. Diagnostics server 108 also includes one or more storage devices 210 storing one or more modules such as automatic analyzer module 214; data integrity validation module 216; infrastructure validation module 218; machine learning module 220; alert module 222; and a data store 116 for storing data comprising health logs 107; monitoring rules 109; and network infrastructure information 111. -
Communication channels 224 may couple each of thecomponents communication channels 224 may include a system bus, a network connection, an inter-process communication data structure, or any other method for communicating data. - One or
more processors 202 may implement functionality and/or execute instructions withindiagnostics server 108. For example,processors 202 may be configured to receive instructions and/or data fromstorage devices 210 to execute the functionality of the modules shown inFIG. 2 , among others (e.g. operating system, applications, etc.)Diagnostics server 108 may store data/information tostorage devices 210 such ashealth logs 107; monitoringrules 109 andnetwork infrastructure info 111. Some of the functionality is described further below. - One or
more communication units 206 may communicate with external devices via one or more networks (e.g. communication network 106) by transmitting and/or receiving network signals on the one or more networks. The communication units may include various antennae and/or network interface cards, etc. for wireless and/or wired communications. - Input and output devices may include any of one or more buttons, switches, pointing devices, cameras, a keyboard, a microphone, one or more sensors (e.g. biometric, etc.) a speaker, a bell, one or more lights, etc. One or more of same may be coupled via a universal serial bus (USB) or other communication channel (e.g. 224).
- The one or
more storage devices 210 may store instructions and/or data for processing during operation ofdiagnostics server 108. The one or more storage devices may take different forms and/or configurations, for example, as short-term memory or long-term memory.Storage devices 210 may be configured for short-term storage of information as volatile memory, which does not retain stored contents when power is removed. Volatile memory examples include random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), etc.Storage devices 210, in some examples, also include one or more computer-readable storage media, for example, to store larger amounts of information than volatile memory and/or to store such information for long term, retaining information when power is removed. Non-volatile memory examples include magnetic hard discs, optical discs, floppy discs, flash memories, or forms of electrically programmable memory (EPROM) or electrically erasable and programmable (EEPROM) memory. - Referring to
FIGS. 1 and 2 ,automatic analyzer module 214 may comprise an application which monitors communications betweensystem components 104 and monitors for an alert indicating an error in the communications. Upon indication of an alert, theautomatic analyzer module 214 receives an input indicating a health log (e.g. 107A . . . 107N) from each of thesystem components 104 together defining anaggregate health log 107 and thenetwork infrastructure information 111 defining relationships including interdependencies for connectivity and/or operation and/or communication between the system components. Based on this, theautomatic analyzer module 214 automatically determines a particular component of the system components originating the error and associated dependent components from the system components affected. In one example, this may include theautomatic analyzer module 214 using the standardized format of messages in thehealth logs 107 to capture key identifiers (e.g. connection identifiers, message identifiers, etc.) linking a particular task to the messages and depicting a route travelled by the messages and applying thenetwork infrastructure information 111 to the health logs to reveal a source of the error and the dependency impact. In some aspects, theautomatic analyzer module 214 further accesses a set ofmonitoring rules 109 which may associate specific types of messages or traffic flows indicated in the health logs with specific system error patterns and typical dependency impacts (e.g. for a particular type of error X, system components A, B, and C would be affected). - The
machine learning module 220 may be configured to track communication flows betweencomponents 104, usage/error patterns of thecomponents 104 over a past time period to the current time period and help predict the presence of an error and its characteristics. Themachine learning module 220 may generate a mapping table between specific error patterns in the messages communicated between thecomponents 104 and corresponding information characterizing the error including error type, possible dependencies and expected operational resolution. In this way, themachine learning module 220 may utilize machine learning models such as regression techniques or convolutional neural networks, etc. to proactively predict additional error patterns and associated details based on historical usage data. In at least some aspects, themachine learning module 220 cooperates with theautomatic analyzer module 214 for proactively determining that an error exists and characterizing the error. - Data
integrity validation module 216 may be configured to retrieve a set of predefined data integrity rules provided in the monitoring rules 109 to determine whether the data in the health logs 107 satisfies the data integrity rules (e.g. format and/or content of messages in the health logs 107). -
Infrastructure validation module 218 may be configured to retrieve a set of predefined network infrastructure rules (e.g. for a particular task) based on information determined from thehealth logs 107 and determine whether the data in thenetwork infrastructure info 111 satisfies thepredefined rules 109. -
Alert module 222 may comprise a user interface either located on theserver 108 or control of an external user interface (e.g. via the communication units 206) to display the error detected by theserver 108 and characterizing information (e.g. the source of the error, dependency impacts, and possible operational solutions) to assist with the resolution of the error. An example of such an alert is shown inFIG. 6 . - Referring again to
FIG. 2 , it is understood that operations may not fall exactly within themodules 214; 216; 218; 220; and 222 such that one module may assist with the functionality of another. -
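To make the module descriptions above a little more concrete, the following sketch shows the kind of plain-text diagnostic alert an alert module such as alert module 222 might assemble from the analyzer's results; the wording and fields are illustrative assumptions rather than the alert format shown in FIG. 6.

```python
def build_alert(origin, error_type, dependents, resolution):
    """Assemble a plain-text diagnostic alert (e.g. for email or a dashboard)."""
    lines = [
        f"Health monitoring alert: error originated at component {origin}",
        f"Error type: {error_type}",
        f"Dependent components affected: {', '.join(sorted(dependents)) or 'none'}",
        f"Suggested operational resolution: {resolution}",
    ]
    return "\n".join(lines)

print(build_alert(
    origin="C",
    error_type="downstream API timeout",
    dependents={"A", "B"},
    resolution="restart the downstream component's connection pool",
))
```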
FIG. 3 is a flow chart of operations 300 which are performed by a computing device such as the diagnostics server 108 shown in FIGS. 1 and 2. The computing device may comprise a processor and a communications unit configured to communicate with distributed system application components such as API components to monitor the application health of the system components and to determine the source of an error for subsequent resolution. The computing device (e.g. the diagnostics server 108) is configured to utilize instructions (stored in a non-transient storage device), which when executed by the processor configure the computing device to perform operations such as operations 300.
- In one aspect, the health monitoring rules may further be defined based on historical error patterns for the communications in the distributed computer system. That is, a pattern set of pre-defined communication traffic flows for messages between the system components which may occur in each of the health logs may be mapped to particular error types. Thus, when a defined communication traffic flow is detected, it may be mapped to a particular error pattern thereby allowing further characterization of the error by error type including possible resolution.
- Operations 304-308 of the computing device are triggered in response to detecting the presence of the error. At 304, upon detecting the error, operations of the computing device trigger receiving a health log from each of the system components (e.g. 104A-104N, collectively 104) together defining an aggregate health log. The health logs may be in a standardized format (e.g. JSON format) and utilize common key identifiers (e.g. connection identifier, digital identifier of a transaction, etc.). This allows consistency in the information communicated and tracking of the messages such that it can be used to determine a context of the messages and mapped to capture the key identifiers across the distributed components. In one aspect, the common key identifiers are used by the computing device for tracing a route of the messages communicated between the distributed system components and particularly, for a transaction having the error. Additionally, in one aspect, the health logs may follow a particular log pattern with one or more metadata (e.g. customer identification number, traceability identification number, timestamp, event information, etc.) which allows tracking and identification of messages communicated with the distributed system components. An example of the format of the health logs in shown in
FIG. 5 . - At 306 and further in response to detecting the error, operations of the computing device (e.g. diagnostics server 108) configure receiving from a data store of the computing device, network infrastructure information defining one or more relationship for connectivity and communication flow between the system components. The relationships characterize dependency information between the system components. The network infrastructure information may indicate for example, how the components are connected to one another and for a set of defined operations, how they are dependent upon and utilize resources of another component in order to perform the defined operation.
- At 308, operations of the computing device automatically determine at least based on the aggregate health log and the network infrastructure information, a particular component of the system components originating the error and associated dependent components affected from the system components.
- In a further aspect, automatically determining the origination of the error in a distributed component system includes comparing each of the health logs to the other health logs in response to the relationships in the network infrastructure information and may include mapping the information to predefined patterns for the logs to determine where the deviations from the expected communications may have occurred.
- Referring to
FIG. 4 , shown is an example scenario for flow of messages between distributed system components located both internal to an organization (e.g. on a private network) and remote to the organization (e.g. outside the private network).FIG. 4 further illustrates monitoring of health of the distributed components including error source origination detection for an error occurring in the message. As shown inFIG. 4 , flow of messages may occur betweeninternal system components 104A-104C located on a first computing device (e.g. computing device 102 ofFIG. 1 ) andcomponent 104D of an external computing device (e.g. asecond computing device 102′) located outside the institution provided by systems A-C. Other variations of distributions of the system components on computing devices may be envisaged. For example, eachsystem component 104A-104D may reside on distinct computing devices altogether. - The path of a message is shown as travelling across
link 401A to 401B to 401C. - Thus, as described above, the
automatic analyzer module 214 initially receives a set of API logs (e.g.aggregate health logs 107A-107C characterizing message activity forsystem components 104A-104D, events and/or errors communicated acrosslinks 401A-401C) in a standardized format. The standardized format may be JSON and one or more key identifiers that link together the API logs as being related to a task or operation. -
FIG. 5 illustrates example API logs 501-503 (a type ofhealth logs 107A-107C) which may be communicated betweensystem components 104 such assystem components 104A-104D ofFIG. 4 . For example, each API log from anAPI system component 104 would include API event information such as interactions with the API including calls or requests and its content. The API logs further include atimestamp 504 indicating a time of message and atraceability ID 506 which allows tracking a message path from one API to another (e.g. as shown in API logs 501-503). - For example, a message sent from a first API to a second API would have the same traceability ID (or at least a common portion in the traceability ID 506) with
different timestamps 504. As noted above, when an error is detected in the overall system (e.g. error 507 in API log 503), the API logs 501-503 for all of the system components are reviewed at theautomatic analyzer module 214. Additionally, theautomatic analyzer module 214 receivesnetwork infrastructure info 111 metadata which defines relationships between thevarious API components 104 in the system including which component systems are dependent on others for each pre-defined type of action (e.g. message communication, performing a particular task, accessing a resource, etc.). Further, theautomatic analyzer module 214 may retrieve from adata store 116, a set ofhealth monitoring rules 109 which can define historical error patterns (e.g. an error of type X typically follows a path fromAPI 1 to API 2) to recognize and diagnose errors. For example, the set ofhealth monitoring rules 109 may map a traffic pattern between the API logs (e.g. API logs 501-503) to a particular type of error. - Thus, referring again to
FIGS. 4 and 5 , once an error is detected in the overall system (e.g. the error 507), theautomatic analyzer module 214 utilizes the aggregate API logs 107A-107C (e.g. received from each of the system components having the same traceability ID), thenetwork infrastructure information 111 and the monitoring rules 109 to determine which of the system components originated the error, characterizations of the error (e.g. based on historical error patterns) and associated dependent components directly affected by theerror 507. The disclosed method and system allows diagnosis of health of application data communicated between APIs and locating the errors for subsequent analysis, in one or more aspects. - Subsequent to the above automatic determination of application health by the
automatic analyzer module 214, including characterizing the error 507 (e.g. based onmonitoring rules 109 characterizing prior error issues and types communicated betweensystem components 104A-104D) along with which component(s) are responsible for theerror 507 in the system (e.g. based on digesting the network infrastructure info 111) and associated components, the system may provide the diagnostic results as an alert to a user interface. The user interface may be associated with theautomatic analyzer module 214 so that a user (e.g. system support) can see which API(s) are having issues and determine corrective measures. The user interface may display the results either on thediagnostics server 108 or any of thecomputing devices 102 for further action. This allows thenetwork 100 shown inFIG. 1 to monitor its distributedcomponents 104 and be proactive in providing error notification diagnostics for their systems support. The alert may be an email, a text message, a video message or any other types of visual displays as envisaged by a person skilled in the art. In one aspect, the alert may be displayed on a particular device based on the particular component originating the error as determined from the received health logs. In a further aspect, the alert is displayed on the user interface along with metadata characterizing the error including associated dependent components to the particular component originating the error. - Referring to
FIGS. 4-6 , an example of theautomatic analyzer module 214 generating and sending such an alert 600 to a computing device (e.g. 102, 102′ or 106, etc.) responsible for error resolution in thesystem component 104 which generated the error is shown inFIG. 6 . In the case ofFIG. 6 , theautomatic analyzer module 214 is configured to generate an email to the operations or support team (e.g. provided via a unified messaging platform and accessible via the computing devices inFIG. 1 ) detailing the error and reasoning for the error for subsequent resolution thereof. - Referring now to
FIG. 7 shown is an example flow ofmessages 700, provided in at least one aspect, shown as Message(1)-Message(3) communicated in thenetwork 100 ofFIG. 1 between distributedsystem components 104A-104C (e.g. web tier(s) and API components) associated with distinct computing devices 102A, 102B, and 102C, collectively referred to as 102. The health of the distributed applications is monitored viahealth logs 107A-107C (e.g. asyncMessage(1)-asyncMessage(3)) and subsequently analyzed by thediagnostics server 108 via the automatic analyzer module 214 (e.g. also referred to as UDD—unified deep diagnostic analytics). As noted above, thehealth logs 107 may utilize a standardized JSON format defining a unified smart log pattern (USLP). The unified smart log pattern of thehealth logs 107 may enable a better understanding of the flow of messages; provide an indication of functional dependencies between the system components; and utilize a linking key metadata that connects messages via a common identifier (e.g. customer ID). - Additionally, as noted above, the
automatic analyzer module 214 monitors the health logs and may apply a set of monitoring rules (e.g. monitoring rules 109 inFIG. 1 ) to detect errors including the origination source via pre-defined error patterns shown atstep 702 and the expected operational resolution. In at least some aspects, the monitoring rules 109 applied by theanalytics analyzer module 214 may include a decision tree or other machine learning trained model which utilizes prior error patterns to predict the error pattern in the current flow ofmessages 700. The results of the error analysis may be provided to a user interface atstep 704, e.g. via anothercomputing device 706 for further resolution. An example of the notification provided atstep 704 to theother computer 706 responsible for providing system support and error resolution for the system component which originated the error is shown inFIG. 6 . The notification provided atstep 704 may be provided via e-mail, short message service (SMS), a graphical user interface (GUI), a dashboard (e.g. a type of GUI providing high level view of performance indicators), etc. - In one or more examples, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over, as one or more instructions or code, a computer-readable medium and executed by a hardware-based processing unit.
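As one possible, and purely assumed, realization of such a trained model, the sketch below fits a small scikit-learn decision tree on per-link message counts labelled with past error types and uses it to predict the error type for a new traffic window. The feature choice, the labels and the use of scikit-learn are assumptions made for illustration and are not requirements of the disclosure.

```python
from sklearn.tree import DecisionTreeClassifier

# Features: message counts per link [A->B, B->C, C->external] observed in a window.
# Labels: the error type recorded for that historical window (invented examples).
historical_features = [
    [120, 118, 0],    # downstream link silent              -> API connection error
    [120, 119, 117],  # balanced traffic                    -> no error
    [120, 60, 58],    # traffic halved after the first hop  -> data integrity rejections
]
historical_labels = ["api_connection_error", "no_error", "data_integrity_error"]

model = DecisionTreeClassifier(random_state=0)
model.fit(historical_features, historical_labels)

current_window = [[118, 117, 1]]  # downstream link nearly silent again
print(model.predict(current_window)[0])
```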
- Computer-readable media may include computer-readable storage media, which corresponds to a tangible medium such as data storage media, or communication media including any medium that facilitates transfer of a computer program from one place to another, e.g., according to a communication protocol. In this manner, computer-readable media generally may correspond to (1) tangible computer-readable storage media, which is non-transitory or (2) a communication medium such as a signal or carrier wave. Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for implementation of the techniques described in this disclosure. A computer program product may include a computer-readable medium. By way of example, and not limitation, such computer-readable storage media can comprise RAM, ROM, EEPROM, optical disk storage, magnetic disk storage, or other magnetic storage devices, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium. For example, if instructions are transmitted from a website, server, or other remote source using wired or wireless technologies, such are included in the definition of medium. It should be understood, however, that computer-readable storage media and data storage media do not include connections, carrier waves, signals, or other transient media, but are instead directed to non-transient, tangible storage media.
- Instructions may be executed by one or more processors, such as one or more general purpose microprocessors, application specific integrated circuits (ASICs), field programmable logic arrays (FPGAs), digital signal processors (DSPs), or other similar integrated or discrete logic circuitry. The term “processor,” as used herein may refer to any of the foregoing examples or any other suitable structure to implement the described techniques. In addition, in some aspects, the functionality described may be provided within dedicated software modules and/or hardware. Also, the techniques could be fully implemented in one or more circuits or logic elements. The techniques of this disclosure may be implemented in a wide variety of devices or apparatuses, an integrated circuit (IC) or a set of ICs (e.g., a chip set).
- One or more currently preferred embodiments have been described by way of example. It will be apparent to persons skilled in the art that a number of variations and modifications can be made without departing from the scope of the invention as defined in the claims.
Claims (29)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US18/139,101 US20230259436A1 (en) | 2020-07-10 | 2023-04-25 | Systems and methods for monitoring application health in a distributed architecture |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/925,862 US11669423B2 (en) | 2020-07-10 | 2020-07-10 | Systems and methods for monitoring application health in a distributed architecture |
US18/139,101 US20230259436A1 (en) | 2020-07-10 | 2023-04-25 | Systems and methods for monitoring application health in a distributed architecture |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/925,862 Continuation US11669423B2 (en) | 2020-07-10 | 2020-07-10 | Systems and methods for monitoring application health in a distributed architecture |
Publications (1)
Publication Number | Publication Date |
---|---|
US20230259436A1 true US20230259436A1 (en) | 2023-08-17 |
Family
ID=79173681
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/925,862 Active 2040-08-13 US11669423B2 (en) | 2020-07-10 | 2020-07-10 | Systems and methods for monitoring application health in a distributed architecture |
US18/139,101 Pending US20230259436A1 (en) | 2020-07-10 | 2023-04-25 | Systems and methods for monitoring application health in a distributed architecture |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/925,862 Active 2040-08-13 US11669423B2 (en) | 2020-07-10 | 2020-07-10 | Systems and methods for monitoring application health in a distributed architecture |
Country Status (1)
Country | Link |
---|---|
US (2) | US11669423B2 (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11669423B2 (en) * | 2020-07-10 | 2023-06-06 | The Toronto-Dominion Bank | Systems and methods for monitoring application health in a distributed architecture |
KR20220080915A (en) * | 2020-12-08 | 2022-06-15 | 삼성전자주식회사 | Method for operating storage device and host device, and storage device |
US11347579B1 (en) * | 2021-04-29 | 2022-05-31 | Bank Of America Corporation | Instinctive slither application assessment engine |
US11949571B2 (en) * | 2021-05-24 | 2024-04-02 | Dell Products L.P. | Unified telemetry data |
US12093389B2 (en) * | 2022-03-14 | 2024-09-17 | Microsoft Technology Licensing, Llc | Data traffic characterization prioritization |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190258535A1 (en) * | 2018-02-22 | 2019-08-22 | Red Hat, Inc. | Determining relationships between components in a computing environment to facilitate root-cause analysis |
US20200409831A1 (en) * | 2019-06-27 | 2020-12-31 | Capital One Services, Llc | Testing agent for application dependency discovery, reporting, and management tool |
US11669423B2 (en) * | 2020-07-10 | 2023-06-06 | The Toronto-Dominion Bank | Systems and methods for monitoring application health in a distributed architecture |
Family Cites Families (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6418445B1 (en) * | 1998-03-06 | 2002-07-09 | Perot Systems Corporation | System and method for distributed data collection and storage |
US6353902B1 (en) * | 1999-06-08 | 2002-03-05 | Nortel Networks Limited | Network fault prediction and proactive maintenance system |
GB2361382A (en) * | 2000-04-12 | 2001-10-17 | Mitel Corp | Tree hierarchy and description for generated logs |
US7590726B2 (en) * | 2003-11-25 | 2009-09-15 | Microsoft Corporation | Systems and methods for unifying and/or utilizing state information for managing networked systems |
US7523357B2 (en) | 2006-01-24 | 2009-04-21 | International Business Machines Corporation | Monitoring system and method |
US20080016115A1 (en) * | 2006-07-17 | 2008-01-17 | Microsoft Corporation | Managing Networks Using Dependency Analysis |
US8732530B2 (en) | 2011-09-30 | 2014-05-20 | Yokogawa Electric Corporation | System and method for self-diagnosis and error reporting |
US9104565B2 (en) | 2011-12-29 | 2015-08-11 | Electronics And Telecommunications Research Institute | Fault tracing system and method for remote maintenance |
US20140122930A1 (en) | 2012-10-25 | 2014-05-01 | International Business Machines Corporation | Performing diagnostic tests in a data center |
US9652316B2 (en) * | 2015-03-31 | 2017-05-16 | Ca, Inc. | Preventing and servicing system errors with event pattern correlation |
US10353762B2 (en) * | 2015-06-11 | 2019-07-16 | Instana, Inc. | Hierarchical fault determination in an application performance management system |
US9529662B1 (en) | 2015-07-31 | 2016-12-27 | Netapp, Inc. | Dynamic rule-based automatic crash dump analyzer |
US10637745B2 (en) * | 2016-07-29 | 2020-04-28 | Cisco Technology, Inc. | Algorithms for root cause analysis |
US10127125B2 (en) * | 2016-10-21 | 2018-11-13 | Accenture Global Solutions Limited | Application monitoring and failure prediction |
US10338986B2 (en) | 2016-10-28 | 2019-07-02 | Microsoft Technology Licensing, Llc | Systems and methods for correlating errors to processing steps and data records to facilitate understanding of errors |
US10977154B2 (en) * | 2018-08-03 | 2021-04-13 | Dynatrace Llc | Method and system for automatic real-time causality analysis of end user impacting system anomalies using causality rules and topological understanding of the system to effectively filter relevant monitoring data |
US10747544B1 (en) * | 2019-06-27 | 2020-08-18 | Capital One Services, Llc | Dependency analyzer in application dependency discovery, reporting, and management tool |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190258535A1 (en) * | 2018-02-22 | 2019-08-22 | Red Hat, Inc. | Determining relationships between components in a computing environment to facilitate root-cause analysis |
US20200409831A1 (en) * | 2019-06-27 | 2020-12-31 | Capital One Services, Llc | Testing agent for application dependency discovery, reporting, and management tool |
US11669423B2 (en) * | 2020-07-10 | 2023-06-06 | The Toronto-Dominion Bank | Systems and methods for monitoring application health in a distributed architecture |
Non-Patent Citations (2)
Title |
---|
Google Scholar//Patents search - text refined (Year: 2024) * |
Google Scholar/Patents search - text refined (Year: 2023) * |
Also Published As
Publication number | Publication date |
---|---|
US20220012143A1 (en) | 2022-01-13 |
US11669423B2 (en) | 2023-06-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11669423B2 (en) | Systems and methods for monitoring application health in a distributed architecture | |
CN107391379B (en) | Automatic interface testing method and device | |
US10346282B2 (en) | Multi-data analysis based proactive defect detection and resolution | |
US9383900B2 (en) | Enabling real-time operational environment conformity to an enterprise model | |
Xu et al. | POD-Diagnosis: Error diagnosis of sporadic operations on cloud applications | |
US9413597B2 (en) | Method and system for providing aggregated network alarms | |
US20040167793A1 (en) | Network monitoring method for information system, operational risk evaluation method, service business performing method, and insurance business managing method | |
US8661125B2 (en) | System comprising probe runner, monitor, and responder with associated databases for multi-level monitoring of a cloud service | |
CN113312241A (en) | Abnormal alarm method, access log generation method and operation and maintenance system | |
US10984109B2 (en) | Application component auditor | |
EP4182796B1 (en) | Machine learning-based techniques for providing focus to problematic compute resources represented via a dependency graph | |
AU2018202153A1 (en) | System and method for tool chain data capture through parser for empirical data analysis | |
CN111045935A (en) | Automatic version auditing method, device, equipment and storage medium | |
CN113448795B (en) | Method, apparatus and computer program product for obtaining system diagnostic information | |
CN110347565B (en) | Application program abnormity analysis method and device and electronic equipment | |
CN115952081A (en) | Software testing method, device, storage medium and equipment | |
CN111897723A (en) | Method and device for testing application | |
US9632904B1 (en) | Alerting based on service dependencies of modeled processes | |
CN116841902A (en) | Health state checking method, device, equipment and storage medium | |
CA3086660A1 (en) | Systems and methods for monitoring application health in a distributed architecture | |
ZHANG et al. | Approach to anomaly detection in microservice system with multi-source data streams | |
CN115437961A (en) | Data processing method and device, electronic equipment and storage medium | |
WO2023022805A1 (en) | Intelligent cloud service health communication to customers | |
CN113127362A (en) | Object testing method, object testing device, electronic device, and readable storage medium | |
JP6989477B2 (en) | Repeated failure prevention device, repeated failure prevention system and repeated failure prevention method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: THE TORONTO-DOMINION BANK, CANADA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:MEERAN, AHAMED PS;BHATTACHARYA, SOMAK;SIGNING DATES FROM 20230227 TO 20230309;REEL/FRAME:063466/0702 |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: NON FINAL ACTION MAILED |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: FINAL REJECTION MAILED |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: RESPONSE AFTER FINAL ACTION FORWARDED TO EXAMINER |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: ADVISORY ACTION MAILED |