WO2019046996A1 - Java software latency anomaly detection - Google Patents
- Publication number
- WO2019046996A1 (PCT/CN2017/100457)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- application
- application method
- abnormal
- time
- profile data
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/455—Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
- G06F9/45504—Abstract machines for programme code execution, e.g. Java virtual machine [JVM], interpreters, emulators
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3003—Monitoring arrangements specially adapted to the computing system or computing system component being monitored
- G06F11/302—Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a software system
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/34—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
- G06F11/3409—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
- G06F11/3419—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment by assessing time
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/34—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
- G06F11/3452—Performance evaluation by statistical analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/34—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
- G06F11/3466—Performance evaluation by tracing or monitoring
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2201/00—Indexing scheme relating to error detection, to error correction, and to monitoring
- G06F2201/865—Monitoring of software
Definitions
- any latencies in such transactions have a range of consequences.
- small scale transactions such as a personal online shopping
- associated consequences may be limited to the individual user being annoyed from a lack of response.
- large scale transactions such as commercial, industrial, or security transactions
- the associated consequences may result in a critical commercial or security transaction failing to complete.
- many interacting devices are often updated or replaced with newer software and/or hardware, and unintended incompatibility among the devices may introduce latencies.
- Data center software developers and operators may use software method latencies to locate the sources of poor response times, where each software method, or application method, is a set of code that is referred to by its own name and can be called or invoked at any point in a program by using that name. While commercially available tools, such as the Java Development Kit (JDK), can collect application method latencies, analyzing a large amount of such data may be difficult.
- JDK Java Development Kit
- a Java method tracing capability implemented in a customized JDK product may provide diagnostic and profiling options to track the performance of Java applications via instrumentation.
- instructions may be added at the entry and exit of a compiled Java method by modifying the interpreter and compilers in a Java Virtual Machine (JVM), which allows users to track the entry time and exit time of an execution of the Java application method.
- JVM Java Virtual Machine
- the Java method tracing capability may also provide application programming interfaces (APIs) for profiling a system to specify which thread could be traced, and allow this tracing ability to be enabled and disabled on the fly on a per-thread basis.
- APIs application programming interfaces
- the Java method tracing capability may allow the user to get comprehensive information regarding a code flow in a Java application.
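The JVM-level instrumentation described above cannot be reproduced outside a customized JDK. As a loose analogy only, the following Python sketch records entry and exit times around a function and lets each thread switch tracing on or off. The names `traced`, `set_tracing`, and `TRACE_LOG` are hypothetical and do not come from the source.

```python
import threading
import time
from functools import wraps

# Per-thread switch, mirroring the idea of enabling tracing per thread.
_state = threading.local()
# Collected (thread id, method name, entry time, exit time) tuples.
TRACE_LOG = []


def set_tracing(enabled: bool) -> None:
    """Enable or disable tracing for the calling thread only."""
    _state.enabled = enabled


def traced(func):
    """Wrap func so that its entry and exit times are recorded when tracing is on."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        if not getattr(_state, "enabled", False):
            return func(*args, **kwargs)
        entry = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            exit_ = time.perf_counter()
            TRACE_LOG.append((threading.get_ident(), func.__name__, entry, exit_))
    return wrapper


@traced
def lookup(key: str) -> str:
    return key.upper()


set_tracing(True)
lookup("order-17")
```

The difference between the recorded exit and entry times is the inclusive time of that call; a real tracer would also need the call stack to derive exclusive times.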
- Java profiling such as 1) how to automatically detect anomalous performance behaviors, 2) how to automatically identify a method with abnormal high-latency behavior, 3) how to identify workload or incoming web requests affected by a long-latency behavior in a responsive and timely manner, and 4) how to prioritize identified issues.
- FIG. 1 illustrates an example environment in which a software latency anomaly may be detected.
- FIG. 2 illustrates an example flow diagram for identifying an application method having a software latency anomaly.
- FIG. 3 illustrates an example process detailing one of the blocks of FIG. 2.
- FIG. 4 illustrates an example process detailing one of the blocks of FIG. 2.
- FIG. 5 illustrates an example block diagram of a system for identifying an application method having an abnormal latency.
- Systems and methods discussed herein are directed to automatically detecting latency anomalies from a large amount of application method latency data by a combination of statistical and clustering analyses and machine learning methods.
- the systems and methods discussed herein allow users to automatically identify perceptible performance inefficiency by focusing on the latency of an application method without stepping through code, which may be used by companies running large data centers and using Java as their primary development language.
- FIG. 1 illustrates an example environment 100 in which a software latency anomaly in an application may be detected.
- Software latency may be interchangeably referred to as application latency, and a software method may be interchangeably referred to as an application method.
- a data center 102 (which may be a single computing device or a cloud computing center comprising a plurality of computing devices and/or servers) is communicatively coupled, via a network 104, to a plurality of user devices, of which four user devices 106, 108, 110, and 112 are shown.
- the data center 102 and the plurality of user devices 106, 108, 110, and 112 are also communicatively coupled to a monitoring computing device 114 via the network 104.
- the monitoring computing device 114 is configured to monitor the application running in the data center 102 for abnormal latencies of application methods in the application.
- the application running in the data center 102 that is monitored by the monitoring computing device 114 may be a Java application.
- FIG. 2 illustrates an example flow diagram 200 for identifying an application method having a software latency anomaly.
- profile data of the application may be collected.
- the profile data may comprise various parameters associated with the application, such as, but not limited to 1) a TraceID which is a unique identifier of each web request and enables a user to search across all collected logs and trace how a request is passed from one service to the next, 2) a ThreadID which is a unique numeric identifier of each application thread running in the data center 102, 3) a MethodID which is a unique identifier for each application method of the application, 4) an inclusive time of each application method, where the inclusive time is a total time interval spent by an application method and one or more called application methods associated with that application method; and 5) an exclusive time of each application method where the exclusive time is a time interval spent only by the application method itself.
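To make the distinction between inclusive and exclusive time concrete, here is a minimal sketch. It assumes a hypothetical `MethodCall` record with entry and exit timestamps and a list of directly called methods; the field names and the sample values are illustrative and are not taken from the source.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MethodCall:
    """One traced invocation of an application method (hypothetical layout)."""
    trace_id: str      # identifier of the web request
    thread_id: int     # identifier of the application thread
    method_id: str     # identifier of the application method
    entry_ms: float    # time at method entry
    exit_ms: float     # time at method exit
    children: List["MethodCall"] = field(default_factory=list)

    @property
    def inclusive_ms(self) -> float:
        # Time spent by the method and everything it called.
        return self.exit_ms - self.entry_ms

    @property
    def exclusive_ms(self) -> float:
        # Time spent by the method itself: inclusive time minus the
        # inclusive time of each directly called method.
        return self.inclusive_ms - sum(c.inclusive_ms for c in self.children)


query = MethodCall("trace-1", 7, "Dao.query", entry_ms=10.0, exit_ms=40.0)
render = MethodCall("trace-1", 7, "View.render", entry_ms=45.0, exit_ms=50.0)
handler = MethodCall("trace-1", 7, "Handler.handle", 0.0, 60.0, [query, render])
```

In this example `handler` has an inclusive time of 60 ms, of which 35 ms belongs to the two called methods, leaving an exclusive time of 25 ms.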
- a predetermined set of data may be excluded from the profile data for further evaluation.
- the predetermined set of data to be excluded may comprise 1) application methods having incomplete call stacks, which are known to occur very rarely, 2) application standard libraries, which are commonly available and used and are not unique to the running application, and 3) application methods having inclusive times that are less than a threshold time, which are considered to be behaving normally.
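The three exclusion rules can be sketched as a simple filter. The record layout, the package prefixes treated as standard libraries, and the 5 ms threshold are all assumptions chosen for illustration; the source does not specify them.

```python
from typing import Dict, List

# Assumed prefixes for standard-library methods (illustrative only).
STANDARD_LIBRARY_PREFIXES = ("java.", "javax.", "sun.")
# Assumed threshold below which a method is treated as behaving normally.
MIN_INCLUSIVE_MS = 5.0


def keep_record(record: Dict) -> bool:
    """Return True if the record should stay in the profile data."""
    if not record["stack_complete"]:
        return False  # rule 1: incomplete call stack
    if record["method_id"].startswith(STANDARD_LIBRARY_PREFIXES):
        return False  # rule 2: standard library method
    if record["inclusive_ms"] < MIN_INCLUSIVE_MS:
        return False  # rule 3: inclusive time below the threshold
    return True


def filter_profile(records: List[Dict]) -> List[Dict]:
    return [r for r in records if keep_record(r)]


records = [
    {"method_id": "com.shop.Dao.query", "stack_complete": True, "inclusive_ms": 42.0},
    {"method_id": "java.util.HashMap.get", "stack_complete": True, "inclusive_ms": 30.0},
    {"method_id": "com.shop.View.render", "stack_complete": False, "inclusive_ms": 50.0},
    {"method_id": "com.shop.Cache.get", "stack_complete": True, "inclusive_ms": 0.4},
]
remaining = filter_profile(records)
```

Only `com.shop.Dao.query` survives: the other three records each trip one of the rules.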
- the application methods included in the remaining profile data may be statistically analyzed to identify a first set of application methods having abnormal latencies, and in block 208, the application methods included in the remaining profile data may be analyzed using clustering algorithms to identify a second set of application methods. A list of application methods having abnormal latencies may then be generated in block 210 based on at least one of the first set of application methods or the second set of application methods.
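The final list is built from "at least one of" the two result sets, so one plausible reading is a union that remembers which analysis reported each method. The sketch below assumes that reading; `merge_findings` and the sample MethodIDs are hypothetical.

```python
from typing import Dict, Iterable, List


def merge_findings(statistical: Iterable[str],
                   clustering: Iterable[str]) -> Dict[str, List[str]]:
    """Union of both result sets, keyed by MethodID, with the reporting analyses."""
    report: Dict[str, List[str]] = {}
    for source, method_ids in (("statistical", statistical),
                               ("clustering", clustering)):
        for method_id in method_ids:
            sources = report.setdefault(method_id, [])
            if source not in sources:
                sources.append(source)
    return report


report = merge_findings(["Dao.query", "Cache.get"], ["Dao.query", "View.render"])
```

A method reported by both analyses, such as `Dao.query` here, could reasonably be given higher priority, although the source does not say so.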
- FIG. 3 illustrates an example process 300 detailing the statistical analysis performed in block 206 of FIG. 2.
- an application method at the bottom of each call stack (bottom application method) may be selected in order of when the call stack was created, and in block 304, all TraceIDs associated with call stacks having the same bottom application method at the bottom of the call stack may be selected.
- the selected TraceIDs may then be grouped into two groups in block 306: a normal TraceID group, being a group of normal TraceIDs where a normal TraceID has the bottom application method with an expected inclusive time, and an abnormal TraceID group, being a group of abnormal TraceIDs where an abnormal TraceID has the bottom application method with an inclusive time outside of a Tukey’s boxplot range.
- the TraceIDs with the bottom application method whose inclusive time lies outside the outer fences may be considered to be extreme outliers, where Q1 is the first quartile of the inclusive times of the bottom application method from the selected TraceIDs, Q3 is the third quartile of the inclusive times of the bottom application method from the selected TraceIDs, and IQR is the interquartile range which is equal to Q3 – Q1.
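The grouping step can be sketched with the conventional Tukey fences. The multipliers used here (1.5 × IQR for the inner fences, 3 × IQR for the outer fences) are the customary textbook values and are an assumption, since this excerpt does not state them. The quartile convention (`method="inclusive"`) is likewise a choice made for the example.

```python
from statistics import quantiles
from typing import Dict, List, Tuple


def tukey_fences(values: List[float], k: float) -> Tuple[float, float]:
    """Return (lower, upper) fences at k times the interquartile range."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def group_trace_ids(inclusive_by_trace: Dict[str, float]):
    """Split TraceIDs into normal, abnormal, and extreme-outlier lists."""
    values = list(inclusive_by_trace.values())
    inner_lo, inner_hi = tukey_fences(values, 1.5)   # assumed inner fences
    outer_lo, outer_hi = tukey_fences(values, 3.0)   # assumed outer fences
    normal, abnormal, extreme = [], [], []
    for trace_id, t in inclusive_by_trace.items():
        if inner_lo <= t <= inner_hi:
            normal.append(trace_id)
        else:
            abnormal.append(trace_id)
            if t < outer_lo or t > outer_hi:
                extreme.append(trace_id)
    return normal, abnormal, extreme


# Inclusive times (ms) of one bottom application method across ten requests.
times = dict(zip((f"t{i}" for i in range(10)),
                 [10, 11, 12, 12, 13, 13, 14, 15, 16, 90]))
normal, abnormal, extreme = group_trace_ids(times)
```

With these sample values Q1 is 12 and Q3 is 14.75, so the 90 ms request falls outside both sets of fences and is the only abnormal TraceID.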
- a ratio of an inclusive time of an application method from the abnormal TraceID group to an inclusive time of the same application method in the same call stack from the normal TraceID group may be calculated.
- the call stacks with the same stack depths or the same first number may be considered to be the same. If the ratio were higher than a threshold value, an indicator for a MethodID associated with the application method used to calculate the ratio would be set to TRUE, otherwise the indicator for the associated MethodID would be set to FALSE indicating that it is normal.
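A minimal sketch of the ratio test follows. It assumes each call stack is summarized as a mapping from MethodID to inclusive time, and it uses the median over the normal TraceID group as the reference value. Both the summary format and the use of a median are assumptions, as is the threshold of 2.0.

```python
from statistics import median
from typing import Dict, List


def flag_methods(abnormal_stack: Dict[str, float],
                 normal_stacks: List[Dict[str, float]],
                 threshold: float = 2.0) -> Dict[str, bool]:
    """Set an indicator per MethodID: True if abnormal/normal ratio exceeds threshold."""
    flags: Dict[str, bool] = {}
    for method_id, t_abnormal in abnormal_stack.items():
        baseline = [s[method_id] for s in normal_stacks if method_id in s]
        if not baseline:
            continue  # no comparable call in the normal group
        reference = median(baseline)
        flags[method_id] = reference > 0 and (t_abnormal / reference) > threshold
    return flags


normal_stacks = [
    {"Handler.handle": 20.0, "Dao.query": 10.0, "View.render": 5.0},
    {"Handler.handle": 22.0, "Dao.query": 12.0, "View.render": 5.0},
    {"Handler.handle": 21.0, "Dao.query": 11.0, "View.render": 6.0},
]
abnormal_stack = {"Handler.handle": 100.0, "Dao.query": 90.0, "View.render": 5.0}
flags = flag_methods(abnormal_stack, normal_stacks)
```

Here `Handler.handle` and `Dao.query` are flagged TRUE because their inclusive times are several times the normal reference, while `View.render` stays FALSE.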
- a critical MethodID may be identified by selecting a top most application method where a corresponding call stack is deviating significantly from the normal behavior and changing from normal to abnormal.
- the process may repeat from block 304 if there is another application method to be evaluated. If there are no more application methods to be processed, then in block 314, critical MethodIDs may be ranked based on a frequency of occurrence in different TraceIDs and a deviation of an inclusive time of a corresponding application method from a median inclusive time of the normal TraceID group.
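The ranking criteria are named but not combined into a formula in this excerpt. One simple interpretation, shown below as an assumption, sorts first by the number of distinct TraceIDs in which a critical MethodID appears and then by its largest deviation from the normal median. The function name and sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rank_critical(findings: List[Tuple[str, str, float]],
                  normal_median: Dict[str, float]) -> List[str]:
    """Rank MethodIDs by distinct-TraceID count, then by largest deviation.

    findings holds (method_id, trace_id, inclusive_ms) for critical methods.
    """
    traces = defaultdict(set)
    worst_deviation = defaultdict(float)
    for method_id, trace_id, inclusive_ms in findings:
        traces[method_id].add(trace_id)
        deviation = abs(inclusive_ms - normal_median[method_id])
        worst_deviation[method_id] = max(worst_deviation[method_id], deviation)
    return sorted(traces,
                  key=lambda m: (-len(traces[m]), -worst_deviation[m]))


findings = [
    ("Dao.query", "t1", 90.0),
    ("Dao.query", "t2", 80.0),
    ("Cache.get", "t3", 300.0),
    ("View.render", "t4", 40.0),
    ("View.render", "t5", 45.0),
]
normal_median = {"Dao.query": 11.0, "Cache.get": 2.0, "View.render": 5.0}
ranking = rank_critical(findings, normal_median)
```

A weighted score would be an equally valid design; the lexicographic sort is used here only because it needs no tuning constants.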
- FIG. 4 illustrates an example process 400 detailing analysis using clustering algorithms performed in block 208 of FIG. 2.
- an application method based on an occurrence frequency of the application method in a log file of the application may be selected.
- the profile data associated with the selected application method, such as the associated TraceID and the associated inclusive and exclusive times, may be provided to a hierarchical clustering with mean linkage clustering algorithm (hierarchical clustering algorithm).
- an outlier may be identified in block 406 based on a small or sparse cluster and a median exclusive time or median inclusive time being higher than a predetermined percentile of an overall population of the selected application methods.
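The following sketch shows one way to apply average-linkage (mean-linkage) agglomerative clustering to (inclusive, exclusive) time pairs and then flag small, slow clusters. It is a naive pure-Python version written for clarity, not performance. The stopping distance, the maximum cluster size, and the percentile are assumed parameters that the source leaves open.

```python
from itertools import combinations
from math import dist
from statistics import median, quantiles


def average_linkage_clusters(points, max_distance):
    """Merge the two closest clusters until the mean pairwise distance exceeds max_distance."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Mean distance over all cross-cluster point pairs.
        total = sum(dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        x, y = min(combinations(range(len(clusters)), 2),
                   key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]))
        if linkage(clusters[x], clusters[y]) > max_distance:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # safe: x < y, so index x is unaffected
    return clusters


def sparse_slow_clusters(points, clusters, max_size, percentile):
    """Flag clusters that are small and whose median time exceeds a population percentile."""
    incl_cut = quantiles([p[0] for p in points], n=100, method="inclusive")[percentile - 1]
    excl_cut = quantiles([p[1] for p in points], n=100, method="inclusive")[percentile - 1]
    outliers = []
    for members in clusters:
        med_incl = median(points[i][0] for i in members)
        med_excl = median(points[i][1] for i in members)
        if len(members) <= max_size and (med_incl > incl_cut or med_excl > excl_cut):
            outliers.append(sorted(members))
    return outliers


# (inclusive_ms, exclusive_ms) per invocation of one selected method.
points = [(10, 5), (11, 5), (12, 6), (10, 4), (11, 6), (13, 6), (12, 5),
          (95, 80), (97, 82)]
clusters = average_linkage_clusters(points, max_distance=10.0)
outliers = sparse_slow_clusters(points, clusters, max_size=2, percentile=75)
```

The seven fast invocations merge into one cluster and the two slow ones into another; only the latter is both small and above the 75th percentile, so it is reported as the outlier.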
- the outlier may be identified using a machine-learning-based anomaly detection process.
- the profile data associated with the selected application method may be provided to a Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm, and in block 410, based on results from the DBSCAN algorithm, an outlier having a cluster with a low density may be identified.
- DBSCAN Density-Based Spatial Clustering of Applications with Noise
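For illustration, here is a compact pure-Python version of the standard DBSCAN procedure applied to (inclusive, exclusive) time pairs. The `eps` and `min_pts` values are assumptions picked for the sample data, and treating DBSCAN noise points as the low-density outliers is one interpretation of the text above.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster index, or NOISE for low-density points."""
    labels = [None] * len(points)

    def neighbors(i):
        # All points within eps of point i, including i itself.
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point, keep expanding
    return labels


points = [(10, 5), (11, 5), (12, 6), (10, 4), (11, 6), (13, 6), (12, 5),
          (95, 80), (97, 82)]
labels = dbscan(points, eps=4.0, min_pts=3)
low_density = [i for i, label in enumerate(labels) if label == NOISE]
```

The two slow invocations have only each other as neighbors, which is below `min_pts`, so they are labeled as noise while the seven fast invocations form a single dense cluster.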
- MethodIDs associated with the clusters identified as outliers and associated TraceIDs may be ranked based on a frequency of occurrence of the associated TraceIDs, and based on a Euclidean distance of an inclusive time and an exclusive time of an application method associated with the identified MethodID.
- the Euclidean distance of the inclusive time and the exclusive time of the application method associated with the identified MethodID may be calculated as:
- $$\mathrm{distance}(t_{incl}, t_{excl}) = \sqrt{\left(t_{incl} - \mathrm{median}(t_{incl})\right)^{2} + \left(t_{excl} - \mathrm{median}(t_{excl})\right)^{2}}$$
- where $t_{incl}$ is the inclusive time of the application method associated with the identified MethodID
- $\mathrm{median}(t_{incl})$ is a median of inclusive times of the application method in an overall population of the application methods
- $t_{excl}$ is the exclusive time of the application method associated with the identified MethodID
- $\mathrm{median}(t_{excl})$ is a median of exclusive times of the application method in an overall population of the application methods.
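As a worked example of the distance used for ranking, the sketch below measures how far one invocation's (inclusive, exclusive) pair lies from the population medians. The exact form of the formula is an assumption inferred from the symbol definitions above, because the equation itself is missing from this excerpt; the sample values are invented.

```python
from math import hypot
from statistics import median
from typing import Sequence


def distance_from_median(t_incl: float, t_excl: float,
                         all_incl: Sequence[float],
                         all_excl: Sequence[float]) -> float:
    """Euclidean distance of (t_incl, t_excl) from the population medians."""
    return hypot(t_incl - median(all_incl), t_excl - median(all_excl))


# Population of inclusive and exclusive times (ms); medians are 12 and 5.
all_incl = [10, 11, 12, 13, 95]
all_excl = [4, 5, 5, 6, 80]

typical = distance_from_median(12, 5, all_incl, all_excl)
mild = distance_from_median(15, 9, all_incl, all_excl)
severe = distance_from_median(95, 80, all_incl, all_excl)
```

An invocation at the medians scores 0, one that is 3 ms and 4 ms above them scores 5, and the 95 ms invocation scores far higher, so sorting by this value puts the most deviant methods first.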
- FIG. 5 illustrates an example block diagram of a system 502 for identifying an application method having an abnormal latency.
- the system 502 may be, or may reside in, the monitoring computing device 114.
- the system 502 may comprise one or more processors 504 and memory 506 coupled to the one or more processors 504.
- the memory 506 may comprise various modules communicatively coupled to each other which are executable by the one or more processors 504.
- the system 502 may be communicatively coupled to the data center 102 and the user devices 106, 108, 110, and 112 via the network 104.
- the memory 506 may comprise various modules that are communicatively coupled to each other.
- the various modules may comprise a profile data module 508, a statistical analysis module 510, a clustering algorithm module 512, and a list generator module 514. As discussed above with reference to FIG.
- the profile data module 508 may be configured to collect profile data of an application, such as a Java application, running in the data center 102, where the profile data may comprise 1) a TraceID which is a unique identifier of each web request and enables a user to search across all collected logs and trace how a request is passed from one microservice to the next, 2) a ThreadID which is a unique numeric identifier of each application thread running in the data center 102, 3) a MethodID which is a unique identifier for each application method of the application, 4) an inclusive time of each application method, where the inclusive time is a total time interval spent by an application method and one or more called application methods associated with the application method; and 5) an exclusive time of each application method where the exclusive time is a time interval spent only by the application method itself.
- the profile data module 508 may further be configured to exclude from the profile data, at least one of application methods having incomplete call stacks, application standard libraries, or application methods having inclusive times that are less than a threshold time.
- the statistical analysis module 510 may be configured to statistically analyze application methods included in remaining profile data to identify a first set of application methods having abnormal latencies.
- the statistical analysis module 510 may select an application method at the bottom of each call stack (bottom application method) in order of when the call stack was created, and may select all TraceIDs associated with call stacks having the same bottom application method at the bottom of the call stack.
- the statistical analysis module 510 may group the selected TraceIDs into two groups: a normal TraceID group, being a group of normal TraceIDs where a normal TraceID has the bottom application method with an expected inclusive time, and an abnormal TraceID group, being a group of abnormal TraceIDs where an abnormal TraceID has the bottom application method with an inclusive time outside of a Tukey’s boxplot range.
- the TraceIDs with the bottom application method whose inclusive time lies outside the outer fences may be considered to be extreme outliers, where Q1 is the first quartile of the inclusive times of the bottom application method from the selected TraceIDs, Q3 is the third quartile of the inclusive times of the bottom application method from the selected TraceIDs, and IQR is the interquartile range which is equal to Q3 – Q1.
- the statistical analysis module 510 may then calculate a ratio of an inclusive time of an application method from the abnormal TraceID group to an inclusive time of the same application method in the same call stack from the normal TraceID group.
- the call stacks with the same stack depths or the same first number may be considered to be the same. If the ratio were higher than a threshold value, an indicator for a MethodID associated with the application method used to calculate the ratio would be set to TRUE, otherwise the indicator for the associated MethodID would be set to FALSE indicating that it is normal.
- the statistical analysis module 510 may then identify a critical MethodID by selecting a top most application method where a corresponding call stack is deviating significantly from the normal behavior and changing from normal to abnormal.
- the statistical analysis module 510 may repeat the process above until all application methods are evaluated, and may rank critical MethodIDs based on a frequency of occurrence in different TraceIDs and a deviation of an inclusive time of a corresponding application method from a median inclusive time of the normal TraceID group.
- the clustering algorithm module 512 may select an application method based on an occurrence frequency of the application method in a log file of the application.
- the profile data associated with the selected application method, such as the associated TraceID and the associated inclusive and exclusive times, are provided to the hierarchical clustering with mean linkage clustering algorithm (hierarchical clustering algorithm) of the clustering algorithm module 512.
- the clustering algorithm module 512 may then identify an outlier based on a small or sparse cluster and a median exclusive time or median inclusive time being higher than a predetermined percentile of an overall population of the selected application methods.
- the outlier may be identified using a machine-learning-based anomaly detection process.
- the profile data associated with the selected application method may also be provided to a Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm of the clustering algorithm module 512. Based on results from the DBSCAN algorithm, the clustering algorithm module 512 may identify an outlier having a cluster with a low density.
- the clustering algorithm module 512 may repeat the process above until all application methods are evaluated, and may rank MethodIDs associated with the clusters identified as outliers and associated TraceIDs based on a frequency of occurrence of the associated TraceIDs, and based on a Euclidean distance of an inclusive time and an exclusive time of an application method associated with the identified MethodID.
- the Euclidean distance of the inclusive time and the exclusive time of the application method associated with the identified MethodID may be calculated as:
- $$\mathrm{distance}(t_{incl}, t_{excl}) = \sqrt{\left(t_{incl} - \mathrm{median}(t_{incl})\right)^{2} + \left(t_{excl} - \mathrm{median}(t_{excl})\right)^{2}}$$
- where $t_{incl}$ is the inclusive time of the application method associated with the identified MethodID
- $\mathrm{median}(t_{incl})$ is a median of inclusive times of the application method in an overall population of the application methods
- $t_{excl}$ is the exclusive time of the application method associated with the identified MethodID
- $\mathrm{median}(t_{excl})$ is a median of exclusive times of the application method in an overall population of the application methods.
- Computer-readable instructions include routines, applications, application modules, program modules, programs, components, data structures, algorithms, and the like.
- Computer-readable instructions can be implemented on various system configurations, including single-processor or multiprocessor systems, minicomputers, mainframe computers, personal computers, hand-held computing devices, microprocessor-based or programmable consumer electronics, combinations thereof, and the like.
- the computer-readable storage media may include volatile memory (such as random access memory (RAM)) and/or non-volatile memory (such as read-only memory (ROM), flash memory, etc.).
- the computer-readable storage media may also include additional removable storage and/or non-removable storage including, but not limited to, flash memory, magnetic storage, optical storage, and/or tape storage that may provide non-volatile storage of computer-readable instructions, data structures, program modules, and the like.
- a non-transient computer-readable storage medium is an example of computer-readable media.
- Computer-readable media includes at least two types of computer-readable media, namely computer-readable storage media and communications media.
- Computer-readable storage media includes volatile and non-volatile, removable and non-removable media implemented in any process or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data.
- Computer-readable storage media includes, but is not limited to, phase change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disk read-only memory (CD-ROM), digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information for access by a computing device.
- communication media may embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave, or other transmission mechanism. As defined herein, computer-readable storage media do not include communication media.
- the computer-readable instructions stored on one or more non-transitory computer-readable storage media that, when executed by one or more processors, may perform operations described above with reference to FIGs. 2-5.
- computer-readable instructions include routines, programs, objects, components, data structures, and the like that perform particular functions or implement particular abstract data types.
- the order in which the operations are described is not intended to be construed as a limitation, and any number of the described operations can be combined in any order and/or in parallel to implement the processes.
- a method for identifying an application method, having an abnormal latency, of an application running in a data center comprising: collecting profile data of the application; excluding a predetermined set of data from the profile data; statistically analyzing application methods included in remaining profile data to identify a first set of application methods having abnormal latencies; analyzing the application methods included in the remaining profile data using clustering algorithms to identify a second set of application methods; and generating a list of application methods having abnormal latencies based on at least one of the first set of application methods or the second set of application methods.
- a method as paragraph A recites, wherein the application method is a Java application method and the application running in the data center is a Java application.
- a method as paragraph A recites, wherein the data center comprises one or more computing devices, one or more servers, or a cloud computing center.
- the profile data comprises: a TraceID being a unique identifier of each web request; a ThreadID being a unique numeric identifier of each thread running in the data center; a MethodID being a unique identifier for each application method; an inclusive time of each application method, the inclusive time being a total time interval spent by an application method and one or more called application methods associated with the application method; and an exclusive time of each application method, the exclusive time being a time interval spent only by each application method.
- a method as paragraph D recites, wherein the predetermined set of data excluded from the profile data comprises at least one of: application methods having incomplete call stacks; application standard libraries; or application methods having inclusive times that are less than a threshold time.
- a method as paragraph E recites, wherein statistically analyzing the application methods included in the remaining profile data to identify the first set of application methods having abnormal latencies comprises: selecting a bottom application method at a bottom of each call stack in order of when a call stack is created; selecting all TraceIDs associated with call stacks having the bottom application method at a bottom of a call stack; grouping the selected TraceIDs into a normal TraceID group and an abnormal TraceID group, the normal TraceID group being a group of normal TraceIDs where a normal TraceID has the bottom application method with an expected inclusive time, and the abnormal TraceID group being a group of abnormal TraceIDs where an abnormal TraceID has the bottom application method with an inclusive time outside of a Tukey’s boxplot range; for all application methods in the remaining profile data: calculating a ratio of an inclusive time of an application method from the abnormal TraceID group and an inclusive time of the application method in a same call stack from the normal TraceID group, and identifying a critical Method
- analyzing the application methods included in the remaining profile data using clustering algorithms to identify a second set of application methods comprises: selecting an application method based on an occurrence frequency of the application method in a log file of the application; providing profile data associated with the selected application method to a hierarchical clustering with mean linkage clustering algorithm (hierarchical clustering algorithm); based on results from the hierarchical clustering algorithm, identifying, as an outlier, a cluster having a low number of records and a median exclusive time or median inclusive time being higher than a predetermined percentile of an overall population of the selected application methods; providing profile data associated with the selected application method to a Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm; based on results from the DBSCAN algorithm, identifying, as an outlier, a cluster with a low density; and ranking MethodIDs associated with the clusters identified as outliers and associated TraceIDs based on a frequency of occurrence of the associated TraceIDs and a Euclidean
- a method as paragraph G recites, wherein the Euclidean distance of the inclusive time and the exclusive time of the application method associated with the identified MethodID is: $$\mathrm{distance}(t_{incl}, t_{excl}) = \sqrt{\left(t_{incl} - \mathrm{median}(t_{incl})\right)^{2} + \left(t_{excl} - \mathrm{median}(t_{excl})\right)^{2}}$$ where $\mathrm{distance}(t_{incl}, t_{excl})$ is the Euclidean distance, $t_{incl}$ is the inclusive time of the application method associated with the identified MethodID, $\mathrm{median}(t_{incl})$ is a median of inclusive times of the application method in an overall population of the application methods, $t_{excl}$ is the exclusive time of the application method associated with the identified MethodID, and $\mathrm{median}(t_{excl})$ is a median of exclusive times of the application method in an overall population of the application methods.
- One or more non-transitory computer-readable storage media storing computer-readable instructions executable by one or more processors, that when executed by the one or more processors, cause the one or more processors to perform operations comprising: collecting profile data of an application running in a data center; excluding a predetermined set of data from the profile data; statistically analyzing application methods included in remaining profile data to identify a first set of application methods having abnormal latencies; analyzing the application methods included in the remaining profile data using clustering algorithms to identify a second set of application methods; and generating a list of application methods having abnormal latencies based on at least one of the first set of application methods or the second set of application methods.
- One or more non-transitory computer-readable storage media as paragraph I recites, wherein the profile data comprises: a TraceID being a unique identifier of each web request; a ThreadID being a unique numeric identifier of each thread running in the data center; a MethodID being a unique identifier for each application method; an inclusive time of each application method, the inclusive time being a total time interval spent by an application method and one or more called application methods associated with the application method; and an exclusive time of each application method, the exclusive time being a time interval spent only by each application method.
- the predetermined set of data excluded from the profile data comprises at least one of: application methods having incomplete call stacks; application standard libraries; or application methods having inclusive times that are less than a threshold time.
- analyzing the application methods included in the remaining profile data using clustering algorithms to identify a second set of application methods comprises: selecting an application method based on an occurrence frequency of the application method in a log file of the application; providing profile data associated with the selected application method to a hierarchical clustering with mean linkage clustering algorithm (hierarchical clustering algorithm) ; based on results from the hierarchical clustering algorithm, identifying, as an outlier, a cluster having a low number of records and a median exclusive time or median inclusive time being higher than a predetermined percentile of an overall population of the application methods; providing profile data associated the selected application method to a Density Based Clustering of Applications with Noise (DBSCAN) algorithm; based on results from the DBSCAN algorithm, identifying, as an outlier, a cluster with a low density; and ranking MethodIDs associated with the clusters identified as outliers and associated TraceIDs based on a frequency of occurrence of the associated
- DBSCAN: Density-Based Spatial Clustering of Applications with Noise
- the Euclidean distance of the inclusive time and the exclusive time of the application method associated with the identified MethodID is:
$$\text{distance}(t_{incl}, t_{excl}) = \sqrt{\left(t_{incl} - \text{median}(t_{incl})\right)^2 + \left(t_{excl} - \text{median}(t_{excl})\right)^2}$$
wherein: distance(t_incl, t_excl) is the Euclidean distance, t_incl is the inclusive time of the application method associated with the identified MethodID, median(t_incl) is a median of inclusive times of the application method in an overall population of the application methods, t_excl is the exclusive time of the application method associated with the identified MethodID, and median(t_excl) is a median of exclusive times of the application method in an overall population of the application methods.
- a system for identifying an application method having an abnormal latency comprising: one or more processors; and memory coupled to the one or more processors, the memory including modules communicatively coupled to each other and executable by the one or more processors, the modules comprising: a profile data module configured to collect profile data of an application running in a data center and to exclude, from the profile data, at least one of: application methods having incomplete call stacks, application standard libraries, or application methods having inclusive times that are less than a threshold time; a statistical analysis module configured to statistically analyze application methods included in remaining profile data to identify a first set of application methods having abnormal latencies; a clustering algorithm module configured to analyze the application methods included in the remaining profile data using clustering algorithms to identify a second set of application methods; and a list generator module configured to generate a list of application methods having abnormal latencies based on at least one of the first set of application methods or the second set of application methods.
- the statistical analysis module is further configured to: select a bottom application method at a bottom of each call stack in order of when a call stack is created; select all TraceIDs associated with call stacks having the bottom application method at a bottom of a call stack; group the selected TraceIDs into a normal TraceID group and an abnormal TraceID group, the normal TraceID group being a group of normal TraceIds where a normal TraceID has the bottom application method with an expected inclusive time, and the abnormal TraceID group being a group of abnormal TraceIds where an abnormal Trace ID has the bottom application method with an inclusive time outside of a Tukey’s boxplot range; for all application methods in the remaining profile data: calculate a ratio of an inclusive time of an application method from the abnormal TraceID group and an inclusive time of the application method in a same call stack from the normal TraceID group, and identify a critical MethodID by selecting a top most application method where a corresponding call stack is changing from normal to abnormal
- the clustering algorithm module is further configured to: select an application method based on an occurrence frequency of the application method in a log file of the application; provide profile data associated with the selected application method to a hierarchical clustering with mean linkage clustering algorithm (hierarchical clustering algorithm); based on results from the hierarchical clustering algorithm, identify, as an outlier, a cluster having a low number of records and a median exclusive time or median inclusive time being higher than a predetermined percentile of an overall population of the application methods; provide profile data associated with the selected application method to a Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm; based on results from the DBSCAN algorithm, identify, as an outlier, a cluster with a low density; and rank MethodIDs associated with the clusters identified as outliers and associated TraceIDs based on a frequency of occurrence of the associated TraceIDs and a Euclidean distance of an inclusive time and an exclusive time of an application method associated with the identified Method
- T. A system as paragraph S recites, wherein the Euclidean distance of the inclusive time and the exclusive time of the application method associated with the identified MethodID is:
$$\text{distance}(t_{incl}, t_{excl}) = \sqrt{\left(t_{incl} - \text{median}(t_{incl})\right)^2 + \left(t_{excl} - \text{median}(t_{excl})\right)^2}$$
wherein: distance(t_incl, t_excl) is the Euclidean distance, t_incl is the inclusive time of the application method associated with the identified MethodID, median(t_incl) is a median of inclusive times of the application method in an overall population of the application methods, t_excl is the exclusive time of the application method associated with the identified MethodID, and median(t_excl) is a median of exclusive times of the application method in an overall population of the application methods.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- Software Systems (AREA)
- Computer Hardware Design (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Probability & Statistics with Applications (AREA)
- Debugging And Monitoring (AREA)
Abstract
Systems and methods provided herein are directed to automatically detecting latency anomaly from a large amount of application method latency data by a combination of statistical and clustering analyses and machine learning methods. Profile data of an application running in a computing device is collected and a predetermined set of data is excluded from the profile data. Application methods included in remaining profile data are analyzed statistically to identify a first set of application methods having abnormal latencies, and analyzed using clustering algorithms to identify a second set of application methods. A list of application methods having abnormal latencies based on at least one of the first set of application methods or the second set of application methods is then generated.
Description
As the use of computers and the software running on them continues to grow in everyday transactions, latencies in such transactions have a range of consequences. For small-scale transactions, such as personal online shopping, the consequences may be limited to an individual user being annoyed by a lack of response. For large-scale transactions, such as commercial, industrial, or security transactions, the consequences may include a critical commercial or security transaction failing to complete. Additionally, many interacting devices are often updated or replaced with newer software and/or hardware, and unintended incompatibility among the devices may introduce latencies.
Data center software developers and operators may use software method latencies to find the sources of poor response times, where each software method, or application method, is a set of code that is referred to by its own name and can be called or invoked at any point in a program by using that name. While commercially available tools, such as the Java Development Kit (JDK), can collect application method latencies, analyzing a large amount of such data may be difficult.
A Java method tracing capability implemented in a customized JDK product may provide diagnostic and profiling options to track the performance of Java applications via instrumentation. To minimize the overhead incurred by instrumentation, instructions may be added at the entry and exit of a compiled Java method by modifying the interpreter and compilers in a Java Virtual Machine (JVM), which allows users to track the entry time and exit time of an execution of the Java application method. The Java method tracing capability may also provide application programming interfaces (APIs) that let a profiling system specify which threads are traced, and that allow tracing to be enabled and disabled on the fly on a per-thread basis. The Java method tracing capability may allow the user to get comprehensive information regarding a code flow in a Java application.
However, there are challenges, or additional desired capabilities, regarding Java profiling, such as 1) how to automatically detect anomalous performance behaviors, 2) how to automatically identify a method with abnormal high-latency behavior, 3) how to identify the workload or incoming web requests affected by a long-latency behavior in a responsive and timely manner, and 4) how to prioritize identified issues.
The detailed description is set forth with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference numbers in different figures indicates similar or identical items or features.
FIG. 1 illustrates an example environment in which a software latency anomaly may be detected.
FIG. 2 illustrates an example flow diagram for identifying an application method having a software latency anomaly.
FIG. 3 illustrates an example process detailing one of the blocks of FIG. 2.
FIG. 4 illustrates an example process detailing one of the blocks of FIG. 2.
FIG. 5 illustrates an example block diagram of a system for identifying an application method having an abnormal latency.
Systems and methods discussed herein are directed to automatically detecting latency anomalies in a large amount of application method latency data by a combination of statistical and clustering analyses and machine learning methods.
The systems and methods discussed herein allow users to automatically identify perceptible performance inefficiency by focusing on the latency of an application method without stepping through the code, which may be of use to companies running large data centers and using Java as their primary development language.
FIG. 1 illustrates an example environment 100 in which a software latency anomaly in an application may be detected. Software latency may be referred to interchangeably as application latency, and a software method may be referred to interchangeably as an application method. A data center 102 (which may be a single computing device or a cloud computing center comprising a plurality of computing devices and/or servers) is communicatively coupled, via a network 104, to a plurality of user devices, of which four user devices 106, 108, 110, and 112 are shown. The data center 102 and the plurality of user devices 106, 108, 110, and 112 are also communicatively coupled to a monitoring computing device 114 via the network 104. The monitoring computing device 114 is configured to monitor the application running in the data center 102 for abnormal latencies of application methods in the application. The application running in the data center 102 that is monitored by the monitoring computing device 114 may be a Java application.
FIG. 2 illustrates an example flow diagram 200 for identifying an application method having a software latency anomaly.
In block 202, profile data of the application, such as a Java application running in the data center 102, may be collected. The profile data may comprise various parameters associated with the application, such as, but not limited to, 1) a TraceID, which is a unique identifier of each web request and enables a user to search across all collected logs and trace how a request is passed from one service to the next, 2) a ThreadID, which is a unique numeric identifier of each application thread running in the data center 102, 3) a MethodID, which is a unique identifier for each application method of the application, 4) an inclusive time of each application method, where the inclusive time is a total time interval spent by an application method and one or more called application methods associated with that application method, and 5) an exclusive time of each application method, where the exclusive time is a time interval spent only by the application method itself.
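As an illustration only, one record of the profile data described above could be modeled as follows. The class name, field names, and the millisecond unit are assumptions made for this sketch; the source does not specify a storage format. The helper reflects the two definitions given above: if the inclusive time covers a method and its callees, the exclusive time is what remains after subtracting the inclusive times of the directly called methods.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProfileRecord:
    """One profiled method invocation (hypothetical layout, not from the source)."""
    trace_id: str        # TraceID: unique identifier of the web request
    thread_id: int       # ThreadID: unique numeric identifier of the thread
    method_id: str       # MethodID: unique identifier of the application method
    inclusive_ms: float  # time spent in the method and the methods it calls
    exclusive_ms: float  # time spent only in the method itself


def exclusive_from_callees(inclusive_ms, callee_inclusive_ms):
    """Derive exclusive time from a method's inclusive time and the
    inclusive times of the methods it calls directly."""
    return inclusive_ms - sum(callee_inclusive_ms)


record = ProfileRecord(
    trace_id="t1",
    thread_id=7,
    method_id="com.example.OrderService.submit",  # made-up method name
    inclusive_ms=100.0,
    exclusive_ms=exclusive_from_callees(100.0, [30.0, 20.0]),
)
```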
In block 204, a predetermined set of data may be excluded from the profile data before further evaluation. The predetermined set of data to be excluded may comprise 1) application methods having incomplete call stacks, which are known to occur very rarely, 2) application standard libraries, which are commonly available and used and are not unique to the application running, and 3) application methods having inclusive times that are less than a threshold time, which are considered to be behaving normally.
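The three exclusion rules of block 204 can be sketched as a simple filter. The dictionary keys, the package prefixes used to recognize standard libraries, the way incomplete call stacks are flagged (as a set of TraceIDs), and the 1 ms default threshold are all assumptions for illustration; the source names the rules but not their parameters.

```python
# Assumption: standard-library methods are recognized by package prefix.
STANDARD_LIBRARY_PREFIXES = ("java.", "javax.", "sun.")


def keep_record(record, incomplete_trace_ids, min_inclusive_ms):
    """Return True if the record survives the three exclusion rules."""
    if record["trace_id"] in incomplete_trace_ids:
        return False  # rule 1: call stack is incomplete
    if record["method_id"].startswith(STANDARD_LIBRARY_PREFIXES):
        return False  # rule 2: standard library, not unique to the application
    if record["inclusive_ms"] < min_inclusive_ms:
        return False  # rule 3: fast enough to be considered normal
    return True


def filter_profile(records, incomplete_trace_ids, min_inclusive_ms=1.0):
    """Return the remaining profile data after exclusion."""
    return [
        r for r in records
        if keep_record(r, incomplete_trace_ids, min_inclusive_ms)
    ]


records = [
    {"trace_id": "t1", "method_id": "com.example.Cart.total", "inclusive_ms": 42.0},
    {"trace_id": "t1", "method_id": "java.util.HashMap.get", "inclusive_ms": 5.0},
    {"trace_id": "t2", "method_id": "com.example.Cart.total", "inclusive_ms": 40.0},
    {"trace_id": "t3", "method_id": "com.example.Cart.size", "inclusive_ms": 0.2},
]
remaining = filter_profile(records, incomplete_trace_ids={"t2"})
```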
In block 206, the application methods included in the remaining profile data may be statistically analyzed to identify a first set of application methods having abnormal latencies, and in block 208, the application methods included in the remaining profile data may be analyzed using clustering algorithms to identify a second set of application methods. A list of application methods having abnormal latencies may then be generated in block 210 based on at least one of the first set of application methods or the second set of application methods.
FIG. 3 illustrates an example process 300 detailing the statistical analysis performed in block 206 of FIG. 2.
In block 302, an application method at the bottom of each call stack (bottom application method) may be selected in order of when the call stack was created, and in block 304, all TraceIDs associated with call stacks having the same bottom application method at the bottom of the call stack may be selected. The selected TraceIDs may then be grouped into two groups in block 306: a normal TraceID group, in which each TraceID has the bottom application method with an expected inclusive time, and an abnormal TraceID group, in which each TraceID has the bottom application method with an inclusive time outside of a Tukey’s boxplot range. The TraceIDs with the bottom application method whose inclusive time lies outside the outer fences (Q1 − 3 × IQR, Q3 + 3 × IQR) may be considered to be extreme outliers, where Q1 is the first quartile of the inclusive times of the bottom application method from the selected TraceIDs, Q3 is the third quartile of the inclusive times of the bottom application method from the selected TraceIDs, and IQR is the interquartile range, which is equal to Q3 − Q1.
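The grouping in block 306 can be sketched with Tukey's outer fences. The function names and the input shape (a mapping from TraceID to the inclusive time of the bottom application method) are assumptions for this sketch, and the quartiles are computed with the default method of Python's `statistics.quantiles`, which may differ slightly from other quartile conventions.

```python
import statistics


def outer_fences(values):
    """Return Tukey's outer fences (Q1 - 3*IQR, Q3 + 3*IQR)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - 3 * iqr, q3 + 3 * iqr


def split_traces(inclusive_by_trace):
    """Split TraceIDs into (normal, abnormal) sets by the outer fences.

    inclusive_by_trace maps each TraceID to the inclusive time of the
    bottom application method in that trace (hypothetical input shape).
    """
    low, high = outer_fences(list(inclusive_by_trace.values()))
    normal = {t for t, v in inclusive_by_trace.items() if low <= v <= high}
    abnormal = set(inclusive_by_trace) - normal
    return normal, abnormal


inclusive_by_trace = {
    "t1": 10, "t2": 11, "t3": 12, "t4": 10,
    "t5": 11, "t6": 12, "t7": 11, "t8": 10,
    "t9": 500,  # far above the upper outer fence
}
normal, abnormal = split_traces(inclusive_by_trace)
```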
In block 308, a ratio of an inclusive time of an application method from the abnormal TraceID group to an inclusive time of the same application method in the same call stack from the normal TraceID group may be calculated. The call stacks with the same stack depths or the same first number may be considered to be the same. If the ratio is higher than a threshold value, an indicator for the MethodID associated with the application method used to calculate the ratio may be set to TRUE; otherwise, the indicator for the associated MethodID may be set to FALSE, indicating that the method is normal.
In block 310, a critical MethodID may be identified by selecting the topmost application method at which the corresponding call stack deviates significantly from normal behavior, changing from normal to abnormal. In block 312, the process may repeat from block 304 if there is another application method to be evaluated. If there are no more application methods to be processed, then in block 314, critical MethodIDs may be ranked based on a frequency of occurrence in different TraceIDs and a deviation of an inclusive time of a corresponding application method from a median inclusive time of the normal TraceID group.
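A minimal sketch of the ratio test in block 308 follows. The source does not say how inclusive times are aggregated across the TraceIDs in each group, so this sketch assumes the median per MethodID; the threshold of 2.0 and all names are likewise hypothetical.

```python
import statistics


def flag_methods(abnormal_times, normal_times, threshold=2.0):
    """Return {MethodID: indicator}, TRUE when abnormal/normal exceeds threshold.

    Both arguments map a MethodID to the inclusive times observed for that
    method in the abnormal or normal TraceID group (assumed input shape).
    """
    flags = {}
    for method_id, abnormal in abnormal_times.items():
        normal = normal_times.get(method_id)
        if not normal:
            continue  # no baseline to compare against
        baseline = statistics.median(normal)
        if baseline <= 0:
            continue  # avoid dividing by a zero baseline
        ratio = statistics.median(abnormal) / baseline
        flags[method_id] = ratio > threshold
    return flags


flags = flag_methods(
    abnormal_times={"A": [100, 120], "B": [10, 11]},
    normal_times={"A": [10, 12], "B": [10, 10]},
)
```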
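One possible reading of blocks 310 and 314 is sketched below. It assumes each call stack is listed from its bottom application method upward and that the critical method is the last flagged method along that path, since a slow callee inflates the inclusive time of every caller beneath it. Sorting by frequency first and deviation second is also an assumption; the source names both criteria without saying how they are combined.

```python
from collections import Counter


def critical_method(stack_bottom_to_top, flags):
    """Return the topmost flagged MethodID in a call stack, or None."""
    critical = None
    for method_id in stack_bottom_to_top:
        if flags.get(method_id, False):
            critical = method_id
    return critical


def rank_critical(critical_by_trace, deviation_ms):
    """Rank critical MethodIDs by TraceID frequency, then by deviation.

    critical_by_trace maps TraceID -> critical MethodID (or None);
    deviation_ms maps MethodID -> deviation of its inclusive time from the
    median inclusive time of the normal TraceID group (assumed inputs).
    """
    frequency = Counter(m for m in critical_by_trace.values() if m is not None)
    return sorted(
        frequency,
        key=lambda m: (frequency[m], deviation_ms.get(m, 0.0)),
        reverse=True,
    )


top = critical_method(
    ["handle", "query", "parse"],
    {"handle": True, "query": True, "parse": False},
)
ranking = rank_critical(
    {"t1": "query", "t2": "query", "t3": "render", "t4": None},
    {"query": 50.0, "render": 900.0},
)
```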
FIG. 4 illustrates an example process 400 detailing analysis using clustering algorithms performed in block 208 of FIG. 2.
In block 402, an application method may be selected based on an occurrence frequency of the application method in a log file of the application. In block 404, the profile data associated with the selected application method, such as the associated TraceID and the associated inclusive and exclusive times, may be provided to a hierarchical clustering with mean linkage clustering algorithm (hierarchical clustering algorithm). Based on results from the hierarchical clustering algorithm in block 404, an outlier may be identified in block 406 as a small or sparse cluster whose median exclusive time or median inclusive time is higher than a predetermined percentile of an overall population of the selected application methods. The outlier may be identified using a machine-learning-based anomaly detection process.
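Blocks 404 and 406 can be sketched in plain Python as agglomerative clustering with mean (average) linkage over (inclusive time, exclusive time) points, followed by a size and percentile test. The distance cut-off, the maximum outlier cluster size, the 90th percentile, and the function names are all assumptions for illustration; the source does not give these values. The naive pairwise search is only suitable for small inputs.

```python
import math
import statistics


def average_linkage_clusters(points, max_distance):
    """Agglomerative clustering with mean linkage.

    Merging stops once the closest pair of clusters is farther apart than
    max_distance (a hypothetical cut-off). Returns lists of point indices.
    """
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Mean pairwise Euclidean distance between members of a and b.
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        distance, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if distance > max_distance:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def outlier_clusters(points, clusters, max_size=2, percentile=90):
    """Return small clusters whose median time exceeds the given percentile."""
    def cut(axis):
        values = [p[axis] for p in points]
        return statistics.quantiles(values, n=100, method="inclusive")[percentile - 1]

    incl_cut, excl_cut = cut(0), cut(1)
    outliers = []
    for cluster in clusters:
        if len(cluster) > max_size:
            continue
        median_incl = statistics.median([points[i][0] for i in cluster])
        median_excl = statistics.median([points[i][1] for i in cluster])
        if median_incl > incl_cut or median_excl > excl_cut:
            outliers.append(cluster)
    return outliers


# (inclusive time, exclusive time) per invocation of one selected method
points = [(10, 5), (11, 5), (10, 6), (12, 5), (11, 6), (10, 5),
          (200, 150), (210, 160)]
clusters = average_linkage_clusters(points, max_distance=20)
outliers = outlier_clusters(points, clusters)
```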
In block 408, the profile data associated with the selected application method may be provided to a Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm, and in block 410, based on results from the DBSCAN algorithm, a cluster with a low density may be identified as an outlier.
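A compact DBSCAN sketch for blocks 408 and 410 is shown below. In DBSCAN, points that lie in regions too sparse to form a cluster are labeled as noise; this sketch assumes those noise points are what the text treats as low-density outliers. The `eps` and `min_pts` values are illustrative, not taken from the source.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster number, or NOISE."""
    labels = [None] * len(points)

    def neighbors(i):
        # All points within eps of point i, including i itself.
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later become a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # core point: keep expanding
        cluster += 1
    return labels


# (inclusive time, exclusive time) per invocation of one selected method
points = [(10, 5), (11, 5), (10, 6), (12, 5), (11, 6), (10, 5),
          (200, 150), (210, 160)]
labels = dbscan(points, eps=3.0, min_pts=3)
low_density = [i for i, label in enumerate(labels) if label == NOISE]
```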
In block 412, the process repeats from block 402 if there is another application method to be evaluated. If there are no more application methods to be processed, then in block 414, MethodIDs associated with the clusters identified as outliers and associated TraceIDs may be ranked based on a frequency of occurrence of the associated TraceIDs, and based on a Euclidean distance of an inclusive time and an exclusive time of an application method associated with the identified MethodID.
The Euclidean distance of the inclusive time and the exclusive time of the application method associated with the identified MethodID may be calculated as:
$$\text{distance}(t_{incl}, t_{excl}) = \sqrt{\left(t_{incl} - \text{median}(t_{incl})\right)^2 + \left(t_{excl} - \text{median}(t_{excl})\right)^2}$$
where:
distance(t_incl, t_excl) is the Euclidean distance,
t_incl is the inclusive time of the application method associated with the identified MethodID,
median(t_incl) is a median of inclusive times of the application method in an overall population of the application methods,
t_excl is the exclusive time of the application method associated with the identified MethodID, and
median(t_excl) is a median of exclusive times of the application method in an overall population of the application methods.
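Reading the definitions above as the distance between the point (t_incl, t_excl) and the point formed by the two population medians, the calculation and the ranking of block 414 could be sketched as follows. The ranking order (frequency first, distance second), the input shapes, and all names are assumptions; the source lists both criteria without specifying how they are combined.

```python
import math


def median_distance(t_incl, t_excl, median_incl, median_excl):
    """Euclidean distance of (t_incl, t_excl) from the population medians."""
    return math.hypot(t_incl - median_incl, t_excl - median_excl)


def rank_outliers(outliers, median_incl, median_excl):
    """Rank outlier MethodIDs by TraceID frequency, then by distance.

    outliers maps MethodID -> (trace_ids, t_incl, t_excl), a hypothetical
    summary of the clusters identified as outliers for that method.
    """
    def key(method_id):
        trace_ids, t_incl, t_excl = outliers[method_id]
        distance = median_distance(t_incl, t_excl, median_incl, median_excl)
        return (len(set(trace_ids)), distance)

    return sorted(outliers, key=key, reverse=True)


ranking = rank_outliers(
    {
        "M1": (["t1", "t2"], 13.0, 9.0),
        "M2": (["t3"], 400.0, 300.0),
        "M3": (["t4", "t5"], 50.0, 35.0),
    },
    median_incl=10.0,
    median_excl=5.0,
)
```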
FIG. 5 illustrates an example block diagram of a system 502 for identifying an application method having an abnormal latency. The system 502 may be, or may reside in, the monitoring computing device 114. The system 502 may comprise one or more processors 504 and memory 506 coupled to the one or more processors 504. The memory 506 may comprise various modules communicatively coupled to each other which are executable by the one or more processors 504. As illustrated in FIG. 1, the system 502 may be communicatively coupled to the data center 102 and the user devices 106, 108, 110, and 112 via the network 104.
The memory 506 may comprise various modules that are communicatively coupled to each other. The various modules may comprise a profile data module 508, a statistical analysis module 510, a clustering algorithm module 512, and a list generator module 514. As discussed above with reference to FIG. 2, the profile data module 508 may be configured to collect profile data of an application, such as a Java application, running in the data center 102, where the profile data may comprise 1) a TraceID, which is a unique identifier of each web request and enables a user to search across all collected logs and trace how a request is passed from one microservice to the next, 2) a ThreadID, which is a unique numeric identifier of each application thread running in the data center 102, 3) a MethodID, which is a unique identifier for each application method of the application, 4) an inclusive time of each application method, where the inclusive time is a total time interval spent by an application method and one or more called application methods associated with the application method, and 5) an exclusive time of each application method, where the exclusive time is a time interval spent only by the application method itself. The profile data module 508 may further be configured to exclude, from the profile data, at least one of application methods having incomplete call stacks, application standard libraries, or application methods having inclusive times that are less than a threshold time.
As discussed above with reference to FIG. 3, the statistical analysis module 510 may be configured to statistically analyze application methods included in remaining profile data to identify a first set of application methods having abnormal latencies. The statistical analysis module 510 may select an application method at the bottom of each call stack (bottom application method) in order of when the call stack was created, and may select all TraceIDs associated with call stacks having the same bottom application method at the bottom of the call stack. The statistical analysis module 510 may group the selected TraceIDs into two groups: a normal TraceID group, in which each TraceID has the bottom application method with an expected inclusive time, and an abnormal TraceID group, in which each TraceID has the bottom application method with an inclusive time outside of a Tukey’s boxplot range. The TraceIDs with the bottom application method whose inclusive time lies outside the outer fences (Q1 − 3 × IQR, Q3 + 3 × IQR) may be considered to be extreme outliers, where Q1 is the first quartile of the inclusive times of the bottom application method from the selected TraceIDs, Q3 is the third quartile of the inclusive times of the bottom application method from the selected TraceIDs, and IQR is the interquartile range, which is equal to Q3 − Q1.
The statistical analysis module 510 may then calculate a ratio of an inclusive time of an application method from the abnormal TraceID group to an inclusive time of the same application method in the same call stack from the normal TraceID group. The call stacks with the same stack depths or the same first number may be considered to be the same. If the ratio is higher than a threshold value, an indicator for the MethodID associated with the application method used to calculate the ratio may be set to TRUE; otherwise, the indicator for the associated MethodID may be set to FALSE, indicating that the method is normal.
The statistical analysis module 510 may then identify a critical MethodID by selecting the topmost application method at which the corresponding call stack deviates significantly from normal behavior, changing from normal to abnormal. The statistical analysis module 510 may repeat the process above until all application methods are evaluated, and may rank critical MethodIDs based on a frequency of occurrence in different TraceIDs and a deviation of an inclusive time of a corresponding application method from a median inclusive time of the normal TraceID group.
As discussed above with reference to FIG. 4, the clustering algorithm module 512 may select an application method based on an occurrence frequency of the application method in a log file of the application. The profile data associated with the selected application method, such as the associated TraceID and the associated inclusive and exclusive times, may be provided to the hierarchical clustering with mean linkage clustering algorithm (hierarchical clustering algorithm) of the clustering algorithm module 512. The clustering algorithm module 512 may then identify, as an outlier, a small or sparse cluster whose median exclusive time or median inclusive time is higher than a predetermined percentile of an overall population of the selected application methods. The outlier may be identified using a machine-learning-based anomaly detection process.
The profile data associated with the selected application method may also be provided to a Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm of the clustering algorithm module 512. Based on results from the DBSCAN algorithm, the clustering algorithm module 512 may identify a cluster with a low density as an outlier.
The clustering algorithm module 512 may repeat the process above until all application methods are evaluated, and may rank MethodIDs associated with the clusters identified as outliers and associated TraceIDs based on a frequency of occurrence of the associated TraceIDs, and based on a Euclidean distance of an inclusive time and an exclusive time of an application method associated with the identified MethodID.
The Euclidean distance of the inclusive time and the exclusive time of the application method associated with the identified MethodID may be calculated as:
$$\text{distance}(t_{incl}, t_{excl}) = \sqrt{\left(t_{incl} - \text{median}(t_{incl})\right)^2 + \left(t_{excl} - \text{median}(t_{excl})\right)^2}$$
where:
distance(t_incl, t_excl) is the Euclidean distance,
t_incl is the inclusive time of the application method associated with the identified MethodID,
median(t_incl) is a median of inclusive times of the application method in an overall population of the application methods,
t_excl is the exclusive time of the application method associated with the identified MethodID, and
median(t_excl) is a median of exclusive times of the application method in an overall population of the application methods.
Some or all operations of the methods described above can be performed by execution of computer-readable instructions stored on a computer-readable storage medium, as defined below. The term “computer-readable instructions,” as used in the description and claims, includes routines, applications, application modules, program modules, programs, components, data structures, algorithms, and the like. Computer-readable instructions can be implemented on various system configurations, including single-processor or multiprocessor systems, minicomputers, mainframe computers, personal computers, hand-held computing devices, microprocessor-based or programmable consumer electronics, combinations thereof, and the like.
The computer-readable storage media may include volatile memory (such as random access memory (RAM)) and/or non-volatile memory (such as read-only memory (ROM), flash memory, etc.). The computer-readable storage media may also include additional removable storage and/or non-removable storage including, but not limited to, flash memory, magnetic storage, optical storage, and/or tape storage that may provide non-volatile storage of computer-readable instructions, data structures, program modules, and the like.
A non-transient computer-readable storage medium is an example of computer-readable media. Computer-readable media includes at least two types of computer-readable media, namely computer-readable storage media and communications media. Computer-readable storage media includes volatile and non-volatile, removable and non-removable media implemented in any process or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer-readable storage media includes, but is not limited to, phase change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disk read-only memory (CD-ROM), digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information for access by a computing device. In contrast, communication media may embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave, or other transmission mechanism. As defined herein, computer-readable storage media do not include communication media.
The computer-readable instructions stored on one or more non-transitory computer-readable storage media may, when executed by one or more processors, perform the operations described above with reference to FIGs. 2-5. Generally, computer-readable instructions include routines, programs, objects, components, data structures, and the like that perform particular functions or implement particular abstract data types. The order in which the operations are described is not intended to be construed as a limitation, and any number of the described operations can be combined in any order and/or in parallel to implement the processes.
EXAMPLE CLAUSES
A. A method for identifying an application method, having an abnormal latency, of an application running in a data center, the method comprising: collecting profile data of the application; excluding a predetermined set of data from the profile data; statistically analyzing application methods included in remaining profile data to identify a first set of application methods having abnormal latencies; analyzing the application methods included in the remaining profile data using clustering algorithms to identify a second set of application methods; and generating a list of application methods having abnormal latencies based on at least one of the first set of application methods or the second set of application methods.
B. A method as paragraph A recites, wherein the application method is a Java application method and the application running in the data center is a Java application.
C. A method as paragraph A recites, wherein the data center comprises one or more computing devices, one or more servers, or a cloud computing center.
D. A method as paragraph A recites, wherein the profile data comprises: a TraceID being a unique identifier of each web request; a ThreadID being a unique numeric identifier of each thread running in the data center; a MethodID being a unique identifier for each application method; an inclusive time of each application method, the inclusive time being a total time interval spent by an application method and one or more called application methods associated with the application method; and an exclusive time of each application method, the exclusive time being a time interval spent only by each application method.
E. A method as paragraph D recites, wherein the predetermined set of data excluded from the profile data comprises at least one of: application methods having incomplete call stacks; application standard libraries; or application methods having inclusive times that are less than a threshold time.
F. A method as paragraph E recites, wherein statistically analyzing the application methods included in the remaining profile data to identify the first set of application methods having abnormal latencies comprises: selecting a bottom application method at a bottom of each call stack in order of when a call stack is created; selecting all TraceIDs associated with call stacks having the bottom application method at a bottom of a call stack; grouping the selected TraceIDs into a normal TraceID group and an abnormal TraceID group, the normal TraceID group being a group of normal TraceIds where a normal TraceID has the bottom application method with an expected inclusive time, and the abnormal TraceID group being a group of abnormal TraceIds where an abnormal Trace ID has the bottom application method with an inclusive time outside of a Tukey’s boxplot range; for all application methods in the remaining profile data: calculating a ratio of an inclusive time of an application method from the abnormal TraceID group and an inclusive time of the application method in a same call stack from the normal TraceID group, and identifying a critical MethodID by selecting a top most application method where a corresponding call stack is changing from normal to abnormal; and ranking critical MethodIDs based on a frequency of occurrence in different TraceIds and a deviation of an inclusive time of a corresponding application method from a median inclusive time of the normal TraceID group.
G. A method as paragraph F recites, wherein analyzing the application methods included in the remaining profile data using clustering algorithms to identify a second set of application methods comprises: selecting an application method based on an occurrence frequency of the application method in a log file of the application; providing profile data associated with the selected application method to a hierarchical clustering with mean linkage clustering algorithm (hierarchical clustering algorithm); based on results from the hierarchical clustering algorithm, identifying, as an outlier, a cluster having a low number of records and a median exclusive time or median inclusive time being higher than a predetermined percentile of an overall population of the selected application methods; providing profile data associated with the selected application method to a Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm; based on results from the DBSCAN algorithm, identifying, as an outlier, a cluster with a low density; and ranking MethodIDs associated with the clusters identified as outliers and associated TraceIDs based on a frequency of occurrence of the associated TraceIDs and a Euclidean distance of an inclusive time and an exclusive time of an application method associated with the identified MethodID.
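To make the density-based step of paragraph G concrete, the following is a minimal, unoptimized DBSCAN sketch over (inclusive time, exclusive time) points. It is not the implementation described in the source; the `eps` and `min_pts` values and the sample timings are assumptions, and the sketch treats DBSCAN noise points as the "low density" outliers. The hierarchical clustering step is not shown, and a production system would more likely rely on a library implementation.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_pts):
    """Label each point with a cluster index, or NOISE for low-density points."""
    labels = [None] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE            # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # core point: keep growing the cluster
    return labels


# Hypothetical (inclusive ms, exclusive ms) samples for one selected method.
samples = [(100, 10), (101, 11), (99, 10), (102, 12), (100, 11), (480, 300)]
labels = dbscan(samples, eps=5.0, min_pts=3)
outliers = [p for p, label in zip(samples, labels) if label == NOISE]
print(labels, outliers)
```

The five tightly grouped samples form one dense cluster, while the isolated (480, 300) sample has no neighbors within `eps` and is reported as an outlier.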
H. A method as paragraph G recites, wherein the Euclidean distance of the inclusive time and the exclusive time of the application method associated with the identified MethodID is:
$$\mathrm{distance}(t_{incl}, t_{excl}) = \sqrt{\left(t_{incl} - \mathrm{median}(t_{incl})\right)^{2} + \left(t_{excl} - \mathrm{median}(t_{excl})\right)^{2}}$$
where distance (tincl, texcl) is the Euclidean distance, tincl is the inclusive time of the application method associated with the identified MethodID, median (tincl) is a median of inclusive times of the application method in an overall population of the application methods, texcl is the exclusive time of the application method associated with the identified MethodID, and median (texcl) is a median of exclusive times of the application method in an overall population of the application methods.
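Reading paragraph H as the ordinary two-dimensional Euclidean distance between a method's (inclusive, exclusive) times and the point formed by the two population medians, the computation can be sketched as below. The function name and the sample populations are hypothetical.

```python
from math import hypot
from statistics import median


def latency_distance(t_incl, t_excl, population_incl, population_excl):
    """Euclidean distance of (t_incl, t_excl) from the population medians."""
    return hypot(t_incl - median(population_incl),
                 t_excl - median(population_excl))


# Hypothetical populations of inclusive and exclusive times (ms) for one method.
population_incl = [100, 101, 99, 102, 100]   # median is 100
population_excl = [10, 11, 10, 12, 11]       # median is 11
d = latency_distance(103, 15, population_incl, population_excl)
print(d)
```

A larger distance means the observation sits further from typical behavior in both time dimensions, which is why it is usable as a ranking key alongside the frequency of occurrence.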
I. One or more non-transitory computer-readable storage media storing computer-readable instructions executable by one or more processors, that when executed by the one or more processors, cause the one or more processors to perform operations comprising: collecting profile data of an application running in a data center; excluding a predetermined set of data from the profile data; statistically analyzing application methods included in remaining profile data to identify a first set of application methods having abnormal latencies; analyzing the application methods included in the remaining profile data using clustering algorithms to identify a second set of application methods; and generating a list of application methods having abnormal latencies based on at least one of the first set of application methods or the second set of application methods.
J. One or more non-transitory computer-readable storage media as paragraph I recites, wherein the application method is a Java application method and the application running in the data center is a Java application.
K. One or more non-transitory computer-readable storage media as paragraph I recites, wherein the data center comprises one or more computing devices, one or more servers, or a cloud computing center.
L. One or more non-transitory computer-readable storage media as paragraph I recites, wherein the profile data comprises: a TraceID being a unique identifier of each web request; a ThreadID being a unique numeric identifier of each thread running in the data center; a MethodID being a unique identifier for each application method; an inclusive time of each application method, the inclusive time being a total time interval spent by an application method and one or more called application methods associated with the application method; and an exclusive time of each application method, the exclusive time being a time interval spent only by each application method.
M. One or more non-transitory computer-readable storage media as paragraph L recites, wherein the predetermined set of data excluded from the profile data comprises at least one of: application methods having incomplete call stacks; application standard libraries; or application methods having inclusive times that are less than a threshold time.
N. One or more non-transitory computer-readable storage media as paragraph M recites, wherein statistically analyzing the application methods included in the remaining profile data to identify the first set of application methods having abnormal latencies comprises: selecting a bottom application method at a bottom of each call stack in order of when a call stack is created; selecting all TraceIDs associated with call stacks having the bottom application method at a bottom of a call stack; grouping the selected TraceIDs into a normal TraceID group and an abnormal TraceID group, the normal TraceID group being a group of normal TraceIDs where a normal TraceID has the bottom application method with an expected inclusive time, and the abnormal TraceID group being a group of abnormal TraceIDs where an abnormal TraceID has the bottom application method with an inclusive time outside of a Tukey’s boxplot range; for all application methods in the remaining profile data: calculating a ratio of an inclusive time of an application method from the abnormal TraceID group and an inclusive time of the application method in a same call stack from the normal TraceID group, and identifying a critical MethodID by selecting a topmost application method where a corresponding call stack is changing from normal to abnormal; and ranking critical MethodIDs based on a frequency of occurrence in different TraceIDs and a deviation of an inclusive time of a corresponding application method from a median inclusive time of the normal TraceID group.
O. One or more non-transitory computer-readable storage media as paragraph N recites, wherein analyzing the application methods included in the remaining profile data using clustering algorithms to identify a second set of application methods comprises: selecting an application method based on an occurrence frequency of the application method in a log file of the application; providing profile data associated with the selected application method to a hierarchical clustering with mean linkage clustering algorithm (hierarchical clustering algorithm); based on results from the hierarchical clustering algorithm, identifying, as an outlier, a cluster having a low number of records and a median exclusive time or median inclusive time being higher than a predetermined percentile of an overall population of the application methods; providing profile data associated with the selected application method to a Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm; based on results from the DBSCAN algorithm, identifying, as an outlier, a cluster with a low density; and ranking MethodIDs associated with the clusters identified as outliers and associated TraceIDs based on a frequency of occurrence of the associated TraceIDs and a Euclidean distance of an inclusive time and an exclusive time of an application method associated with the identified MethodID.
P. One or more non-transitory computer-readable storage media as paragraph O recites, wherein the Euclidean distance of the inclusive time and the exclusive time of the application method associated with the identified MethodID is:
$$\mathrm{distance}(t_{incl}, t_{excl}) = \sqrt{\left(t_{incl} - \mathrm{median}(t_{incl})\right)^{2} + \left(t_{excl} - \mathrm{median}(t_{excl})\right)^{2}}$$
wherein: distance (tincl, texcl) is the Euclidean distance, tincl is the inclusive time of the application method associated with the identified MethodID, median (tincl) is a median of inclusive times of the application method in an overall population of the application methods, texcl is the exclusive time of the application method associated with the identified MethodID, and median (texcl) is a median of exclusive times of the application method in an overall population of the application methods.
Q. A system for identifying an application method having an abnormal latency, the system comprising: one or more processors; and memory coupled to the one or more processors, the memory including modules communicatively coupled to each other and executable by the one or more processors, the modules comprising: a profile data module configured to collect profile data of an application running in a data center and to exclude, from the profile data, at least one of: application methods having incomplete call stacks, application standard libraries, or application methods having inclusive times that are less than a threshold time; a statistical analysis module configured to statistically analyze application methods included in remaining profile data to identify a first set of application methods having abnormal latencies; a clustering algorithm module configured to analyze the application methods included in the remaining profile data using clustering algorithms to identify a second set of application methods; and a list generator module configured to generate a list of application methods having abnormal latencies based on at least one of the first set of application methods or the second set of application methods.
R. A system as paragraph Q recites, wherein the statistical analysis module is further configured to: select a bottom application method at a bottom of each call stack in order of when a call stack is created; select all TraceIDs associated with call stacks having the bottom application method at a bottom of a call stack; group the selected TraceIDs into a normal TraceID group and an abnormal TraceID group, the normal TraceID group being a group of normal TraceIDs where a normal TraceID has the bottom application method with an expected inclusive time, and the abnormal TraceID group being a group of abnormal TraceIDs where an abnormal TraceID has the bottom application method with an inclusive time outside of a Tukey’s boxplot range; for all application methods in the remaining profile data: calculate a ratio of an inclusive time of an application method from the abnormal TraceID group and an inclusive time of the application method in a same call stack from the normal TraceID group, and identify a critical MethodID by selecting a topmost application method where a corresponding call stack is changing from normal to abnormal; and rank critical MethodIDs based on a frequency of occurrence in different TraceIDs and a deviation of an inclusive time of a corresponding application method from a median inclusive time of the normal TraceID group.
S. A system as paragraph R recites, wherein the clustering algorithm module is further configured to: select an application method based on an occurrence frequency of the application method in a log file of the application; provide profile data associated with the selected application method to a hierarchical clustering with mean linkage clustering algorithm (hierarchical clustering algorithm); based on results from the hierarchical clustering algorithm, identify, as an outlier, a cluster having a low number of records and a median exclusive time or median inclusive time being higher than a predetermined percentile of an overall population of the application methods; provide profile data associated with the selected application method to a Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm; based on results from the DBSCAN algorithm, identify, as an outlier, a cluster with a low density; and rank MethodIDs associated with the clusters identified as outliers and associated TraceIDs based on a frequency of occurrence of the associated TraceIDs and a Euclidean distance of an inclusive time and an exclusive time of an application method associated with the identified MethodID.
T. A system as paragraph S recites, wherein the Euclidean distance of the inclusive time and the exclusive time of the application method associated with the identified MethodID is:
$$\mathrm{distance}(t_{incl}, t_{excl}) = \sqrt{\left(t_{incl} - \mathrm{median}(t_{incl})\right)^{2} + \left(t_{excl} - \mathrm{median}(t_{excl})\right)^{2}}$$
wherein: distance (tincl, texcl) is the Euclidean distance, tincl is the inclusive time of the application method associated with the identified MethodID, median (tincl) is a median of inclusive times of the application method in an overall population of the application methods, texcl is the exclusive time of the application method associated with the identified MethodID, and median (texcl) is a median of exclusive times of the application method in an overall population of the application methods.
CONCLUSION
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as exemplary forms of implementing the claims.
Claims (21)
- A method comprising: collecting profile data of an application running in a computing device; excluding a predetermined set of data from the profile data; and identifying a first set of application methods having abnormal latencies.
- A method of claim 1, wherein the application method is a Java application method and the application running in the computing device is a Java application.
- A method of claim 1, wherein the computing device comprises one or more computing devices, one or more servers, or a cloud computing center.
- A method of claim 1, wherein the profile data comprises: a TraceID being a unique identifier of each web request; a ThreadID being a unique numeric identifier of each thread running in the computing device; a MethodID being a unique identifier for each application method; an inclusive time of each application method, the inclusive time being a total time interval spent by an application method and one or more called application methods associated with the application method; and an exclusive time of each application method, the exclusive time being a time interval spent only by each application method.
- A method of claim 4, wherein the predetermined set of data excluded from the profile data comprises at least one of: application methods having incomplete call stacks; application standard libraries; or application methods having inclusive times that are less than a threshold time.
- A method of claim 1, wherein identifying the first set of application methods having abnormal latencies comprises: statistically analyzing application methods included in remaining profile data to identify the first set of application methods having abnormal latencies.
- A method of claim 6, wherein statistically analyzing the application methods included in the remaining profile data to identify the first set of application methods having abnormal latencies comprises: selecting a bottom application method at a bottom of each call stack in order of when a call stack is created; selecting all TraceIDs associated with call stacks having the bottom application method at a bottom of a call stack; grouping the selected TraceIDs into a normal TraceID group and an abnormal TraceID group, the normal TraceID group being a group of normal TraceIDs where a normal TraceID has the bottom application method with an expected inclusive time, and the abnormal TraceID group being a group of abnormal TraceIDs where an abnormal TraceID has the bottom application method with an inclusive time outside of a Tukey’s boxplot range; for all application methods in the remaining profile data: calculating a ratio of an inclusive time of an application method from the abnormal TraceID group and an inclusive time of the application method in a same call stack from the normal TraceID group, and identifying a critical MethodID by selecting a topmost application method where a corresponding call stack is changing from normal to abnormal; and ranking critical MethodIDs based on a frequency of occurrence in different TraceIDs and a deviation of an inclusive time of a corresponding application method from a median inclusive time of the normal TraceID group.
- A method of claim 6, further comprising: analyzing the application methods included in the remaining profile data using clustering algorithms to identify a second set of application methods; and generating a list of application methods having abnormal latencies based on at least one of the first set of application methods or the second set of application methods.
- A method of claim 8, wherein analyzing the application methods included in the remaining profile data using clustering algorithms to identify a second set of application methods comprises: selecting an application method based on an occurrence frequency of the application method in a log file of the application; providing profile data associated with the selected application method to a hierarchical clustering with mean linkage clustering algorithm (hierarchical clustering algorithm); based on results from the hierarchical clustering algorithm, identifying, as an outlier, a small or sparse cluster having a median exclusive time or median inclusive time being higher than a predetermined percentile of an overall population of the selected application methods; providing profile data associated with the selected application method to a Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm; based on results from the DBSCAN algorithm, identifying, as an outlier, a cluster with a low density; and ranking MethodIDs associated with the clusters identified as outliers and associated TraceIDs based on a frequency of occurrence of the associated TraceIDs and a Euclidean distance of an inclusive time and an exclusive time of an application method associated with the identified MethodID.
- One or more non-transitory computer-readable storage media storing computer-readable instructions executable by one or more processors, that when executed by the one or more processors, cause the one or more processors to perform operations comprising: collecting profile data of an application running in a computing device; excluding a predetermined set of data from the profile data; and identifying a first set of application methods having abnormal latencies.
- One or more non-transitory computer-readable storage media of claim 10, wherein the application method is a Java application method and the application running in the computing device is a Java application.
- One or more non-transitory computer-readable storage media of claim 10, wherein the computing device comprises one or more computing devices, one or more servers, or a cloud computing center.
- One or more non-transitory computer-readable storage media of claim 10, wherein the profile data comprises: a TraceID being a unique identifier of each web request; a ThreadID being a unique numeric identifier of each thread running in the computing device; a MethodID being a unique identifier for each application method; an inclusive time of each application method, the inclusive time being a total time interval spent by an application method and one or more called application methods associated with the application method; and an exclusive time of each application method, the exclusive time being a time interval spent only by each application method.
- One or more non-transitory computer-readable storage media of claim 13, wherein the predetermined set of data excluded from the profile data comprises at least one of: application methods having incomplete call stacks; application standard libraries; or application methods having inclusive times that are less than a threshold time.
- One or more non-transitory computer-readable storage media of claim 10, wherein identifying the first set of application methods having abnormal latencies comprises: statistically analyzing application methods included in remaining profile data to identify the first set of application methods having abnormal latencies.
- One or more non-transitory computer-readable storage media of claim 15, wherein statistically analyzing the application methods included in the remaining profile data to identify the first set of application methods having abnormal latencies comprises: selecting a bottom application method at a bottom of each call stack in order of when a call stack is created; selecting all TraceIDs associated with call stacks having the bottom application method at a bottom of a call stack; grouping the selected TraceIDs into a normal TraceID group and an abnormal TraceID group, the normal TraceID group being a group of normal TraceIDs where a normal TraceID has the bottom application method with an expected inclusive time, and the abnormal TraceID group being a group of abnormal TraceIDs where an abnormal TraceID has the bottom application method with an inclusive time outside of a Tukey’s boxplot range; for all application methods in the remaining profile data: calculating a ratio of an inclusive time of an application method from the abnormal TraceID group and an inclusive time of the application method in a same call stack from the normal TraceID group, and identifying a critical MethodID by selecting a topmost application method where a corresponding call stack is changing from normal to abnormal; and ranking critical MethodIDs based on a frequency of occurrence in different TraceIDs and a deviation of an inclusive time of a corresponding application method from a median inclusive time of the normal TraceID group.
- One or more non-transitory computer-readable storage media of claim 16, wherein the operations further comprise: analyzing the application methods included in the remaining profile data using clustering algorithms to identify a second set of application methods; and generating a list of application methods having abnormal latencies based on at least one of the first set of application methods or the second set of application methods.
- One or more non-transitory computer-readable storage media of claim 17, wherein analyzing the application methods included in the remaining profile data using clustering algorithms to identify a second set of application methods comprises: selecting an application method based on an occurrence frequency of the application method in a log file of the application; providing profile data associated with the selected application method to a hierarchical clustering with mean linkage clustering algorithm (hierarchical clustering algorithm); based on results from the hierarchical clustering algorithm, identifying, as an outlier, a small or sparse cluster having a median exclusive time or median inclusive time being higher than a predetermined percentile of an overall population of the application methods; providing profile data associated with the selected application method to a Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm; based on results from the DBSCAN algorithm, identifying, as an outlier, a cluster with a low density; and ranking MethodIDs associated with the clusters identified as outliers and associated TraceIDs based on a frequency of occurrence of the associated TraceIDs and a Euclidean distance of an inclusive time and an exclusive time of an application method associated with the identified MethodID.
- A system for identifying an application method having an abnormal latency, the system comprising: one or more processors; and memory coupled to the one or more processors, the memory including modules communicatively coupled to each other and executable by the one or more processors, the modules comprising: a profile data module configured to collect profile data of an application running in a computing device and to exclude, from the profile data, at least one of: application methods having incomplete call stacks, application standard libraries, or application methods having inclusive times that are less than a threshold time; a statistical analysis module configured to statistically analyze application methods included in remaining profile data to identify a first set of application methods having abnormal latencies; a clustering algorithm module configured to analyze the application methods included in the remaining profile data using clustering algorithms to identify a second set of application methods; and a list generator module configured to generate a list of application methods having abnormal latencies based on at least one of the first set of application methods or the second set of application methods.
- A system of claim 19, wherein the statistical analysis module is further configured to: select a bottom application method at a bottom of each call stack in order of when a call stack is created; select all TraceIDs associated with call stacks having the bottom application method at a bottom of a call stack; group the selected TraceIDs into a normal TraceID group and an abnormal TraceID group, the normal TraceID group being a group of normal TraceIDs where a normal TraceID has the bottom application method with an expected inclusive time, and the abnormal TraceID group being a group of abnormal TraceIDs where an abnormal TraceID has the bottom application method with an inclusive time outside of a Tukey’s boxplot range; for all application methods in the remaining profile data: calculate a ratio of an inclusive time of an application method from the abnormal TraceID group and an inclusive time of the application method in a same call stack from the normal TraceID group, and identify a critical MethodID by selecting a topmost application method where a corresponding call stack is changing from normal to abnormal; and rank critical MethodIDs based on a frequency of occurrence in different TraceIDs and a deviation of an inclusive time of a corresponding application method from a median inclusive time of the normal TraceID group.
- A system of claim 20, wherein the clustering algorithm module is further configured to: select an application method based on an occurrence frequency of the application method in a log file of the application; provide profile data associated with the selected application method to a hierarchical clustering with mean linkage clustering algorithm (hierarchical clustering algorithm); based on results from the hierarchical clustering algorithm, identify, as an outlier, a small or sparse cluster having a median exclusive time or median inclusive time being higher than a predetermined percentile of an overall population of the application methods; provide profile data associated with the selected application method to a Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm; based on results from the DBSCAN algorithm, identify, as an outlier, a cluster with a low density; and rank MethodIDs associated with the clusters identified as outliers and associated TraceIDs based on a frequency of occurrence of the associated TraceIDs and a Euclidean distance of an inclusive time and an exclusive time of an application method associated with the identified MethodID.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2017/100457 WO2019046996A1 (en) | 2017-09-05 | 2017-09-05 | Java software latency anomaly detection |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019046996A1 true WO2019046996A1 (en) | 2019-03-14 |
Family
ID=65633414
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2019046996A1 (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110069354A (en) * | 2019-04-15 | 2019-07-30 | 必成汇(成都)科技有限公司 | The full link trace method of micro services and micro services framework |
CN111522900A (en) * | 2020-03-18 | 2020-08-11 | 携程计算机技术(上海)有限公司 | Method, system, device and storage medium for automatically analyzing unstructured data |
EP3929782A1 (en) * | 2020-06-26 | 2021-12-29 | Acronis International GmbH | Systems and methods for detecting behavioral anomalies in applications |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5872976A (en) * | 1997-04-01 | 1999-02-16 | Landmark Systems Corporation | Client-based system for monitoring the performance of application programs |
US7921410B1 (en) * | 2007-04-09 | 2011-04-05 | Hewlett-Packard Development Company, L.P. | Analyzing and application or service latency |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110069354A (en) * | 2019-04-15 | 2019-07-30 | 必成汇(成都)科技有限公司 | The full link trace method of micro services and micro services framework |
CN111522900A (en) * | 2020-03-18 | 2020-08-11 | 携程计算机技术(上海)有限公司 | Method, system, device and storage medium for automatically analyzing unstructured data |
CN111522900B (en) * | 2020-03-18 | 2023-09-01 | 携程计算机技术(上海)有限公司 | Automatic analysis method, system, equipment and storage medium for unstructured data |
EP3929782A1 (en) * | 2020-06-26 | 2021-12-29 | Acronis International GmbH | Systems and methods for detecting behavioral anomalies in applications |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 17924354 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 17924354 Country of ref document: EP Kind code of ref document: A1 |