WO2019046996A1 - Java software latency anomaly detection - Google Patents

Java software latency anomaly detection

Info

Publication number
WO2019046996A1
Authority
WO
WIPO (PCT)
Prior art keywords
application
application method
abnormal
time
profile data
Prior art date
Application number
PCT/CN2017/100457
Other languages
French (fr)
Inventor
Kingsum Chow
Wanyi ZHU
Chuansheng LU
Jiapeng LI
Sanhong Li
Original Assignee
Alibaba Group Holding Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Limited filed Critical Alibaba Group Holding Limited
Priority to PCT/CN2017/100457 priority Critical patent/WO2019046996A1/en
Publication of WO2019046996A1 publication Critical patent/WO2019046996A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44: Arrangements for executing specific programs
    • G06F9/455: Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45504: Abstract machines for programme code execution, e.g. Java virtual machine [JVM], interpreters, emulators
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/30: Monitoring
    • G06F11/3003: Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/302: Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a software system
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/30: Monitoring
    • G06F11/34: Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3409: Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
    • G06F11/3419: Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment by assessing time
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/30: Monitoring
    • G06F11/34: Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3452: Performance evaluation by statistical analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/30: Monitoring
    • G06F11/34: Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3466: Performance evaluation by tracing or monitoring
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00: Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/865: Monitoring of software

Definitions

  • any latencies in such transactions have a range of consequences.
  • small scale transactions such as a personal online shopping
  • associated consequences may be limited to the individual user being annoyed from a lack of response.
  • large scale transactions such as commercial, industrial, or security transactions
  • the associated consequences may result in a critical commercial or security transaction failing to complete.
  • many interacting devices are often updated or replaced with newer software and/or hardware, and unintended incompatibility among the devices may introduce latencies.
  • Data center software developers and operators may use software method latencies to locate the sources of poor response times, where each software method, or application method, is a set of code which is referred to by its own name and can be called or invoked at any point in a program by using that name. While commercially available tools, such as the Java Development Kit (JDK), can collect application method latencies, analyzing large amounts of such data may be difficult.
  • a Java method tracing capability implemented in a customized JDK product may provide diagnostic and profiling options to track the performance of Java applications via instrumentation.
  • instructions may be added at the entry and exit of a compiled Java method by modifying the interpreter and compilers in a Java Virtual Machine (JVM), which allows users to track the entry time and exit time of an execution of the Java application method.
  • the Java method tracing capability may also provide application programming interfaces (APIs) for profiling a system to specify which thread could be traced, and allow this tracing ability to be enabled and disabled on the fly on a per-thread basis.
  • the Java method tracing capability may allow the user to get comprehensive information regarding a code flow in a Java application.
  • Java profiling such as 1) how to automatically detect anomalous performance behaviors, 2) how to automatically identify a method with abnormal high-latency behavior, 3) how to identify workload or incoming web requests affected by a long-latency behavior in a responsive and timely manner, and 4) how to prioritize identified issues.
  • FIG. 1 illustrates an example environment in which a software latency anomaly may be detected.
  • FIG. 2 illustrates an example flow diagram for identifying an application method having a software latency anomaly.
  • FIG. 3 illustrates an example process detailing one of the blocks of FIG. 2.
  • FIG. 4 illustrates an example process detailing one of the blocks of FIG. 2.
  • FIG. 5 illustrates an example block diagram of a system for identifying an application method having an abnormal latency.
  • Systems and methods discussed herein are directed to automatically detecting latency anomaly from a large amount of application method latency data by a combination of statistical and clustering analyses and machine learning methods.
  • the systems and methods discussed herein allow users to automatically identify perceptible performance inefficiency by focusing on latency of an application method without stepping through a code, which may be used by companies running large data centers and using Java as their primary development language.
  • FIG. 1 illustrates an example environment 100 in which a software latency anomaly in an application may be detected.
  • Software latency may be referred to interchangeably as application latency, and a software method may be referred to interchangeably as an application method.
  • a data center 102 (which may be a single computing device or a cloud computing center comprising a plurality of computing devices and/or servers) is communicatively coupled, via a network 104, to a plurality of user devices, of which four user devices 106, 108, 110, and 112 are shown.
  • the data center 102 and the plurality of user devices 106, 108, 110, and 112 are also communicatively coupled to a monitoring computing device 114 via the network 104.
  • the monitoring computing device 114 is configured to monitor the application running in the data center 102 for abnormal latencies of application methods in the application.
  • the application running in the data center 102 that is monitored by the monitoring computing device 114 may be a Java application.
  • FIG. 2 illustrates an example flow diagram 200 for identifying an application method having a software latency anomaly.
  • profile data of the application may be collected.
  • the profile data may comprise various parameters associated with the application, such as, but not limited to 1) a TraceID which is a unique identifier of each web request and enables a user to search across all collected logs and trace how a request is passed from one service to the next, 2) a ThreadID which is a unique numeric identifier of each application thread running in the data center 102, 3) a MethodID which is a unique identifier for each application method of the application, 4) an inclusive time of each application method, where the inclusive time is a total time interval spent by an application method and one or more called application methods associated with that application method; and 5) an exclusive time of each application method where the exclusive time is a time interval spent only by the application method itself.
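The five fields above can be pictured as one record per method call. The following is a minimal sketch, not the patent's implementation: it assumes a hypothetical stream of properly nested entry/exit events from a single thread, and derives the exclusive time by subtracting the time spent in directly called methods from the inclusive time. The names `ProfileRecord`, `records_from_events`, and the sample method names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProfileRecord:
    trace_id: str        # web request the call belongs to
    thread_id: int       # application thread that ran the call
    method_id: str       # application method that was called
    inclusive_ms: float  # time in the method plus everything it called
    exclusive_ms: float  # time in the method body alone


def records_from_events(trace_id, thread_id, events):
    """Turn (kind, method_id, timestamp_ms) events from one thread into records.

    kind is "enter" or "exit"; events are assumed to be properly nested.
    """
    stack = []    # open frames: [method_id, enter_timestamp, time_in_callees]
    records = []
    for kind, method_id, ts in events:
        if kind == "enter":
            stack.append([method_id, ts, 0.0])
            continue
        name, start, in_callees = stack.pop()
        inclusive = float(ts - start)
        records.append(ProfileRecord(trace_id, thread_id, name,
                                     inclusive, inclusive - in_callees))
        if stack:
            stack[-1][2] += inclusive  # charge this call to its caller
    return records


# Hypothetical trace: submit() runs for 100 ms, 30 ms of which is spent in save().
events = [("enter", "OrderService.submit", 0), ("enter", "Dao.save", 10),
          ("exit", "Dao.save", 40), ("exit", "OrderService.submit", 100)]
records = records_from_events("trace-1", 17, events)
```

Records are emitted in exit order, so a called method appears before its caller.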
  • a predetermined set of data may be excluded from the profile data for further evaluation.
  • the predetermined set of data to be excluded may comprise 1) application methods having incomplete call stacks, which are known to occur very rarely, 2) application standard libraries, which are commonly available and used and are not unique to the application running, and 3) application methods having inclusive times that are less than a threshold time, which are considered to be behaving normally.
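The three exclusion rules can be sketched as a single filtering pass. This is an illustration under stated assumptions: each record is a plain dictionary, the `complete_stack` flag, the package prefixes used to recognize standard libraries, and the 1 ms threshold are all hypothetical choices that the text does not specify.

```python
def filter_profile(records,
                   stdlib_prefixes=("java.", "javax.", "sun."),
                   min_inclusive_ms=1.0):
    """Drop records that match any of the three exclusion rules."""
    kept = []
    for r in records:
        if not r["complete_stack"]:                       # rule 1: incomplete call stack
            continue
        if r["method_id"].startswith(stdlib_prefixes):    # rule 2: standard library
            continue
        if r["inclusive_ms"] < min_inclusive_ms:          # rule 3: below threshold
            continue
        kept.append(r)
    return kept


sample = [
    {"method_id": "com.shop.Cart.add", "inclusive_ms": 42.0, "complete_stack": True},
    {"method_id": "java.util.HashMap.put", "inclusive_ms": 5.0, "complete_stack": True},
    {"method_id": "com.shop.Cart.total", "inclusive_ms": 0.2, "complete_stack": True},
    {"method_id": "com.shop.Pay.charge", "inclusive_ms": 90.0, "complete_stack": False},
]
remaining = filter_profile(sample)
```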
  • the application methods included in the remaining profile data may be statistically analyzed to identify a first set of application methods having abnormal latencies, and in block 208, the application methods included in the remaining profile data may be analyzed using clustering algorithms to identify a second set of application methods. A list of application methods having abnormal latencies may then be generated in block 210 based on at least one of the first set of application methods or the second set of application methods.
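One way to read "based on at least one of the first set or the second set" is as a union of the two results. The sketch below assumes that reading and, as an additional illustrative choice not stated in the text, lists methods flagged by both analyses first.

```python
def abnormal_method_list(first_set, second_set):
    """Combine the statistical and clustering results into one list."""
    first, second = set(first_set), set(second_set)
    flagged_by_both = sorted(first & second)
    flagged_by_one = sorted((first | second) - (first & second))
    return flagged_by_both + flagged_by_one


combined = abnormal_method_list({"Dao.save", "Cart.add"}, {"Cart.add", "Pay.charge"})
```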
  • FIG. 3 illustrates an example process 300 detailing the statistical analysis performed in block 206 of FIG. 2.
  • an application method at the bottom of each call stack (bottom application method) may be selected in order of when the call stack was created, and in block 304, all TraceIDs associated with call stacks having the same bottom application method at the bottom of the call stack may be selected.
  • the selected TraceIDs may then be grouped into two groups in block 306: a normal TraceID group, being a group of normal TraceIDs where a normal TraceID has the bottom application method with an expected inclusive time, and an abnormal TraceID group, being a group of abnormal TraceIDs where an abnormal TraceID has the bottom application method with an inclusive time outside of a Tukey’s boxplot range.
  • the TraceIDs with the bottom application method whose inclusive time lies outside the outer fences may be considered to be extreme outliers, where Q1 is the first quartile of the inclusive times of the bottom application method from the selected TraceIDs, Q3 is the third quartile of the inclusive times of the bottom application method from the selected TraceIDs, and IQR is the interquartile range, which is equal to Q3 - Q1.
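The grouping can be sketched with the conventional Tukey fences. The text here does not give the fence multipliers, so the sketch assumes the usual convention: inner fences at Q1 - 1.5·IQR and Q3 + 1.5·IQR mark the boxplot range, and outer fences at Q1 - 3·IQR and Q3 + 3·IQR mark extreme outliers. The quartile method and the sample times are also assumptions.

```python
import statistics


def tukey_fences(values, k):
    """Return (lower, upper) fences at k interquartile ranges from the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def group_traces(inclusive_by_trace):
    """Split TraceIDs by the inclusive time of their bottom application method."""
    values = list(inclusive_by_trace.values())
    inner = tukey_fences(values, 1.5)   # assumed boxplot range
    outer = tukey_fences(values, 3.0)   # assumed outer fences
    normal, abnormal, extreme = [], [], []
    for trace_id, t in inclusive_by_trace.items():
        if inner[0] <= t <= inner[1]:
            normal.append(trace_id)
            continue
        abnormal.append(trace_id)
        if t < outer[0] or t > outer[1]:
            extreme.append(trace_id)
    return normal, abnormal, extreme


# Hypothetical inclusive times (ms) of one bottom application method.
times = {"t1": 10, "t2": 11, "t3": 12, "t4": 13, "t5": 14, "t6": 100}
normal, abnormal, extreme = group_traces(times)
```

With these sample values the quartiles are 11.25 and 13.75, so only `t6` falls outside the fences.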
  • a ratio of an inclusive time of an application method from the abnormal TraceID group to an inclusive time of the same application method in the same call stack from the normal TraceID group may be calculated.
  • the call stacks with the same stack depths or the same first number may be considered to be the same. If the ratio were higher than a threshold value, an indicator for a MethodID associated with the application method used to calculate the ratio would be set to TRUE, otherwise the indicator for the associated MethodID would be set to FALSE indicating that it is normal.
  • a critical MethodID may be identified by selecting the topmost application method where the corresponding call stack deviates significantly from the normal behavior and changes from normal to abnormal.
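The ratio test and the choice of a critical MethodID can be sketched as follows. The text leaves several details open, so this is one possible reading: matching call stacks are compared frame by frame from the bottom, the threshold of 2.0 is arbitrary, and the critical method is taken to be the flagged method closest to the top of the stack. All names and sample values are hypothetical.

```python
def flag_methods(abnormal_stack, normal_stack, threshold=2.0):
    """Compare matching frames; stacks are lists of (method_id, inclusive_ms),
    ordered from the bottom of the call stack to the top."""
    flags = []
    for (m_abn, t_abn), (m_norm, t_norm) in zip(abnormal_stack, normal_stack):
        if m_abn != m_norm or t_norm <= 0:
            continue  # frames do not correspond, or the ratio is undefined
        flags.append((m_abn, t_abn / t_norm > threshold))
    return flags


def critical_method(flags):
    """Return the flagged method closest to the top of the stack, if any."""
    critical = None
    for method_id, is_abnormal in flags:
        if is_abnormal:
            critical = method_id
    return critical


abnormal_stack = [("handle", 900.0), ("query", 850.0), ("parse", 5.0)]
normal_stack = [("handle", 100.0), ("query", 60.0), ("parse", 5.0)]
flags = flag_methods(abnormal_stack, normal_stack)
critical = critical_method(flags)
```

Here `handle` is slow only because it calls `query`, while `parse` is unchanged, so `query` is reported as the critical method.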
  • the process may repeat from block 304 if there is another application method to be evaluated. If there are no more application methods to be processed, then in block 314, critical MethodIDs may be ranked based on a frequency of occurrence in different TraceIDs and a deviation of an inclusive time of a corresponding application method from a median inclusive time of the normal TraceID group.
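The ranking in block 314 can be sketched as a sort on two keys. How the two criteria are combined is not stated, so the sketch assumes that frequency across distinct TraceIDs is compared first and that the largest deviation from the normal-group median breaks ties. Function and variable names are illustrative.

```python
import statistics


def rank_critical(occurrences, normal_inclusive):
    """Rank critical MethodIDs, most suspicious first.

    occurrences: list of (method_id, trace_id, inclusive_ms) for critical methods.
    normal_inclusive: method_id -> inclusive times from the normal TraceID group.
    """
    stats = {}  # method_id -> (set of trace ids, largest deviation seen)
    for method_id, trace_id, t in occurrences:
        median = statistics.median(normal_inclusive[method_id])
        traces, worst = stats.get(method_id, (set(), 0.0))
        traces.add(trace_id)
        stats[method_id] = (traces, max(worst, abs(t - median)))
    return sorted(stats,
                  key=lambda m: (len(stats[m][0]), stats[m][1]),
                  reverse=True)


occurrences = [("query", "t1", 850.0), ("query", "t2", 700.0), ("render", "t3", 400.0)]
normal_inclusive = {"query": [50.0, 60.0, 70.0], "render": [20.0, 30.0]}
ranking = rank_critical(occurrences, normal_inclusive)
```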
  • FIG. 4 illustrates an example process 400 detailing analysis using clustering algorithms performed in block 208 of FIG. 2.
  • an application method based on an occurrence frequency of the application method in a log file of the application may be selected.
  • the profile data associated with the selected application method, such as the associated TraceID and the associated inclusive and exclusive times, may be provided to a hierarchical clustering with mean linkage clustering algorithm (hierarchical clustering algorithm).
  • an outlier may be identified in block 406 based on a small or sparse cluster and a median exclusive time or median inclusive time being higher than a predetermined percentile of an overall population of the selected application methods.
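A small, unoptimized sketch of this step follows. It assumes "mean linkage" means average linkage (the distance between two clusters is the mean pairwise distance between their members), clusters on (inclusive, exclusive) time pairs, and uses arbitrary illustrative values for the number of clusters, the "small cluster" size, and the percentile. Only the inclusive-time median is checked here for brevity.

```python
import math
import statistics


def average_linkage(points, n_clusters):
    """Agglomerative clustering; returns clusters as lists of point indices."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        _, i, j = min((linkage(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        clusters[i] = clusters[i] + clusters[j]   # merge the closest pair
        del clusters[j]
    return clusters


def outlier_clusters(points, n_clusters=2, max_size=2, percentile=90):
    """Small clusters whose median inclusive time exceeds the given percentile."""
    inclusive = [p[0] for p in points]
    cutoff = statistics.quantiles(inclusive, n=100, method="inclusive")[percentile - 1]
    found = []
    for cluster in average_linkage(points, n_clusters):
        median_incl = statistics.median(points[i][0] for i in cluster)
        if len(cluster) <= max_size and median_incl > cutoff:
            found.append(sorted(cluster))
    return found


# Hypothetical (inclusive_ms, exclusive_ms) pairs for one application method.
points = [(10, 5), (11, 5), (12, 6), (10, 4), (11, 6), (12, 5),
          (10, 6), (11, 4), (12, 4), (13, 5), (500, 400)]
outliers = outlier_clusters(points)
```

The pairwise search is cubic in the number of points, which is acceptable for a sketch but not for large profiles.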
  • the outlier may be identified using a machine-learning-based anomaly detection process.
  • the profile data associated with the selected application method may be provided to a Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm, and in block 410, based on results from the DBSCAN algorithm, an outlier having a cluster with a low density may be identified.
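The DBSCAN step can be sketched in a few lines of plain Python. This is a textbook version of the algorithm rather than the patent's implementation; the `eps` and `min_pts` values and the sample points are hypothetical, and points that DBSCAN labels as noise are treated here as the low-density outliers.

```python
import math


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster index (0, 1, ...) or -1 for noise."""
    unvisited, noise = None, -1
    labels = [unvisited] * len(points)

    def neighbors(i):
        # Includes the point itself, as in the usual definition.
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not unvisited:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = noise            # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster      # border point reached from a core point
            if labels[j] is not unvisited:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point, keep expanding
    return labels


# Hypothetical (inclusive_ms, exclusive_ms) pairs for one application method.
points = [(10, 5), (11, 5), (12, 6), (10, 4), (11, 6), (500, 400), (505, 398)]
labels = dbscan(points, eps=3.0, min_pts=3)
low_density = [i for i, label in enumerate(labels) if label == -1]
```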
  • MethodIDs associated with the clusters identified as outliers and associated TraceIDs may be ranked based on a frequency of occurrence of the associated TraceIDs, and based on a Euclidean distance of an inclusive time and an exclusive time of an application method associated with the identified MethodID.
  • the Euclidean distance of the inclusive time and the exclusive time of the application method associated with the identified MethodID may be calculated as:
  • $$\mathrm{distance}(t_{incl}, t_{excl}) = \sqrt{\left(t_{incl} - \mathrm{median}(t_{incl})\right)^2 + \left(t_{excl} - \mathrm{median}(t_{excl})\right)^2}$$
  • where $t_{incl}$ is the inclusive time of the application method associated with the identified MethodID,
  • $\mathrm{median}(t_{incl})$ is a median of inclusive times of the application method in an overall population of the application methods,
  • $t_{excl}$ is the exclusive time of the application method associated with the identified MethodID, and
  • $\mathrm{median}(t_{excl})$ is a median of exclusive times of the application method in an overall population of the application methods.
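As a worked illustration of the four quantities defined above, the sketch below assumes the distance is measured from the point (median(t_incl), median(t_excl)) to the observed (t_incl, t_excl) pair. The function name and sample values are hypothetical.

```python
import math
import statistics


def latency_distance(t_incl, t_excl, all_incl, all_excl):
    """Euclidean distance of one observation from the population medians."""
    return math.hypot(t_incl - statistics.median(all_incl),
                      t_excl - statistics.median(all_excl))


# Population medians are 10 ms (inclusive) and 5 ms (exclusive).
d = latency_distance(13.0, 9.0, [9.0, 10.0, 11.0], [4.0, 5.0, 6.0])
```

With a 3 ms inclusive offset and a 4 ms exclusive offset, the distance is 5 ms.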
  • FIG. 5 illustrates an example block diagram of a system 502 for identifying an application method having an abnormal latency.
  • the system 502 may be, or may reside in, the monitoring computing device 114.
  • the system 502 may comprise one or more processors 504 and memory 506 coupled to the one or more processors 504.
  • the memory 506 may comprise various modules communicatively coupled to each other which are executable by the one or more processors 504.
  • the system 502 may be communicatively coupled to the data center 102 and the user devices 106, 108, 110, and 112 via the network 104.
  • the memory 506 may comprise various modules that are communicatively coupled to each other.
  • the various modules may comprise a profile data module 508, a statistical analysis module 510, a clustering algorithm module 512, and a list generator module 514. As discussed above with reference to FIG.
  • the profile data module 508 may be configured to collect profile data of an application, such as a Java application, running in the data center 102, where the profile data may comprise 1) a TraceID which is a unique identifier of each web request and enables a user to search across all collected logs and trace how a request is passed from one microservice to the next, 2) a ThreadID which is a unique numeric identifier of each application thread running in the data center 102, 3) a MethodID which is a unique identifier for each application method of the application, 4) an inclusive time of each application method, where the inclusive time is a total time interval spent by an application method and one or more called application methods associated with the application method; and 5) an exclusive time of each application method where the exclusive time is a time interval spent only by the application method itself.
  • the profile data module 508 may further be configured to exclude from the profile data, at least one of application methods having incomplete call stacks, application standard libraries, or application methods having inclusive times that are less than a threshold time.
  • the statistical analysis module 510 may be configured to statistically analyze application methods included in the remaining profile data to identify a first set of application methods having abnormal latencies.
  • the statistical analysis module 510 may select an application method at the bottom of each call stack (bottom application method) in order of when the call stack was created, and may select all TraceIDs associated with call stacks having the same bottom application method at the bottom of the call stack.
  • the statistical analysis module 510 may group the selected TraceIDs into two groups: a normal TraceID group, being a group of normal TraceIDs where a normal TraceID has the bottom application method with an expected inclusive time, and an abnormal TraceID group, being a group of abnormal TraceIDs where an abnormal TraceID has the bottom application method with an inclusive time outside of a Tukey’s boxplot range.
  • the TraceIDs with the bottom application method whose inclusive time lies outside the outer fences may be considered to be extreme outliers, where Q1 is the first quartile of the inclusive times of the bottom application method from the selected TraceIDs, Q3 is the third quartile of the inclusive times of the bottom application method from the selected TraceIDs, and IQR is the interquartile range, which is equal to Q3 - Q1.
  • the statistical analysis module 510 may then calculate a ratio of an inclusive time of an application method from the abnormal TraceID group to an inclusive time of the same application method in the same call stack from the normal TraceID group.
  • the call stacks with the same stack depths or the same first number may be considered to be the same. If the ratio were higher than a threshold value, an indicator for a MethodID associated with the application method used to calculate the ratio would be set to TRUE, otherwise the indicator for the associated MethodID would be set to FALSE indicating that it is normal.
  • the statistical analysis module 510 may then identify a critical MethodID by selecting the topmost application method where the corresponding call stack deviates significantly from the normal behavior and changes from normal to abnormal.
  • the statistical analysis module 510 may repeat the process above until all application methods are evaluated, and may rank critical MethodIDs based on a frequency of occurrence in different TraceIDs and a deviation of an inclusive time of a corresponding application method from a median inclusive time of the normal TraceID group.
  • the clustering algorithm module 512 may select an application method based on an occurrence frequency of the application method in a log file of the application.
  • the profile data associated with the selected application method, such as the associated TraceID and the associated inclusive and exclusive times, are provided to the hierarchical clustering with mean linkage clustering algorithm (hierarchical clustering algorithm) of the clustering algorithm module 512.
  • the clustering algorithm module 512 may then identify an outlier based on a small or sparse cluster and a median exclusive time or median inclusive time being higher than a predetermined percentile of an overall population of the selected application methods.
  • the outlier may be identified using a machine-learning-based anomaly detection process.
  • the profile data associated with the selected application method may also be provided to a Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm of the clustering algorithm module 512. Based on results from the DBSCAN algorithm, the clustering algorithm module 512 may identify an outlier having a cluster with a low density.
  • the clustering algorithm module 512 may repeat the process above until all application methods are evaluated, and may rank MethodIDs associated with the clusters identified as outliers and associated TraceIDs based on a frequency of occurrence of the associated TraceIDs, and based on a Euclidean distance of an inclusive time and an exclusive time of an application method associated with the identified MethodID.
  • the Euclidean distance of the inclusive time and the exclusive time of the application method associated with the identified MethodID may be calculated as:
  • $$\mathrm{distance}(t_{incl}, t_{excl}) = \sqrt{\left(t_{incl} - \mathrm{median}(t_{incl})\right)^2 + \left(t_{excl} - \mathrm{median}(t_{excl})\right)^2}$$
  • where $t_{incl}$ is the inclusive time of the application method associated with the identified MethodID,
  • $\mathrm{median}(t_{incl})$ is a median of inclusive times of the application method in an overall population of the application methods,
  • $t_{excl}$ is the exclusive time of the application method associated with the identified MethodID, and
  • $\mathrm{median}(t_{excl})$ is a median of exclusive times of the application method in an overall population of the application methods.
  • Computer-readable instructions include routines, applications, application modules, program modules, programs, components, data structures, algorithms, and the like.
  • Computer-readable instructions can be implemented on various system configurations, including single-processor or multiprocessor systems, minicomputers, mainframe computers, personal computers, hand-held computing devices, microprocessor-based, programmable consumer electronics, combinations thereof, and the like.
  • the computer-readable storage media may include volatile memory (such as random access memory (RAM)) and/or non-volatile memory (such as read-only memory (ROM), flash memory, etc.).
  • the computer-readable storage media may also include additional removable storage and/or non-removable storage including, but not limited to, flash memory, magnetic storage, optical storage, and/or tape storage that may provide non-volatile storage of computer-readable instructions, data structures, program modules, and the like.
  • a non-transient computer-readable storage medium is an example of computer-readable media.
  • Computer-readable media includes at least two types of computer-readable media, namely computer-readable storage media and communications media.
  • Computer-readable storage media includes volatile and non-volatile, removable and non-removable media implemented in any process or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data.
  • Computer-readable storage media includes, but is not limited to, phase change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disk read-only memory (CD-ROM), digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information for access by a computing device.
  • communication media may embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave, or other transmission mechanism. As defined herein, computer-readable storage media do not include communication media.
  • the computer-readable instructions stored on one or more non-transitory computer-readable storage media that, when executed by one or more processors, may perform operations described above with reference to FIGs. 2-5.
  • computer-readable instructions include routines, programs, objects, components, data structures, and the like that perform particular functions or implement particular abstract data types.
  • the order in which the operations are described is not intended to be construed as a limitation, and any number of the described operations can be combined in any order and/or in parallel to implement the processes.
  • a method for identifying an application method, having an abnormal latency, of an application running in a data center comprising: collecting profile data of the application; excluding a predetermined set of data from the profile data; statistically analyzing application methods included in remaining profile data to identify a first set of application methods having abnormal latencies; analyzing the application methods included in the remaining profile data using clustering algorithms to identify a second set of application methods; and generating a list of application methods having abnormal latencies based on at least one of the first set of application methods or the second set of application methods.
  • a method as paragraph A recites, wherein the application method is a Java application method and the application running in the data center is a Java application.
  • a method as paragraph A recites, wherein the data center comprises one or more computing devices, one or more servers, or a cloud computing center.
  • the profile data comprises: a TraceID being a unique identifier of each web request; a ThreadID being a unique numeric identifier of each thread running in the data center; a MethodID being a unique identifier for each application method; an inclusive time of each application method, the inclusive time being a total time interval spent by an application method and one or more called application methods associated with the application method; and an exclusive time of each application method, the exclusive time being a time interval spent only by each application method.
  • a method as paragraph D recites, wherein the predetermined set of data excluded from the profile data comprises at least one of: application methods having incomplete call stacks; application standard libraries; or application methods having inclusive times that are less than a threshold time.
  • a method as paragraph E recites, wherein statistically analyzing the application methods included in the remaining profile data to identify the first set of application methods having abnormal latencies comprises: selecting a bottom application method at a bottom of each call stack in order of when a call stack is created; selecting all TraceIDs associated with call stacks having the bottom application method at a bottom of a call stack; grouping the selected TraceIDs into a normal TraceID group and an abnormal TraceID group, the normal TraceID group being a group of normal TraceIds where a normal TraceID has the bottom application method with an expected inclusive time, and the abnormal TraceID group being a group of abnormal TraceIds where an abnormal Trace ID has the bottom application method with an inclusive time outside of a Tukey’s boxplot range; for all application methods in the remaining profile data: calculating a ratio of an inclusive time of an application method from the abnormal TraceID group and an inclusive time of the application method in a same call stack from the normal TraceID group, and identifying a critical Method
  • analyzing the application methods included in the remaining profile data using clustering algorithms to identify a second set of application methods comprises: selecting an application method based on an occurrence frequency of the application method in a log file of the application; providing profile data associated with the selected application method to a hierarchical clustering with mean linkage clustering algorithm (hierarchical clustering algorithm) ; based on results from the hierarchical clustering algorithm, identifying, as an outlier, a cluster having a low number of records and a median exclusive time or median inclusive time being higher than a predetermined percentile of an overall population of the selected application methods; providing profile data associated the selected application method to a Density Based Clustering of Applications with Noise (DBSCAN) algorithm; based on results from the DBSCAN algorithm, identifying, as an outlier, a cluster with a low density; and ranking MethodIDs associated with the clusters identified as outliers and associated TraceIDs based on a frequency of occurrence of the associated TraceIds and an Euclidian
  • a method as paragraph G recites, wherein the Euclidean distance of the inclusive time and the exclusive time of the application method associated with the identified MethodID is: $$\mathrm{distance}(t_{incl}, t_{excl}) = \sqrt{\left(t_{incl} - \mathrm{median}(t_{incl})\right)^2 + \left(t_{excl} - \mathrm{median}(t_{excl})\right)^2}$$ where $\mathrm{distance}(t_{incl}, t_{excl})$ is the Euclidean distance, $t_{incl}$ is the inclusive time of the application method associated with the identified MethodID, $\mathrm{median}(t_{incl})$ is a median of inclusive times of the application method in an overall population of the application methods, $t_{excl}$ is the exclusive time of the application method associated with the identified MethodID, and $\mathrm{median}(t_{excl})$ is a median of exclusive times of the application method in an overall population of the application methods.
  • One or more non-transitory computer-readable storage media storing computer-readable instructions executable by one or more processors, that when executed by the one or more processors, cause the one or more processors to perform operations comprising: collecting profile data of an application running in a data center; excluding a predetermined set of data from the profile data; statistically analyzing application methods included in remaining profile data to identify a first set of application methods having abnormal latencies; analyzing the application methods included in the remaining profile data using clustering algorithms to identify a second set of application methods; and generating a list of application methods having abnormal latencies based on at least one of the first set of application methods or the second set of application methods.
  • One or more non-transitory computer-readable storage media as paragraph I recites, wherein the profile data comprises: a TraceID being a unique identifier of each web request; a ThreadID being a unique numeric identifier of each thread running in the data center; a MethodID being a unique identifier for each application method; an inclusive time of each application method, the inclusive time being a total time interval spent by an application method and one or more called application methods associated with the application method; and an exclusive time of each application method, the exclusive time being a time interval spent only by each application method.
  • the predetermined set of data excluded from the profile data comprises at least one of: application methods having incomplete call stacks; application standard libraries; or application methods having inclusive times that are less than a threshold time.
  • analyzing the application methods included in the remaining profile data using clustering algorithms to identify a second set of application methods comprises: selecting an application method based on an occurrence frequency of the application method in a log file of the application; providing profile data associated with the selected application method to a hierarchical clustering with mean linkage clustering algorithm (hierarchical clustering algorithm); based on results from the hierarchical clustering algorithm, identifying, as an outlier, a cluster having a low number of records and a median exclusive time or median inclusive time being higher than a predetermined percentile of an overall population of the application methods; providing profile data associated with the selected application method to a Density Based Clustering of Applications with Noise (DBSCAN) algorithm; based on results from the DBSCAN algorithm, identifying, as an outlier, a cluster with a low density; and ranking MethodIDs associated with the clusters identified as outliers and associated TraceIDs based on a frequency of occurrence of the associated
  • DBSCAN Density Based Clustering of Applications with Noise
  • the Euclidean distance of the inclusive time and the exclusive time of the application method associated with the identified MethodID is: wherein: distance (t incl , t excl ) is the Euclidean distance, t incl is the inclusive time of the application method associated with the identified MethodID, median (t incl ) is a median of inclusive times of the application method in an overall population of the application methods, t excl is the exclusive time of the application method associated with the identified MethodID, and median (t excl ) is a median of exclusive times of the application method in an overall population of the application methods.
  • a system for identifying an application method having an abnormal latency comprising: one or more processors; and memory coupled to the one or more processors, the memory including modules communicatively coupled to each other and executable by the one or more processors, the modules comprising: a profile data module configured to collect profile data of an application running in a data center and to exclude, from the profile data, at least one of: application methods having incomplete call stacks, application standard libraries, or application methods having inclusive times that are less than a threshold time; a statistical analysis module configured to statistically analyze application methods included in remaining profile data to identify a first set of application methods having abnormal latencies; a clustering algorithm module configured to analyze the application methods included in the remaining profile data using clustering algorithms to identify a second set of application methods; and a list generator module configured to generate a list of application methods having abnormal latencies based on at least one of the first set of application methods or the second set of application methods.
  • the statistical analysis module is further configured to: select a bottom application method at a bottom of each call stack in order of when a call stack is created; select all TraceIDs associated with call stacks having the bottom application method at a bottom of a call stack; group the selected TraceIDs into a normal TraceID group and an abnormal TraceID group, the normal TraceID group being a group of normal TraceIds where a normal TraceID has the bottom application method with an expected inclusive time, and the abnormal TraceID group being a group of abnormal TraceIds where an abnormal Trace ID has the bottom application method with an inclusive time outside of a Tukey’s boxplot range; for all application methods in the remaining profile data: calculate a ratio of an inclusive time of an application method from the abnormal TraceID group and an inclusive time of the application method in a same call stack from the normal TraceID group, and identify a critical MethodID by selecting a top most application method where a corresponding call stack is changing from normal to abnormal
  • the clustering algorithm module is further configured to: select an application method based on an occurrence frequency of the application method in a log file of the application; provide profile data associated with the selected application method to a hierarchical clustering with mean linkage clustering algorithm (hierarchical clustering algorithm); based on results from the hierarchical clustering algorithm, identify, as an outlier, a cluster having a low number of records and a median exclusive time or median inclusive time being higher than a predetermined percentile of an overall population of the application methods; provide profile data associated with the selected application method to a Density Based Clustering of Applications with Noise (DBSCAN) algorithm; based on results from the DBSCAN algorithm, identify, as an outlier, a cluster with a low density; and rank MethodIDs associated with the clusters identified as outliers and associated TraceIDs based on a frequency of occurrence of the associated TraceIDs and a Euclidean distance of an inclusive time and an exclusive time of an application method associated with the identified Method
  • T A system as paragraph S recites, wherein the Euclidean distance of the inclusive time and the exclusive time of the application method associated with the identified MethodID is: wherein: distance (t incl , t excl ) is the Euclidean distance, t incl is the inclusive time of the application method associated with the identified MethodID, median (t incl ) is a median of inclusive times of the application method in an overall population of the application methods, t excl is the exclusive time of the application method associated with the identified MethodID, and median (t excl ) is a median of exclusive times of the application method in an overall population of the application methods.

Abstract

Systems and methods provided herein are directed to automatically detecting latency anomalies in a large amount of application method latency data by a combination of statistical and clustering analyses and machine learning methods. Profile data of an application running in a computing device is collected and a predetermined set of data is excluded from the profile data. Application methods included in remaining profile data are analyzed statistically to identify a first set of application methods having abnormal latencies, and analyzed using clustering algorithms to identify a second set of application methods. A list of application methods having abnormal latencies based on at least one of the first set of application methods or the second set of application methods is then generated.

Description

JAVA SOFTWARE LATENCY ANOMALY DETECTION
BACKGROUND
As the use of computers and the software running on them continues to grow in everyday transactions, any latencies in such transactions have a range of consequences. For small-scale transactions, such as personal online shopping, the associated consequences may be limited to the individual user being annoyed by a lack of response. For large-scale transactions, such as commercial, industrial, or security transactions, the associated consequences may include a critical commercial or security transaction failing to complete. Additionally, many interacting devices are often updated or replaced with newer software and/or hardware, and unintended incompatibility among the devices may introduce latencies.
Data center software developers and operators may make use of software method latencies to identify sources of poor response times, where each software method, or application method, is a set of code which is referred to by its own name and can be called or invoked at any point in a program by utilizing the name of the software method. While commercially available tools, such as the Java Development Kit (JDK), can collect application method latencies, analysis of a large amount of such data may be difficult.
A Java method tracing capability implemented in a customized JDK product may provide diagnostic and profiling options to track the performance of Java applications via instrumentation. To minimize the overhead incurred by instrumentation, instructions may be added at the entry and exit of a compiled Java method by modifying the interpreter and compilers in a Java Virtual Machine (JVM), which allows users to track the entry time and exit time of an execution of the Java application method. The Java method tracing capability may also provide application programming interfaces (APIs) for profiling a system to specify which thread could be traced, and allow enabling and disabling of this tracing ability on the fly on a per-thread basis. The Java method tracing capability may allow the user to get comprehensive information regarding a code flow in a Java application.
However, there are challenges, or additional desired capabilities, regarding Java profiling, such as 1) how to automatically detect anomalous performance behaviors, 2) how to automatically identify a method with abnormally high latency, 3) how to identify workload or incoming web requests affected by a long-latency behavior in a responsive and timely manner, and 4) how to prioritize identified issues.
BRIEF DESCRIPTION OF THE DRAWINGS
The detailed description is set forth with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference numbers in different figures indicates similar or identical items or features.
FIG. 1 illustrates an example environment in which a software latency anomaly may be detected.
FIG. 2 illustrates an example flow diagram for identifying an application method having a software latency anomaly.
FIG. 3 illustrates an example process detailing one of the blocks of FIG. 2.
FIG. 4 illustrates an example process detailing one of the blocks of FIG. 2.
FIG. 5 illustrates an example block diagram of a system for identifying an application method having an abnormal latency.
DETAILED DESCRIPTION
Systems and methods discussed herein are directed to automatically detecting latency anomalies in a large amount of application method latency data by a combination of statistical and clustering analyses and machine learning methods.
The systems and methods discussed herein allow users to automatically identify perceptible performance inefficiency by focusing on the latency of an application method without stepping through the code, which may be used by companies running large data centers and using Java as their primary development language.
FIG. 1 illustrates an example environment 100 in which a software latency anomaly in an application may be detected. Software latency may be interchangeably referred to as application latency, and a software method may be interchangeably referred to as an application method. A data center 102 (which may be a single computing device or a cloud computing center comprising a plurality of computing devices and/or servers) is communicatively coupled, via a network 104, to a plurality of user devices, of which four user devices 106, 108, 110, and 112 are shown. The data center 102 and the plurality of user devices 106, 108, 110, and 112 are also communicatively coupled to a monitoring computing device 114 via the network 104. The monitoring computing device 114 is configured to monitor the application running in the data center 102 for abnormal latencies of application methods in the application. The application running in the data center 102 that is monitored by the monitoring computing device 114 may be a Java application.
FIG. 2 illustrates an example flow diagram 200 for identifying an application method having a software latency anomaly.
In block 202, profile data of the application, such as a Java application running in the data center 102, may be collected. The profile data may comprise various parameters associated with the application, such as, but not limited to, 1) a TraceID, which is a unique identifier of each web request and enables a user to search across all collected logs and trace how a request is passed from one service to the next, 2) a ThreadID, which is a unique numeric identifier of each application thread running in the data center 102, 3) a MethodID, which is a unique identifier for each application method of the application, 4) an inclusive time of each application method, where the inclusive time is a total time interval spent by an application method and one or more called application methods associated with that application method, and 5) an exclusive time of each application method, where the exclusive time is a time interval spent only by the application method itself.
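The profile data fields listed above may be pictured as one record per profiled method invocation. The following is a minimal sketch in Python; the class name `ProfileRecord`, the field names, the millisecond unit, and the `depth` field are illustrative assumptions and do not appear in the source. The helper assumes single-threaded, synchronous calls, in which case the exclusive time equals the inclusive time minus the inclusive times of the directly called methods.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ProfileRecord:
    """One profiled method invocation (hypothetical field names)."""
    trace_id: str        # unique identifier of the web request
    thread_id: int       # numeric identifier of the application thread
    method_id: str       # unique identifier of the application method
    depth: int           # assumed: position in the call stack, 0 = bottom
    inclusive_ms: float  # time in the method plus everything it calls
    exclusive_ms: float  # time spent only in the method itself


def exclusive_time(inclusive_ms: float, callee_inclusive_ms: Sequence[float]) -> float:
    """Exclusive time under the assumption of synchronous direct callees."""
    return inclusive_ms - sum(callee_inclusive_ms)


# A method that took 120 ms in total and called two methods taking 70 ms and 30 ms.
record = ProfileRecord("trace-1", 7, "OrderService.submit", 0,
                       120.0, exclusive_time(120.0, [70.0, 30.0]))
```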
In block 204, a predetermined set of data may be excluded from the profile data before further evaluation. The predetermined set of data to be excluded may comprise 1) application methods having incomplete call stacks, which are known to occur very rarely, 2) application standard libraries, which are commonly available and used and are not unique to the application running, and 3) application methods having inclusive times that are less than a threshold time, which are considered to be behaving normally.
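The exclusion step of block 204 may be sketched as a simple filter. In this sketch the record layout (dictionaries with `method`, `complete_stack`, and `inclusive_ms` keys), the package prefixes used to recognize standard libraries, and the 1 ms default threshold are all assumptions chosen for illustration; the source does not specify them.

```python
from typing import Dict, Iterable, List

# Hypothetical prefixes standing in for "application standard libraries".
STANDARD_LIBRARY_PREFIXES = ("java.", "javax.", "sun.")


def filter_profile(records: Iterable[Dict], min_inclusive_ms: float = 1.0) -> List[Dict]:
    """Return the remaining profile data after the three exclusions."""
    kept = []
    for rec in records:
        if not rec["complete_stack"]:
            continue  # 1) incomplete call stack
        if rec["method"].startswith(STANDARD_LIBRARY_PREFIXES):
            continue  # 2) standard library method
        if rec["inclusive_ms"] < min_inclusive_ms:
            continue  # 3) inclusive time below the threshold
        kept.append(rec)
    return kept


sample = [
    {"method": "com.shop.Cart.total", "complete_stack": True, "inclusive_ms": 42.0},
    {"method": "java.util.HashMap.get", "complete_stack": True, "inclusive_ms": 5.0},
    {"method": "com.shop.Cart.add", "complete_stack": False, "inclusive_ms": 9.0},
    {"method": "com.shop.Cart.size", "complete_stack": True, "inclusive_ms": 0.2},
]
remaining = filter_profile(sample)
```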
In block 206, the application methods included in the remaining profile data may be statistically analyzed to identify a first set of application methods having abnormal latencies, and in block 208, the application methods included in the remaining profile data may be analyzed using clustering algorithms to identify a second set of application methods. A list of application methods having abnormal latencies may then be generated in block 210 based on at least one of the first set of application methods or the second set of application methods.
FIG. 3 illustrates an example process 300 detailing the statistical analysis performed in block 206 of FIG. 2.
In block 302, an application method at the bottom of each call stack (bottom application method) may be selected in order of when the call stack was created, and in block 304, all TraceIDs associated with call stacks having the same bottom application method at the bottom of the call stack may be selected. The selected TraceIDs may then be grouped into two groups in block 306: a normal TraceID group, being a group of normal TraceIDs where a normal TraceID has the bottom application method with an expected inclusive time, and an abnormal TraceID group, being a group of abnormal TraceIDs where an abnormal TraceID has the bottom application method with an inclusive time outside of a Tukey’s boxplot range. The TraceIDs with the bottom application method whose inclusive time lies outside the outer fences (Q1 - 3 IQR, Q3 + 3 IQR) may be considered to be extreme outliers, where Q1 is the first quartile of the inclusive times of the bottom application method from the selected TraceIDs, Q3 is the third quartile of the inclusive times of the bottom application method from the selected TraceIDs, and IQR is the interquartile range, which is equal to Q3 - Q1.
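The grouping of block 306 may be illustrated with a short Python sketch of the outer fences described above. The function names and the mapping from TraceID to the inclusive time of the bottom application method are hypothetical, and the quartile convention (the "inclusive" method of `statistics.quantiles`) is an assumption, since the source does not state how the quartiles are computed.

```python
from statistics import quantiles
from typing import Dict, List, Sequence, Tuple


def outer_fences(times: Sequence[float], k: float = 3.0) -> Tuple[float, float]:
    """Return (Q1 - k*IQR, Q3 + k*IQR); k = 3 gives Tukey's outer fences."""
    q1, _, q3 = quantiles(times, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def split_traces(bottom_times: Dict[str, float],
                 k: float = 3.0) -> Tuple[List[str], List[str]]:
    """Split TraceIDs into (normal, abnormal) by the bottom method's inclusive time."""
    low, high = outer_fences(list(bottom_times.values()), k)
    normal, abnormal = [], []
    for trace_id, t in bottom_times.items():
        (abnormal if t < low or t > high else normal).append(trace_id)
    return normal, abnormal


# Inclusive times (ms) of the same bottom method across nine requests.
times = {"t1": 10, "t2": 11, "t3": 12, "t4": 13, "t5": 14,
         "t6": 15, "t7": 16, "t8": 17, "t9": 200}
normal_ids, abnormal_ids = split_traces(times)
```

With this sample, Q1 is 12 and Q3 is 16, so the outer fences are 0 and 28, and only the 200 ms request falls into the abnormal group.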
In block 308, a ratio of an inclusive time of an application method from the abnormal TraceID group to an inclusive time of the same application method in the same call stack from the normal TraceID group may be calculated. The call stacks with the same stack depths or the same first number may be considered to be the same. If the ratio were higher than a threshold value, an indicator for a MethodID associated with the application method used to calculate the ratio would be set to TRUE, otherwise the indicator for the associated MethodID would be set to FALSE indicating that it is normal.
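The ratio test of block 308 may be sketched as follows. The function name, the per-method dictionaries of inclusive times, and the default threshold of 2.0 are assumptions for illustration; the source only states that a threshold value is used. Methods without a usable baseline in the normal group are skipped in this sketch.

```python
from typing import Dict


def flag_methods(abnormal_ms: Dict[str, float],
                 normal_ms: Dict[str, float],
                 threshold: float = 2.0) -> Dict[str, bool]:
    """Map MethodID -> TRUE/FALSE indicator based on the abnormal/normal ratio."""
    flags = {}
    for method_id, t_abnormal in abnormal_ms.items():
        t_normal = normal_ms.get(method_id)
        if t_normal is None or t_normal <= 0:
            continue  # no baseline in the normal TraceID group
        flags[method_id] = (t_abnormal / t_normal) > threshold
    return flags


flags = flag_methods({"A": 90.0, "B": 12.0, "C": 5.0},
                     {"A": 30.0, "B": 10.0})
```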
In block 310, a critical MethodID may be identified by selecting a topmost application method where a corresponding call stack is deviating significantly from the normal behavior and changing from normal to abnormal. In block 312, the process may repeat from block 304 if there were another application method to be evaluated. If there were no more application methods to be processed, then in block 314, critical MethodIDs may be ranked based on a frequency of occurrence in different TraceIDs and a deviation of an inclusive time of a corresponding application method from a median inclusive time of the normal TraceID group.
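Blocks 310 and 314 may be read in more than one way; the sketch below shows one possible interpretation and is not the source's definitive procedure. It assumes that a call stack is given bottom to top as pairs of MethodID and the TRUE/FALSE indicator, that the critical method is the topmost flagged entry, and that ranking sorts first by the number of distinct TraceIDs and then by the deviation from the normal median. All names are hypothetical.

```python
from statistics import median
from typing import Dict, List, Optional, Sequence, Tuple


def critical_method(stack: Sequence[Tuple[str, bool]]) -> Optional[str]:
    """Return the topmost flagged MethodID of a stack ordered bottom -> top."""
    for method_id, is_abnormal in reversed(stack):
        if is_abnormal:
            return method_id
    return None


def rank_critical(observations: Dict[str, Dict[str, float]],
                  normal_median_ms: Dict[str, float]) -> List[str]:
    """Rank MethodIDs by TraceID count, then by deviation from the normal median.

    observations maps MethodID -> {TraceID: inclusive_ms} over abnormal traces.
    """
    scored = []
    for method_id, per_trace in observations.items():
        frequency = len(per_trace)
        deviation = median(per_trace.values()) - normal_median_ms[method_id]
        scored.append((frequency, deviation, method_id))
    scored.sort(reverse=True)
    return [method_id for _, _, method_id in scored]


top = critical_method([("entry", True), ("service", True),
                       ("dao", True), ("format", False)])
ranking = rank_critical({"A": {"t1": 50.0, "t2": 70.0}, "B": {"t1": 500.0}},
                        {"A": 10.0, "B": 20.0})
```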
FIG. 4 illustrates an example process 400 detailing analysis using clustering algorithms performed in block 208 of FIG. 2.
In block 402, an application method may be selected based on an occurrence frequency of the application method in a log file of the application. In block 404, the profile data associated with the selected application method, such as the associated TraceID and the associated inclusive and exclusive times, may be provided to a hierarchical clustering with mean linkage clustering algorithm (hierarchical clustering algorithm). Based on results from the hierarchical clustering algorithm in block 404, an outlier may be identified in block 406 based on a small or sparse cluster and a median exclusive time or median inclusive time being higher than a predetermined percentile of an overall population of the selected application methods. The outlier may be identified using a machine-learning-based anomaly detection process.
In block 408, the profile data associated with the selected application method may be provided to a Density Based Clustering of Applications with Noise (DBSCAN) algorithm, and in block 410, based on results from the DBSCAN algorithm, an outlier having a cluster with a low density may be identified.
In block 412, the process repeats from block 402 if there were another application method to be evaluated. If there were no more application methods to be processed, then in block 414, MethodIDs associated with the clusters identified as outliers and associated TraceIDs may be ranked based on a frequency of occurrence of the associated TraceIDs, and based on a Euclidean distance of an inclusive time and an exclusive time of an application method associated with the identified MethodID.
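Blocks 404 and 406 may be illustrated with a naive agglomerative clustering that uses mean (average) linkage. This is a sketch for small samples only: the distance cut at which merging stops, the maximum size of a "small" cluster, the percentile, and all function names are assumptions, and a production system would more likely rely on a library implementation.

```python
import math
from itertools import combinations
from statistics import median, quantiles
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (inclusive_ms, exclusive_ms)


def mean_linkage(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Average pairwise Euclidean distance between two clusters."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))


def agglomerate(points: Sequence[Point], cut: float) -> List[List[Point]]:
    """Merge the closest pair of clusters until the closest pair exceeds cut."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: mean_linkage(clusters[ij[0]], clusters[ij[1]]))
        if mean_linkage(clusters[i], clusters[j]) > cut:
            break
        clusters[i].extend(clusters[j])
        del clusters[j]  # safe because i < j
    return clusters


def outlier_clusters(clusters: Sequence[Sequence[Point]],
                     max_size: int, percentile: int) -> List[Sequence[Point]]:
    """Small clusters whose median time exceeds the population percentile."""
    incl = [p[0] for c in clusters for p in c]
    excl = [p[1] for c in clusters for p in c]
    incl_cut = quantiles(incl, n=100, method="inclusive")[percentile - 1]
    excl_cut = quantiles(excl, n=100, method="inclusive")[percentile - 1]
    return [c for c in clusters
            if len(c) <= max_size
            and (median(p[0] for p in c) > incl_cut
                 or median(p[1] for p in c) > excl_cut)]


points = [(10.0 + i, 5.0 + 0.5 * i) for i in range(8)]
points += [(300.0, 250.0), (310.0, 255.0)]
clusters = agglomerate(points, cut=50.0)
outliers = outlier_clusters(clusters, max_size=3, percentile=90)
```

In this sample the eight fast invocations merge into one cluster and the two slow invocations form a second, small cluster whose median times exceed the 90th percentile of the population, so only the small cluster is reported.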
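Blocks 408 and 410 may be sketched with a compact, pure-Python version of the DBSCAN procedure. The parameter values (`eps`, `min_pts`) and the decision to treat points labeled as noise as the low-density outliers are assumptions; the source does not say how low density is measured. The neighborhood search here is quadratic and intended only for small samples.

```python
import math
from typing import List, Sequence, Tuple

NOISE = -1
Point = Tuple[float, float]  # (inclusive_ms, exclusive_ms)


def dbscan(points: Sequence[Point], eps: float, min_pts: int) -> List[int]:
    """Return one label per point: a cluster index (0, 1, ...) or NOISE."""
    labels: List = [None] * len(points)

    def neighbors(i: int) -> List[int]:
        # Indices within eps of point i, including i itself.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


samples = [(10.0, 5.0), (11.0, 5.0), (12.0, 6.0), (11.0, 6.0),
           (10.0, 6.0), (12.0, 5.0), (300.0, 250.0)]
labels = dbscan(samples, eps=5.0, min_pts=3)
low_density = [p for p, label in zip(samples, labels) if label == NOISE]
```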
The Euclidean distance of the inclusive time and the exclusive time of the application method associated with the identified MethodID may be calculated as:
$$\mathrm{distance}(t_{incl}, t_{excl}) = \sqrt{\left(t_{incl} - \mathrm{median}(t_{incl})\right)^2 + \left(t_{excl} - \mathrm{median}(t_{excl})\right)^2}$$
where:
$\mathrm{distance}(t_{incl}, t_{excl})$ is the Euclidean distance,
$t_{incl}$ is the inclusive time of the application method associated with the identified MethodID,
$\mathrm{median}(t_{incl})$ is a median of inclusive times of the application method in an overall population of the application methods,
$t_{excl}$ is the exclusive time of the application method associated with the identified MethodID, and
$\mathrm{median}(t_{excl})$ is a median of exclusive times of the application method in an overall population of the application methods.
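The distance used for ranking may be computed as in the following sketch, which assumes that the Euclidean distance is taken between the point formed by the inclusive and exclusive times of the method and the point formed by the two population medians. The function name and argument layout are hypothetical.

```python
import math
from statistics import median
from typing import Sequence


def latency_distance(t_incl: float, t_excl: float,
                     population_incl: Sequence[float],
                     population_excl: Sequence[float]) -> float:
    """Euclidean distance of (t_incl, t_excl) from the population medians."""
    return math.hypot(t_incl - median(population_incl),
                      t_excl - median(population_excl))


# Medians are 10 and 5, so the offsets are 3 and 4.
d = latency_distance(13.0, 9.0, [8.0, 10.0, 12.0], [4.0, 5.0, 6.0])
```

A larger distance indicates an invocation whose inclusive and exclusive times are jointly further from typical behavior, which is why it may serve as a ranking key alongside the TraceID frequency.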
FIG. 5 illustrates an example block diagram of a system 502 for identifying an application method having an abnormal latency. The system 502 may be, or may reside in, the monitoring computing device 114. The system 502 may comprise one or more processors 504 and memory 506 coupled to the one or more processors 504. The memory 506 may comprise various modules communicatively coupled to each other which are executable by the one or more processors 504. As illustrated in FIG. 1, the system 502 may be communicatively coupled to the data center 102 and the user devices 106, 108, 110, and 112 via the network 104.
The memory 506 may comprise various modules that are communicatively coupled to each other. The various modules may comprise a profile data module 508, a statistical analysis module 510, a clustering algorithm module 512, and a list generator module 514. As discussed above with reference to FIG. 2, the profile data module 508 may be configured to collect profile data of an application, such as a Java application, running in the data center 102, where the profile data may comprise 1) a TraceID, which is a unique identifier of each web request and enables a user to search across all collected logs and trace how a request is passed from one microservice to the next, 2) a ThreadID, which is a unique numeric identifier of each application thread running in the data center 102, 3) a MethodID, which is a unique identifier for each application method of the application, 4) an inclusive time of each application method, where the inclusive time is a total time interval spent by an application method and one or more called application methods associated with the application method, and 5) an exclusive time of each application method, where the exclusive time is a time interval spent only by the application method itself. The profile data module 508 may further be configured to exclude, from the profile data, at least one of application methods having incomplete call stacks, application standard libraries, or application methods having inclusive times that are less than a threshold time.
As discussed above with reference to FIG. 3, the statistical analysis module 510 may be configured to statistically analyze application methods included in remaining profile data to identify a first set of application methods having abnormal latencies. The statistical analysis module 510 may select an application method at the bottom of each call stack (bottom application method) in order of when the call stack was created, and may select all TraceIDs associated with call stacks having the same bottom application method at the bottom of the call stack. The statistical analysis module 510 may group the selected TraceIDs into two groups: a normal TraceID group, being a group of normal TraceIDs where a normal TraceID has the bottom application method with an expected inclusive time, and an abnormal TraceID group, being a group of abnormal TraceIDs where an abnormal TraceID has the bottom application method with an inclusive time outside of a Tukey’s boxplot range. The TraceIDs with the bottom application method whose inclusive time lies outside the outer fences (Q1 - 3 IQR, Q3 + 3 IQR) may be considered to be extreme outliers, where Q1 is the first quartile of the inclusive times of the bottom application method from the selected TraceIDs, Q3 is the third quartile of the inclusive times of the bottom application method from the selected TraceIDs, and IQR is the interquartile range, which is equal to Q3 - Q1.
The statistical analysis module 510 may then calculate a ratio of an inclusive time of an application method from the abnormal TraceID group to an inclusive time of the same application method in the same call stack from the normal TraceID group. The call stacks with the same stack depths or the same first number may be considered to be the same. If the ratio were higher than a threshold value, an indicator for a MethodID associated with the application method used to calculate the ratio would be set to TRUE; otherwise, the indicator for the associated MethodID would be set to FALSE, indicating that it is normal.
The statistical analysis module 510 may then identify a critical MethodID by selecting a topmost application method where a corresponding call stack is deviating significantly from the normal behavior and changing from normal to abnormal. The statistical analysis module 510 may repeat the process above until all application methods are evaluated, and may rank critical MethodIDs based on a frequency of occurrence in different TraceIDs and a deviation of an inclusive time of a corresponding application method from a median inclusive time of the normal TraceID group.
As discussed above with reference to FIG. 4, the clustering algorithm module 512 may select an application method based on an occurrence frequency of the application method in a log file of the application. The profile data associated with the selected application method, such as the associated TraceID and the associated inclusive and exclusive times, may be provided to the hierarchical clustering with mean linkage clustering algorithm (hierarchical clustering algorithm) of the clustering algorithm module 512. The clustering algorithm module 512 may then identify an outlier based on a small or sparse cluster and a median exclusive time or median inclusive time being higher than a predetermined percentile of an overall population of the selected application methods. The outlier may be identified using a machine-learning-based anomaly detection process.
The profile data associated with the selected application method may also be provided to a Density Based Clustering of Applications with Noise (DBSCAN) algorithm of the clustering algorithm module 512. Based on results from the DBSCAN algorithm, the clustering algorithm module 512 may identify an outlier having a cluster with a low density.
The clustering algorithm module 512 may repeat the process above until all application methods are evaluated, and may rank MethodIDs associated with the clusters identified as outliers and associated TraceIDs based on a frequency of occurrence of the associated TraceIDs, and based on a Euclidean distance of an inclusive time and an exclusive time of an application method associated with the identified MethodID.
The Euclidean distance of the inclusive time and the exclusive time of the application method associated with the identified MethodID may be calculated as:
$$\mathrm{distance}(t_{incl}, t_{excl}) = \sqrt{\left(t_{incl} - \mathrm{median}(t_{incl})\right)^2 + \left(t_{excl} - \mathrm{median}(t_{excl})\right)^2}$$
where:
$\mathrm{distance}(t_{incl}, t_{excl})$ is the Euclidean distance,
$t_{incl}$ is the inclusive time of the application method associated with the identified MethodID,
$\mathrm{median}(t_{incl})$ is a median of inclusive times of the application method in an overall population of the application methods,
$t_{excl}$ is the exclusive time of the application method associated with the identified MethodID, and
$\mathrm{median}(t_{excl})$ is a median of exclusive times of the application method in an overall population of the application methods.
Some or all operations of the methods described above can be performed by execution of computer-readable instructions stored on a computer-readable storage medium, as defined below. The term “computer-readable instructions,” as used in the description and claims, includes routines, applications, application modules, program modules, programs, components, data structures, algorithms, and the like. Computer-readable instructions can be implemented on various system configurations, including single-processor or multiprocessor systems, minicomputers, mainframe computers, personal computers, hand-held computing devices, microprocessor-based programmable consumer electronics, combinations thereof, and the like.
The computer-readable storage media may include volatile memory (such as random access memory (RAM)) and/or non-volatile memory (such as read-only memory (ROM), flash memory, etc.). The computer-readable storage media may also include additional removable storage and/or non-removable storage including, but not limited to, flash memory, magnetic storage, optical storage, and/or tape storage that may provide non-volatile storage of computer-readable instructions, data structures, program modules, and the like.
A non-transient computer-readable storage medium is an example of computer-readable media. Computer-readable media includes at least two types of computer-readable media, namely computer-readable storage media and communications media. Computer-readable storage media includes volatile and non-volatile, removable and non-removable media implemented in any process or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer-readable storage media includes, but is not limited to, phase change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disk read-only memory (CD-ROM), digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information for access by a computing device. In contrast, communication media may embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave, or other transmission mechanism. As defined herein, computer-readable storage media do not include communication media.
The computer-readable instructions, stored on one or more non-transitory computer-readable storage media, may, when executed by one or more processors, perform the operations described above with reference to FIGs. 2-5. Generally, computer-readable instructions include routines, programs, objects, components, data structures, and the like that perform particular functions or implement particular abstract data types. The order in which the operations are described is not intended to be construed as a limitation, and any number of the described operations can be combined in any order and/or in parallel to implement the processes.
EXAMPLE CLAUSES
A. A method for identifying an application method, having an abnormal latency, of an application running in a data center, the method comprising: collecting profile data of the application; excluding a predetermined set of data from the profile data; statistically analyzing application methods included in remaining profile data to identify a first set of application methods having abnormal latencies; analyzing the application methods included in the  remaining profile data using clustering algorithms to identify a second set of application methods; and generating a list of application methods having abnormal latencies based on at least one of the first set of application methods or the second set of application methods.
B. A method as paragraph A recites, wherein the application method is a Java application method and the application running in the data center is a Java application.
C. A method as paragraph A recites, wherein the data center comprises one or more computing devices, one or more servers, or a cloud computing center.
D. A method as paragraph A recites, wherein the profile data comprises: a TraceID being a unique identifier of each web request; a ThreadID being a unique numeric identifier of each thread running in the data center; a MethodID being a unique identifier for each application method; an inclusive time of each application method, the inclusive time being a total time interval spent by an application method and one or more called application methods associated with the application method; and an exclusive time of each application method, the exclusive time being a time interval spent only by each application method.
E. A method as paragraph D recites, wherein the predetermined set of data excluded from the profile data comprises at least one of: application methods having incomplete call stacks; application standard libraries; or application methods having inclusive times that are less than a threshold time.
F. A method as paragraph E recites, wherein statistically analyzing the application methods included in the remaining profile data to identify the first set of application methods having abnormal latencies comprises: selecting a bottom application method at a bottom of each call stack in order of when a call stack is created; selecting all TraceIDs associated with call stacks having the bottom application method at a bottom of a call stack; grouping the selected TraceIDs into a normal TraceID group and an abnormal TraceID group, the normal TraceID group being a group of normal TraceIds where a normal TraceID has the bottom application method with an expected inclusive time, and the abnormal TraceID group being a group of abnormal TraceIds where an abnormal Trace ID has the bottom application method with an inclusive time outside of a Tukey’s boxplot range; for all application methods in the remaining profile data: calculating a ratio of an inclusive time of an application method from the abnormal TraceID group and an inclusive time of the application method in a same call stack from the normal TraceID group, and identifying a critical MethodID by selecting a top most application method where a corresponding call stack is changing from normal to abnormal; and ranking critical MethodIDs based on a frequency of occurrence in different TraceIds and a deviation of an inclusive time of a corresponding application method from a median inclusive time of the normal TraceID group.
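The grouping of TraceIDs in paragraph F relies on a Tukey's boxplot range. The sketch below assumes the conventional fences at 1.5 times the interquartile range beyond the first and third quartiles; the multiplier, the function names, and the sample times are assumptions, since the paragraph does not state them.

```python
from statistics import quantiles


def tukey_fences(values, k=1.5):
    """Lower and upper fences of a Tukey boxplot (k=1.5 is assumed)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def group_traces(bottom_inclusive_by_trace, k=1.5):
    """Split TraceIDs by whether the inclusive time of the bottom
    application method falls inside the boxplot range."""
    low, high = tukey_fences(list(bottom_inclusive_by_trace.values()), k)
    normal, abnormal = [], []
    for trace_id, t in bottom_inclusive_by_trace.items():
        (normal if low <= t <= high else abnormal).append(trace_id)
    return normal, abnormal


# Inclusive time (ms, assumed unit) of the same bottom method in six traces.
times = {"t1": 100, "t2": 102, "t3": 98, "t4": 101, "t5": 99, "t6": 400}
normal, abnormal = group_traces(times)
```

With these sample values, only the trace whose bottom method took 400 ms falls outside the range and is placed in the abnormal group.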
G. A method as paragraph F recites, wherein analyzing the application methods included in the remaining profile data using clustering algorithms to identify a second set of application methods comprises: selecting an application method based on an occurrence frequency of the application method in a log file of the application; providing profile data associated with the selected application method to a hierarchical clustering with mean linkage clustering algorithm (hierarchical clustering algorithm); based on results from the hierarchical clustering algorithm, identifying, as an outlier, a cluster having a low number of records and a median exclusive time or median inclusive time being higher than a predetermined percentile of an overall population of the selected application methods; providing profile data associated with the selected application method to a Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm; based on results from the DBSCAN algorithm, identifying, as an outlier, a cluster with a low density; and ranking MethodIDs associated with the clusters identified as outliers and associated TraceIDs based on a frequency of occurrence of the associated TraceIDs and a Euclidean distance of an inclusive time and an exclusive time of an application method associated with the identified MethodID.
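The outlier rule applied to the hierarchical clustering results in paragraph G can be sketched as follows. The sketch takes cluster labels as given, from whatever clustering step produced them, and does not implement the clustering itself. The thresholds `max_records` and `percentile` are hypothetical values standing in for "a low number of records" and "a predetermined percentile".

```python
from collections import defaultdict
from statistics import median, quantiles


def outlier_clusters(times, labels, max_records=3, percentile=90):
    """Return labels of clusters that are small and whose median time
    exceeds the given percentile of the overall population."""
    cutoff = quantiles(times, n=100, method="inclusive")[percentile - 1]
    clusters = defaultdict(list)
    for t, label in zip(times, labels):
        clusters[label].append(t)
    return sorted(
        label
        for label, members in clusters.items()
        if len(members) <= max_records and median(members) > cutoff
    )


# Twenty ordinary records in cluster 0 and two slow records in cluster 1.
times = list(range(10, 30)) + [500, 520]
labels = [0] * 20 + [1] * 2
flagged = outlier_clusters(times, labels)
```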
H. A method as paragraph G recites, wherein the Euclidean distance of the inclusive time and the exclusive time of the application method associated with the identified MethodID is: 
Figure PCTCN2017100457-appb-000005
Figure PCTCN2017100457-appb-000006
where distance (tincl, texcl) is the Euclidean distance, tincl is the inclusive time of the application method associated with the identified MethodID, median (tincl) is a median of inclusive times of the application method in an overall population of the application methods, texcl is the exclusive time of the application method  associated with the identified MethodID, and median (texcl) is a median of exclusive times of the application method in an overall population of the application methods.
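The equation images referenced in paragraph H are not reproduced in this text. Judging only from the symbol definitions given with them, the distance appears to be the ordinary two-dimensional Euclidean distance between the point (tincl, texcl) and the corresponding pair of population medians. The form below is a reconstruction under that assumption, not a transcription of the figures.

```latex
\mathrm{distance}(t_{incl}, t_{excl})
  = \sqrt{\bigl(t_{incl} - \mathrm{median}(t_{incl})\bigr)^{2}
        + \bigl(t_{excl} - \mathrm{median}(t_{excl})\bigr)^{2}}
```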
I. One or more non-transitory computer-readable storage media storing computer-readable instructions executable by one or more processors, that when executed by the one or more processors, cause the one or more processors to perform operations comprising: collecting profile data of an application running in a data center; excluding a predetermined set of data from the profile data; statistically analyzing application methods included in remaining profile data to identify a first set of application methods having abnormal latencies; analyzing the application methods included in the remaining profile data using clustering algorithms to identify a second set of application methods; and generating a list of application methods having abnormal latencies based on at least one of the first set of application methods or the second set of application methods.
J. One or more non-transitory computer-readable storage media as paragraph I recites, wherein the application method is a Java application method and the application running in the data center is a Java application.
K. One or more non-transitory computer-readable storage media as paragraph I recites, wherein the data center comprises one or more computing devices, one or more servers, or a cloud computing center.
L. One or more non-transitory computer-readable storage media as paragraph I recites, wherein the profile data comprises: a TraceID being a unique identifier of each web request; a ThreadID being a unique numeric identifier of each thread running in the data center; a MethodID being a unique identifier for each application method; an inclusive time of each application method, the inclusive time being a total time interval spent by an application method and one or more called application methods associated with the application method; and an exclusive time of each application method, the exclusive time being a time interval spent only by each application method.
M. One or more non-transitory computer-readable storage media as paragraph L recites, wherein the predetermined set of data excluded from the profile data comprises at least one of: application methods having incomplete call stacks; application standard libraries; or application methods having inclusive times that are less than a threshold time.
N. One or more non-transitory computer-readable storage media as paragraph M recites, wherein statistically analyzing the application methods included in the remaining profile data to identify the first set of application methods having abnormal latencies comprises: selecting a bottom application method at a bottom of each call stack in order of when a call stack is created; selecting all TraceIDs associated with call stacks having the bottom application method at a bottom of a call stack; grouping the selected TraceIDs into a normal TraceID group and an abnormal TraceID group, the normal TraceID group being a group of normal TraceIds where a normal TraceID has the bottom application  method with an expected inclusive time, and the abnormal TraceID group being a group of abnormal TraceIds where an abnormal Trace ID has the bottom application method with an inclusive time outside of a Tukey’s boxplot range; for all application methods in the remaining profile data: calculating a ratio of an inclusive time of an application method from the abnormal TraceID group and an inclusive time of the application method in a same call stack from the normal TraceID group, and identifying a critical MethodID by selecting a top most application method where a corresponding call stack is changing from normal to abnormal; and ranking critical MethodIDs based on a frequency of occurrence in different TraceIds and a deviation of an inclusive time of a corresponding application method from a median inclusive time of the normal TraceID group.
O. One or more non-transitory computer-readable storage media as paragraph N recites, wherein analyzing the application methods included in the remaining profile data using clustering algorithms to identify a second set of application methods comprises: selecting an application method based on an occurrence frequency of the application method in a log file of the application; providing profile data associated with the selected application method to a hierarchical clustering with mean linkage clustering algorithm (hierarchical clustering algorithm); based on results from the hierarchical clustering algorithm, identifying, as an outlier, a cluster having a low number of records and a median exclusive time or median inclusive time being higher than a predetermined percentile of an overall population of the application methods; providing profile data associated with the selected application method to a Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm; based on results from the DBSCAN algorithm, identifying, as an outlier, a cluster with a low density; and ranking MethodIDs associated with the clusters identified as outliers and associated TraceIDs based on a frequency of occurrence of the associated TraceIDs and a Euclidean distance of an inclusive time and an exclusive time of an application method associated with the identified MethodID.
P. One or more non-transitory computer-readable storage media as paragraph O recites, wherein the Euclidean distance of the inclusive time and the exclusive time of the application method associated with the identified MethodID is: 
Figure PCTCN2017100457-appb-000007
Figure PCTCN2017100457-appb-000008
wherein: distance (tincl, texcl) is the Euclidean distance, tincl is the inclusive time of the application method associated with the identified MethodID, median (tincl) is a median of inclusive times of the application method in an overall population of the application methods, texcl is the exclusive time of the application method associated with the identified MethodID, and median (texcl) is a median of exclusive times of the application method in an overall population of the application methods.
Q. A system for identifying an application method having an abnormal latency, the system comprising: one or more processors; and memory coupled to the one or more processors, the memory including modules communicatively coupled to each other and executable by the one or more processors, the modules comprising: a profile data module configured to collect profile data of an application running in a data center and to exclude, from the profile data, at least one of: application methods having incomplete call stacks, application standard libraries, or application methods having inclusive times that are less than a threshold time; a statistical analysis module configured to statistically analyze application methods included in remaining profile data to identify a first set of application methods having abnormal latencies; a clustering algorithm module configured to analyze the application methods included in the remaining profile data using clustering algorithms to identify a second set of application methods; and a list generator module configured to generate a list of application methods having abnormal latencies based on at least one of the first set of application methods or the second set of application methods.
R. A system as paragraph Q recites, wherein the statistical analysis module is further configured to: select a bottom application method at a bottom of each call stack in order of when a call stack is created; select all TraceIDs associated with call stacks having the bottom application method at a bottom of a call stack; group the selected TraceIDs into a normal TraceID group and an abnormal TraceID group, the normal TraceID group being a group of normal TraceIds where a normal TraceID has the bottom application method with an expected inclusive time, and the abnormal TraceID group being a group of abnormal TraceIds where an abnormal Trace ID has the bottom application method with an inclusive time outside of a Tukey’s boxplot range; for all application methods in the remaining profile data: calculate a ratio of an inclusive time of an application method from the abnormal TraceID group and  an inclusive time of the application method in a same call stack from the normal TraceID group, and identify a critical MethodID by selecting a top most application method where a corresponding call stack is changing from normal to abnormal; and rank critical MethodIDs based on a frequency of occurrence in different TraceIds and a deviation of an inclusive time of a corresponding application method from a median inclusive time of the normal TraceID group.
S. A system as paragraph R recites, wherein the clustering algorithm module is further configured to: select an application method based on an occurrence frequency of the application method in a log file of the application; provide profile data associated with the selected application method to a hierarchical clustering with mean linkage clustering algorithm (hierarchical clustering algorithm); based on results from the hierarchical clustering algorithm, identify, as an outlier, a cluster having a low number of records and a median exclusive time or median inclusive time being higher than a predetermined percentile of an overall population of the application methods; provide profile data associated with the selected application method to a Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm; based on results from the DBSCAN algorithm, identify, as an outlier, a cluster with a low density; and rank MethodIDs associated with the clusters identified as outliers and associated TraceIDs based on a frequency of occurrence of the associated TraceIDs and a Euclidean distance of an inclusive time and an exclusive time of an application method associated with the identified MethodID.
T. A system as paragraph S recites, wherein the Euclidean distance of the inclusive time and the exclusive time of the application method associated with the identified MethodID is: 
Figure PCTCN2017100457-appb-000009
Figure PCTCN2017100457-appb-000010
wherein: distance (tincl, texcl) is the Euclidean distance, tincl is the inclusive time of the application method associated with the identified MethodID, median (tincl) is a median of inclusive times of the application method in an overall population of the application methods, texcl is the exclusive time of the application method associated with the identified MethodID, and median (texcl) is a median of exclusive times of the application method in an overall population of the application methods.
CONCLUSION
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as exemplary forms of implementing the claims.

Claims (21)

  1. A method comprising:
    collecting profile data of an application running in a computing device;
    excluding a predetermined set of data from the profile data; and
    identifying a first set of application methods having abnormal latencies.
  2. A method of claim 1, wherein the application method is a Java application method and the application running in the computing device is a Java application.
  3. A method of claim 1, wherein the computing device comprises one or more computing devices, one or more servers, or a cloud computing center.
  4. A method of claim 1, wherein the profile data comprises:
    a TraceID being a unique identifier of each web request;
    a ThreadID being a unique numeric identifier of each thread running in the computing device;
    a MethodID being a unique identifier for each application method;
    an inclusive time of each application method, the inclusive time being a total time interval spent by an application method and one or more called application methods associated with the application method; and
    an exclusive time of each application method, the exclusive time being a time interval spent only by each application method.
  5. A method of claim 4, wherein the predetermined set of data excluded from the profile data comprises at least one of:
    application methods having incomplete call stacks;
    application standard libraries; or
    application methods having inclusive times that are less than a threshold time.
  6. A method of claim 1, wherein identifying the first set of application methods having abnormal latencies comprises:
    statistically analyzing application methods included in remaining profile data to identify the first set of application methods having abnormal latencies.
  7. A method of claim 6, wherein statistically analyzing the application methods included in the remaining profile data to identify the first set of application methods having abnormal latencies comprises:
    selecting a bottom application method at a bottom of each call stack in order of when a call stack is created;
    selecting all TraceIDs associated with call stacks having the bottom application method at a bottom of a call stack;
    grouping the selected TraceIDs into a normal TraceID group and an abnormal TraceID group, the normal TraceID group being a group of normal TraceIds where a normal TraceID has the bottom application method with an expected inclusive time, and the abnormal TraceID group being a group of abnormal TraceIds where an abnormal Trace ID has the bottom application method with an inclusive time outside of a Tukey’s boxplot range;
    for all application methods in the remaining profile data:
    calculating a ratio of an inclusive time of an application method from the abnormal TraceID group and an inclusive time of the application method in a same call stack from the normal TraceID group, and
    identifying a critical MethodID by selecting a top most application method where a corresponding call stack is changing from normal to abnormal; and
    ranking critical MethodIDs based on a frequency of occurrence in different TraceIds and a deviation of an inclusive time of a corresponding application method from a median inclusive time of the normal TraceID group.
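The ranking step of claim 7 names two criteria but not how they are combined. The sketch below assumes one possible combination: sort first by the number of distinct TraceIDs in which a critical MethodID occurs, then by its largest absolute deviation from the median inclusive time of the normal TraceID group. The function name, the input shapes, and the tie-breaking order are all assumptions.

```python
from statistics import median


def rank_critical_methods(observations, normal_inclusive):
    """observations: (trace_id, method_id, inclusive_ms) tuples taken from
    the abnormal TraceID group.
    normal_inclusive: method_id -> inclusive times in the normal group."""
    traces = {}
    deviation = {}
    for trace_id, method_id, t in observations:
        traces.setdefault(method_id, set()).add(trace_id)
        dev = abs(t - median(normal_inclusive[method_id]))
        deviation[method_id] = max(deviation.get(method_id, 0.0), dev)
    # Most frequent first; larger deviation breaks ties (assumed ordering).
    return sorted(traces, key=lambda m: (-len(traces[m]), -deviation[m]))


observations = [
    ("t1", "A", 300), ("t2", "A", 320),
    ("t1", "B", 900),
    ("t3", "C", 150),
]
normal_inclusive = {"A": [90, 100, 110], "B": [100, 100, 100], "C": [100]}
ranking = rank_critical_methods(observations, normal_inclusive)
```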
  8. A method of claim 6, further comprising:
    analyzing the application methods included in the remaining profile data using clustering algorithms to identify a second set of application methods; and
    generating a list of application methods having abnormal latencies based on at least one of the first set of application methods or the second set of application methods.
  9. A method of claim 8, wherein analyzing the application methods included in the remaining profile data using clustering algorithms to identify a second set of application methods comprises:
    selecting an application method based on an occurrence frequency of the application method in a log file of the application;
    providing profile data associated with the selected application method to a hierarchical clustering with mean linkage clustering algorithm (hierarchical clustering algorithm);
    based on results from the hierarchical clustering algorithm, identifying, as an outlier, a small or sparse cluster having a median exclusive time or median inclusive time higher than a predetermined percentile of an overall population of the selected application methods;
    providing profile data associated with the selected application method to a Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm;
    based on results from the DBSCAN algorithm, identifying, as an outlier, a cluster with a low density; and
    ranking MethodIDs associated with the clusters identified as outliers and associated TraceIDs based on a frequency of occurrence of the associated TraceIDs and a Euclidean distance of an inclusive time and an exclusive time of an application method associated with the identified MethodID.
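For the DBSCAN step of claim 9, a compact pure-Python version of the algorithm over (inclusive time, exclusive time) points is sketched below. Treating the points that DBSCAN labels as noise as the "cluster with a low density" is an interpretation made for this example, and the `eps` and `min_pts` values are hypothetical.

```python
from math import dist


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            more = neighbors(j)
            if len(more) >= min_pts:
                queue.extend(more)  # j is a core point: keep expanding
    return labels


# (inclusive_ms, exclusive_ms) pairs for one method; the last is isolated.
points = [(100, 10), (101, 11), (99, 10), (100, 12), (102, 11), (900, 850)]
labels = dbscan(points, eps=5, min_pts=3)
low_density = [p for p, label in zip(points, labels) if label == -1]
```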
  10. One or more non-transitory computer-readable storage media storing computer-readable instructions executable by one or more processors, that when executed by the one or more processors, cause the one or more processors to perform operations comprising:
    collecting profile data of an application running in a computing device;
    excluding a predetermined set of data from the profile data; and
    identifying a first set of application methods having abnormal latencies.
  11. One or more non-transitory computer-readable storage media of claim 10, wherein the application method is a Java application method and the application running in the computing device is a Java application.
  12. One or more non-transitory computer-readable storage media of claim 10, wherein the computing device comprises one or more computing devices, one or more servers, or a cloud computing center.
  13. One or more non-transitory computer-readable storage media of claim 10, wherein the profile data comprises:
    a TraceID being a unique identifier of each web request;
    a ThreadID being a unique numeric identifier of each thread running in the computing device;
    a MethodID being a unique identifier for each application method;
    an inclusive time of each application method, the inclusive time being a total time interval spent by an application method and one or more called application methods associated with the application method; and
    an exclusive time of each application method, the exclusive time being a time interval spent only by each application method.
  14. One or more non-transitory computer-readable storage media of claim 13, wherein the predetermined set of data excluded from the profile data comprises at least one of:
    application methods having incomplete call stacks;
    application standard libraries; or
    application methods having inclusive times that are less than a threshold time.
  15. One or more non-transitory computer-readable storage media of claim 10, wherein identifying a first set of application methods having abnormal latencies comprises:
    statistically analyzing application methods included in remaining profile data to identify the first set of application methods having abnormal latencies.
  16. One or more non-transitory computer-readable storage media of claim 15, wherein statistically analyzing the application methods included in the remaining profile data to identify the first set of application methods having abnormal latencies comprises:
    selecting a bottom application method at a bottom of each call stack in order of when a call stack is created;
    selecting all TraceIDs associated with call stacks having the bottom application method at a bottom of a call stack;
    grouping the selected TraceIDs into a normal TraceID group and an abnormal TraceID group, the normal TraceID group being a group of normal TraceIds where a normal TraceID has the bottom application method with an expected inclusive time, and the abnormal TraceID group being a group of abnormal TraceIds where an abnormal Trace ID has the bottom application method with an inclusive time outside of a Tukey’s boxplot range;
    for all application methods in the remaining profile data:
    calculating a ratio of an inclusive time of an application method from the abnormal TraceID group and an inclusive time of the application method in a same call stack from the normal TraceID group, and
    identifying a critical MethodID by selecting a top most application method where a corresponding call stack is changing from normal to abnormal; and
    ranking critical MethodIDs based on a frequency of occurrence in different TraceIds and a deviation of an inclusive time of a corresponding application method from a median inclusive time of the normal TraceID group.
  17. One or more non-transitory computer-readable storage media of claim 16, wherein the operations further comprise:
    analyzing the application methods included in the remaining profile data using clustering algorithms to identify a second set of application methods; and
    generating a list of application methods having abnormal latencies based on at least one of the first set of application methods or the second set of application methods.
  18. One or more non-transitory computer-readable storage media of claim 17, wherein analyzing the application methods included in the remaining profile data using clustering algorithms to identify a second set of application methods comprises:
    selecting an application method based on an occurrence frequency of the application method in a log file of the application;
    providing profile data associated with the selected application method to a hierarchical clustering with mean linkage clustering algorithm (hierarchical clustering algorithm);
    based on results from the hierarchical clustering algorithm, identifying, as an outlier, a small or sparse cluster having a median exclusive time or median inclusive time higher than a predetermined percentile of an overall population of the application methods;
    providing profile data associated with the selected application method to a Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm;
    based on results from the DBSCAN algorithm, identifying, as an outlier, a cluster with a low density; and
    ranking MethodIDs associated with the clusters identified as outliers and associated TraceIDs based on a frequency of occurrence of the associated TraceIDs and a Euclidean distance of an inclusive time and an exclusive time of an application method associated with the identified MethodID.
  19. A system for identifying an application method having an abnormal latency, the system comprising:
    one or more processors; and
    memory coupled to the one or more processors, the memory including modules communicatively coupled to each other and executable by the one or more processors, the modules comprising:
    a profile data module configured to collect profile data of an application running in a computing device and to exclude, from the profile data, at least one of:
    application methods having incomplete call stacks,
    application standard libraries, or
    application methods having inclusive times that are less than a threshold time;
    a statistical analysis module configured to statistically analyze application methods included in remaining profile data to identify a first set of application methods having abnormal latencies;
    a clustering algorithm module configured to analyze the application methods included in the remaining profile data using clustering algorithms to identify a second set of application methods; and
    a list generator module configured to generate a list of application methods having abnormal latencies based on at least one of the first set of application methods or the second set of application methods.
  20. A system of claim 19, wherein the statistical analysis module is further configured to:
    select a bottom application method at a bottom of each call stack in order of when a call stack is created;
    select all TraceIDs associated with call stacks having the bottom application method at a bottom of a call stack;
    group the selected TraceIDs into a normal TraceID group and an abnormal TraceID group, the normal TraceID group being a group of normal  TraceIds where a normal TraceID has the bottom application method with an expected inclusive time, and the abnormal TraceID group being a group of abnormal TraceIds where an abnormal Trace ID has the bottom application method with an inclusive time outside of a Tukey’s boxplot range;
    for all application methods in the remaining profile data:
    calculate a ratio of an inclusive time of an application method from the abnormal TraceID group and an inclusive time of the application method in a same call stack from the normal TraceID group, and
    identify a critical MethodID by selecting a top most application method where a corresponding call stack is changing from normal to abnormal; and
    rank critical MethodIDs based on a frequency of occurrence in different TraceIds and a deviation of an inclusive time of a corresponding application method from a median inclusive time of the normal TraceID group.
  21. A system of claim 20, wherein the clustering algorithm module is further configured to:
    select an application method based on an occurrence frequency of the application method in a log file of the application;
    provide profile data associated with the selected application method to a hierarchical clustering with mean linkage clustering algorithm (hierarchical clustering algorithm);
    based on results from the hierarchical clustering algorithm, identify, as an outlier, a small or sparse cluster having a median exclusive time or median inclusive time higher than a predetermined percentile of an overall population of the application methods;
    provide profile data associated with the selected application method to a Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm;
    based on results from the DBSCAN algorithm, identify, as an outlier, a cluster with a low density; and
    rank MethodIDs associated with the clusters identified as outliers and associated TraceIDs based on a frequency of occurrence of the associated TraceIDs and a Euclidean distance of an inclusive time and an exclusive time of an application method associated with the identified MethodID.
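The final ranking recited in claim 21 can be sketched as below. The sketch assumes the Euclidean distance is measured from the medians of the overall population of inclusive and exclusive times, and that frequency of the associated TraceIDs is the primary sort key with distance as the tie-breaker. Both are assumptions, as are the function names and the tuple layout.

```python
from math import hypot
from statistics import median


def distance_from_medians(t_incl, t_excl, population_incl, population_excl):
    """Assumed form: 2-D Euclidean distance from the population medians."""
    return hypot(t_incl - median(population_incl),
                 t_excl - median(population_excl))


def rank_outliers(outliers, population_incl, population_excl):
    """outliers: (method_id, trace_ids, t_incl, t_excl) tuples."""
    def key(outlier):
        _, trace_ids, t_incl, t_excl = outlier
        d = distance_from_medians(t_incl, t_excl,
                                  population_incl, population_excl)
        return (-len(set(trace_ids)), -d)
    return [outlier[0] for outlier in sorted(outliers, key=key)]


population_incl = [10, 10, 10]
population_excl = [5, 5, 5]
outliers = [
    ("A", ["t1"], 13, 9),
    ("B", ["t1", "t2"], 11, 5),
    ("C", ["t3"], 40, 30),
]
ranking = rank_outliers(outliers, population_incl, population_excl)
```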

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/100457 WO2019046996A1 (en) 2017-09-05 2017-09-05 Java software latency anomaly detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/100457 WO2019046996A1 (en) 2017-09-05 2017-09-05 Java software latency anomaly detection

Publications (1)

Publication Number Publication Date
WO2019046996A1 true WO2019046996A1 (en) 2019-03-14

Family

ID=65633414

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/100457 WO2019046996A1 (en) 2017-09-05 2017-09-05 Java software latency anomaly detection

Country Status (1)

Country Link
WO (1) WO2019046996A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5872976A (en) * 1997-04-01 1999-02-16 Landmark Systems Corporation Client-based system for monitoring the performance of application programs
US7921410B1 (en) * 2007-04-09 2011-04-05 Hewlett-Packard Development Company, L.P. Analyzing and application or service latency

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110069354A (en) * 2019-04-15 2019-07-30 必成汇(成都)科技有限公司 The full link trace method of micro services and micro services framework
CN111522900A (en) * 2020-03-18 2020-08-11 携程计算机技术(上海)有限公司 Method, system, device and storage medium for automatically analyzing unstructured data
CN111522900B (en) * 2020-03-18 2023-09-01 携程计算机技术(上海)有限公司 Automatic analysis method, system, equipment and storage medium for unstructured data
EP3929782A1 (en) * 2020-06-26 2021-12-29 Acronis International GmbH Systems and methods for detecting behavioral anomalies in applications

CN116149926A (en) Abnormality monitoring method, device, equipment and storage medium for business index
US20170371651A1 (en) Automatically establishing significance of static analysis results

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 17924354; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 17924354; Country of ref document: EP; Kind code of ref document: A1)